
ORC
===

Bases: ``ReadWriteFileFormat``

ORC file format (columnar). |support_hooks|

Based on the `Spark ORC Files <https://spark.apache.org/docs/latest/sql-data-sources-orc.html>`_ file format.

Supports reading/writing files with .orc extension.

.. versionadded:: 0.9.0

Examples:

.. note::

    You can pass any option mentioned in the
    `official documentation <https://spark.apache.org/docs/latest/sql-data-sources-orc.html>`_.
    **Option names should be in** ``camelCase``!

    The set of supported options depends on the Spark version.

    You may also set options mentioned in the
    `orc-java documentation <https://orc.apache.org/docs/core-java-config.html>`_.
    They are prefixed with ``orc.`` and contain dots in their names,
    so instead of calling the constructor as ``ORC(orc.option=True)`` (invalid Python syntax)
    you should call the method ``ORC.parse({"orc.option": True})``.

.. tabs::

    .. code-tab:: py Reading files

        from onetl.file.format import ORC

        orc = ORC(mergeSchema=True)

    .. code-tab:: py Writing files

        from onetl.file.format import ORC

        orc = ORC.parse(
            {
                "compression": "snappy",
                # Enable Bloom filter for columns 'id' and 'name'
                "orc.bloom.filter.columns": "id,name",
                # Set Bloom filter false positive probability
                "orc.bloom.filter.fpp": 0.01,
                # Do not use dictionary encoding for 'highly_selective_column'
                "orc.column.encoding.direct": "highly_selective_column",
                # other options
            }
        )
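To illustrate why the dotted ``orc.*`` options must go through ``ORC.parse`` rather than the constructor, here is a minimal plain-Python sketch (not onETL code; ``split_options`` is a hypothetical helper): Python keyword arguments cannot contain dots, so dotted option names can only be delivered through a mapping.

.. code:: python

    # Hypothetical helper: separate constructor-friendly option names
    # from dotted orc.* names, which must be passed via ORC.parse({...}).
    def split_options(options: dict) -> tuple[dict, dict]:
        """Split options into plain keys and dotted orc-java keys."""
        plain = {k: v for k, v in options.items() if "." not in k}
        dotted = {k: v for k, v in options.items() if "." in k}
        return plain, dotted


    plain, dotted = split_options(
        {
            "compression": "snappy",  # valid as ORC(compression="snappy")
            "orc.bloom.filter.columns": "id,name",  # needs ORC.parse({...})
        }
    )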

``mergeSchema = None`` *(class attribute)*

Merge the schemas of all ORC files being read into a single schema. By default, the value of the Spark config option ``spark.sql.orc.mergeSchema`` is used (``False``).

.. note::

    Used only for reading files.

``compression = None`` *(class attribute)*

Compression codec of the ORC files. By default, the value of the Spark config option ``spark.sql.orc.compression.codec`` is used (``snappy``).

.. note::

    Used only for writing files.