ORC
Bases: ReadWriteFileFormat
ORC file format (columnar). |support_hooks|
Based on the Spark `ORC Files <https://spark.apache.org/docs/latest/sql-data-sources-orc.html>`_ file format.
Supports reading/writing files with .orc extension.
.. versionadded:: 0.9.0
Examples:
.. note::

    You can pass any option mentioned in the
    `official documentation <https://spark.apache.org/docs/latest/sql-data-sources-orc.html>`_.

    **Option names should be in** ``camelCase``!

    The set of supported options depends on the Spark version.

    You may also set options mentioned in the
    `orc-java documentation <https://orc.apache.org/docs/core-java-config.html>`_.
    They are prefixed with ``orc.`` and contain dots in their names,
    so instead of calling the constructor ``ORC(orc.option=True)`` (invalid in Python)
    you should call the method ``ORC.parse({"orc.option": True})``.
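The restriction above follows from Python syntax: keyword-argument names must be valid identifiers, so dotted option names cannot be passed to the constructor. A minimal check (plain Python, independent of onetl) illustrates why:

```python
def can_be_kwarg(name: str) -> bool:
    # Python keyword arguments must be valid identifiers,
    # so dotted orc-java option names are rejected by the parser
    return name.isidentifier()


# camelCase Spark options work as constructor arguments
print(can_be_kwarg("mergeSchema"))  # True

# dotted orc-java options do not, hence ORC.parse({...})
print(can_be_kwarg("orc.bloom.filter.fpp"))  # False
```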
.. tabs::

    .. code-tab:: py Reading files

        from onetl.file.format import ORC

        orc = ORC(mergeSchema=True)

    .. tab:: Writing files

        .. code:: python

            from onetl.file.format import ORC

            orc = ORC.parse(
                {
                    "compression": "snappy",
                    # Enable Bloom filter for columns 'id' and 'name'
                    "orc.bloom.filter.columns": "id,name",
                    # Set Bloom filter false positive probability
                    "orc.bloom.filter.fpp": 0.01,
                    # Do not use dictionary encoding for 'highly_selective_column'
                    "orc.column.encoding.direct": "highly_selective_column",
                    # other options
                }
            )
mergeSchema = None
Merge schemas of all ORC files being read into a single schema.
By default, the value of the Spark config option ``spark.sql.orc.mergeSchema`` is used (``False``).
.. note::

    Used only for reading files.
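For intuition, schema merging can be sketched as taking the union of per-file column sets. This is a conceptual illustration only, using plain dictionaries rather than Spark's actual type-reconciliation logic:

```python
# Hypothetical per-file schemas: column name -> type name
file1_schema = {"id": "bigint", "name": "string"}
file2_schema = {"id": "bigint", "created_at": "timestamp"}

# With mergeSchema=True the resulting schema covers columns from all files;
# with mergeSchema=False Spark picks the schema of one file instead
merged_schema = {**file1_schema, **file2_schema}
print(sorted(merged_schema))  # ['created_at', 'id', 'name']
```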
compression = None
Compression codec of the ORC files.
By default, the value of the Spark config option ``spark.sql.orc.compression.codec`` is used (``snappy``).
.. note::

    Used only for writing files.
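Columnar files like ORC compress well because each column stores many similar values together. A rough illustration with Python's built-in ``zlib`` (not the ``snappy`` codec itself, which is a separate library):

```python
import zlib

# A column of repeated values, as often found in an ORC stripe
column = b"snappy" * 1_000

compressed = zlib.compress(column)
# Highly repetitive columnar data shrinks dramatically
print(len(column), len(compressed))
```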