Parquet

Bases: ReadWriteFileFormat

Parquet file format (columnar).

Based on the `Spark Parquet Files <https://spark.apache.org/docs/latest/sql-data-sources-parquet.html>`_ file format.

Supports reading/writing files with the ``.parquet`` extension.

.. versionadded:: 0.9.0

Examples:

.. note::

    You can pass any option mentioned in the
    `official documentation <https://spark.apache.org/docs/latest/sql-data-sources-parquet.html>`_.
    **Option names should be in** ``camelCase``!

    The set of supported options depends on the Spark version.

    You may also set options mentioned in the
    `parquet-hadoop documentation <https://github.com/apache/parquet-java/blob/master/parquet-hadoop/README.md>`_.
    These option names are prefixed with ``parquet.`` and contain dots, so they cannot be passed
    as constructor keyword arguments: ``Parquet(parquet.option=True)`` is invalid Python.
    Pass them via ``Parquet.parse({"parquet.option": True})`` instead.

.. tabs::

    .. code-tab:: py Reading files

        from onetl.file.format import Parquet

        parquet = Parquet(mergeSchema=True)

    .. code-tab:: py Writing files

        from onetl.file.format import Parquet

        parquet = Parquet.parse(
            {
                "compression": "snappy",
                # Enable Bloom filter for columns 'id' and 'name'
                "parquet.bloom.filter.enabled#id": True,
                "parquet.bloom.filter.enabled#name": True,
                # Set expected number of distinct values for column 'id'
                "parquet.bloom.filter.expected.ndv#id": 10_000_000,
                # other options
            }
        )
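Once constructed, a ``Parquet`` instance is passed to a file reader or writer as the ``format`` argument. A minimal sketch, assuming onetl's ``FileDFReader`` with a ``SparkLocalFS`` connection and a hypothetical ``/data/events`` directory:

.. code-block:: python

    from pyspark.sql import SparkSession

    from onetl.connection import SparkLocalFS
    from onetl.file import FileDFReader
    from onetl.file.format import Parquet

    spark = SparkSession.builder.appName("parquet-example").getOrCreate()

    # SparkLocalFS reads files from the local filesystem of the Spark driver
    local_fs = SparkLocalFS(spark=spark)

    reader = FileDFReader(
        connection=local_fs,
        format=Parquet(mergeSchema=True),
        source_path="/data/events",  # hypothetical directory with .parquet files
    )
    df = reader.run()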

mergeSchema = None

Merge schemas of all Parquet files being read into a single schema. If not set, the value of the Spark config option ``spark.sql.parquet.mergeSchema`` is used (``false`` by default).

.. note::

    Used only for reading files.
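For reference, the option maps directly onto Spark's own Parquet reader. A plain PySpark sketch, with a hypothetical input path:

.. code-block:: python

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # mergeSchema=True makes Spark union the schemas of all files being read
    # instead of picking the schema of a single file.
    df = spark.read.option("mergeSchema", "true").parquet("/data/events")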

compression = None

Compression codec for the Parquet files being written. If not set, the value of the Spark config option ``spark.sql.parquet.compression.codec`` is used (``snappy`` by default).

.. note::

    Used only for writing files.
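Likewise, ``compression`` is forwarded to Spark's Parquet writer. A plain PySpark sketch, with a toy DataFrame and a hypothetical output path:

.. code-block:: python

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(1_000)  # toy DataFrame for illustration

    # 'snappy' is also Spark's default codec for Parquet
    df.write.option("compression", "snappy").parquet("/data/output")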