Parquet
=======
Bases: ``ReadWriteFileFormat``

Parquet file format (columnar). |support_hooks|

Based on the Spark `Parquet Files <https://spark.apache.org/docs/latest/sql-data-sources-parquet.html>`_ file format.

Supports reading/writing files with the ``.parquet`` extension.
.. versionadded:: 0.9.0
Examples:
.. note::

    You can pass any option mentioned in the
    `official documentation <https://spark.apache.org/docs/latest/sql-data-sources-parquet.html>`_.

    **Option names should be in** ``camelCase``!

    The set of supported options depends on the Spark version.

    You can also set options mentioned in the
    `parquet-hadoop documentation <https://github.com/apache/parquet-java/blob/master/parquet-hadoop/README.md>`_.
    These option names are prefixed with ``parquet.`` and contain dots,
    so instead of calling the constructor as ``Parquet(parquet.option=True)`` (invalid Python syntax)
    you should call ``Parquet.parse({"parquet.option": True})``.
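A quick way to see why such dotted option names must go through ``Parquet.parse()`` rather than constructor keyword arguments is Python's own identifier check. This is a minimal standalone sketch, independent of onETL:

```python
# Dotted Parquet option names are valid dict keys, but they are not valid
# Python identifiers, so they cannot be used as keyword arguments.
print("mergeSchema".isidentifier())  # True: usable as a keyword argument
print("parquet.bloom.filter.enabled#id".isidentifier())  # False: dict key only

# As a dict key there is no problem:
options = {"parquet.bloom.filter.enabled#id": True}
print(options["parquet.bloom.filter.enabled#id"])  # True
```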
.. tabs::

    .. code-tab:: py Reading files

        from onetl.file.format import Parquet

        parquet = Parquet(mergeSchema=True)

    .. code-tab:: py Writing files

        from onetl.file.format import Parquet

        parquet = Parquet.parse(
            {
                "compression": "snappy",
                # Enable Bloom filter for columns 'id' and 'name'
                "parquet.bloom.filter.enabled#id": True,
                "parquet.bloom.filter.enabled#name": True,
                # Set expected number of distinct values for column 'id'
                "parquet.bloom.filter.expected.ndv#id": 10_000_000,
                # other options
            }
        )
``mergeSchema = None``
Merge schemas of all Parquet files being read into a single schema.

By default, the value of the Spark config option ``spark.sql.parquet.mergeSchema`` is used (``false``).

.. note::

    Used only for reading files.
``compression = None``
Compression codec of the Parquet files.

By default, the value of the Spark config option ``spark.sql.parquet.compression.codec`` is used (``snappy``).

.. note::

    Used only for writing files.