Options

Bases: FileDFWriteOptions, GenericOptions

Options for :obj:`FileDFWriter <onetl.file.file_df_writer.file_df_writer.FileDFWriter>`.

.. versionadded:: 0.9.0

Examples:

.. note::

    You can pass any value `supported by Spark <https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html>`_,
    even if it is not mentioned in this documentation. **Option names should be in** ``camelCase``!

    The set of supported options depends on the Spark version.

.. code:: python

    from onetl.file import FileDFWriter

    options = FileDFWriter.Options(
        if_exists="replace_overlapping_partitions",
        partitionBy="month",
    )

if_exists = FileDFExistBehavior.APPEND

Behavior for existing target directory.

If the target directory does not exist, it will be created. If it does exist, the behavior depends on the chosen value.

.. versionchanged:: 0.13.0

Default value was changed from ``error`` to ``append``

Possible values:

  • ``error``: if the folder already exists, raise an exception.

    Same as Spark's ``df.write.mode("error").save()``.

  • ``skip_entire_directory``: if the folder already exists, leave existing files intact and stop immediately without any errors.

    Same as Spark's ``df.write.mode("ignore").save()``.

  • ``append`` (default): appends data to the existing directory.

    .. dropdown:: Behavior in details

    * Directory does not exist
        Directory is created using all the provided options (``format``, ``partition_by``, etc).
    
* Directory exists, does not contain partitions, but :obj:`~partition_by` is set
    Data is appended to the directory, but into a partitioned directory structure.

    .. warning::

        Existing files are still present in the root of the directory,
        but Spark will ignore them while reading,
        unless ``recursive=True`` is used.

* Directory exists and contains partitions, but :obj:`~partition_by` is not set
    Data is appended to the root of the directory instead of nested partition directories.
    
        .. warning::
    
Spark will ignore such files while reading, unless ``recursive=True`` is used.
    
* Directory exists and contains partitions,
  but with a different partitioning schema than :obj:`~partition_by`
    Data is appended to the directory using the new partitioning schema.
    
        .. warning::
    
Spark cannot read a directory with multiple partitioning schemas,
unless ``recursive=True`` is used to disable partition scanning.
    
* Directory exists and partitioned according to :obj:`~partition_by`,
  but the partition is present only in the dataframe
    A new partition directory is created.
    
* Directory exists and partitioned according to :obj:`~partition_by`,
  and the partition is present in both the dataframe and the directory
    New files are added to the existing partition directory; existing files are still present.
    
* Directory exists and partitioned according to :obj:`~partition_by`,
  but the partition is present only in the directory, not the dataframe
    The existing partition is left intact.
    
  • ``replace_overlapping_partitions``: if partitions from the dataframe already exist in the directory structure, they will be overwritten.

    Same as Spark's ``df.write.option("partitionOverwriteMode", "dynamic").mode("overwrite").save()``.

    .. danger::

        This mode makes sense **ONLY** if the directory is partitioned.
        **IF NOT, YOU'LL LOSE YOUR DATA!**
    

    .. dropdown:: Behavior in details

    * Directory does not exist
        Directory is created using all the provided options
        (``format``, ``partition_by``, etc).
    
* Directory exists, does not contain partitions, but :obj:`~partition_by` is set
    The directory **will be deleted** and re-created with partitions.
    
* Directory exists and contains partitions, but :obj:`~partition_by` is not set
    The directory **will be deleted** and re-created with partitions.
    
* Directory exists and contains partitions,
  but with a different partitioning schema than :obj:`~partition_by`
    Data is appended to the directory using the new partitioning schema.
    
        .. warning::
    
Spark cannot read a directory with multiple partitioning schemas,
unless ``recursive=True`` is used to disable partition scanning.
    
* Directory exists and partitioned according to :obj:`~partition_by`,
  but the partition is present only in the dataframe
    A new partition directory is created.
    
* Directory exists and partitioned according to :obj:`~partition_by`,
  and the partition is present in both the dataframe and the directory
    The partition directory **will be deleted**,
    and a new one is created with files containing data from the dataframe.
    
* Directory exists and partitioned according to :obj:`~partition_by`,
  but the partition is present only in the directory, not the dataframe
    The existing partition is left intact.
    
  • ``replace_entire_directory``: removes the existing directory and creates a new one, overwriting all existing data. All existing partitions are dropped.

    Same as Spark's ``df.write.option("partitionOverwriteMode", "static").mode("overwrite").save()``.

.. note::

    Unlike in pure Spark, the config option ``spark.sql.sources.partitionOverwriteMode``
    does not affect the behavior of any ``mode``.
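The ``append`` and ``replace_overlapping_partitions`` semantics described above can be sketched in plain Python by simulating the partitioned directory layout with ``pathlib``. This is an illustration only; ``write_partitions`` is a hypothetical helper, not part of onetl, and the real behavior is delegated to Spark's ``DataFrameWriter``:

```python
# Illustration of "append" vs "replace_overlapping_partitions" on a local
# directory tree (hypothetical helper, NOT onetl's implementation).
import shutil
import tempfile
from pathlib import Path


def write_partitions(target: Path, partitions: dict, if_exists: str) -> None:
    """Create files inside partition subdirectories like ``month=2024-01/part-0.csv``."""
    for partition, files in partitions.items():
        part_dir = target / partition
        if if_exists == "replace_overlapping_partitions" and part_dir.exists():
            # only partitions present in the "dataframe" are dropped,
            # others are left intact (Spark's dynamic partitionOverwriteMode)
            shutil.rmtree(part_dir)
        part_dir.mkdir(parents=True, exist_ok=True)
        for name in files:
            (part_dir / name).touch()


root = Path(tempfile.mkdtemp())
write_partitions(root, {"month=2024-01": ["part-0.csv"]}, if_exists="append")

# "append": new files are added, existing files are kept
write_partitions(root, {"month=2024-01": ["part-1.csv"]}, if_exists="append")
assert sorted(p.name for p in (root / "month=2024-01").iterdir()) == [
    "part-0.csv",
    "part-1.csv",
]

# "replace_overlapping_partitions": the overlapping partition is rewritten
write_partitions(root, {"month=2024-01": ["part-2.csv"]}, if_exists="replace_overlapping_partitions")
assert [p.name for p in (root / "month=2024-01").iterdir()] == ["part-2.csv"]
```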

partition_by = Field(default=None, alias='partitionBy')

List of columns that should be used for data partitioning. ``None`` means partitioning is disabled.

Each partition is a folder which contains only files with the specific column value, like ``some.csv/col1=value1``, ``some.csv/col1=value2``, and so on.

Multiple partition columns mean a nested folder structure, like ``some.csv/col1=val1/col2=val2``.

If the ``WHERE`` clause of a query contains an expression like ``partition = value``, Spark will scan only the files in that specific partition.

Examples: ``reg_id`` or ``["reg_id", "business_dt"]``.

.. note::

    Values should be scalars (integers, strings),
    and either static (``countryId``) or incrementing (dates, years), with a low
    number of distinct values.

    Columns like ``userId`` or ``datetime``/``timestamp`` should **NOT** be used for partitioning.
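The nested ``col=value`` layout described above can be illustrated with a small helper (``partition_path`` is a hypothetical name for illustration, not part of onetl's API):

```python
# Hypothetical helper showing how partition columns map to nested directories:
# each ``col=value`` pair becomes one directory level under the target path.
from pathlib import PurePosixPath


def partition_path(root: str, partition_by: list, row: dict) -> PurePosixPath:
    """Build the partition directory path for one row."""
    parts = [f"{col}={row[col]}" for col in partition_by]
    return PurePosixPath(root, *parts)


path = partition_path(
    "some.csv",
    ["reg_id", "business_dt"],
    {"reg_id": 77, "business_dt": "2024-01-01"},
)
assert str(path) == "some.csv/reg_id=77/business_dt=2024-01-01"
```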

apply_to_writer(writer)

Apply provided format to :obj:`pyspark.sql.DataFrameWriter`. |support_hooks|

Returns:

  • :obj:`pyspark.sql.DataFrameWriter` with options applied
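Conceptually, applying options amounts to chaining ``option(key, value)`` calls on the writer. A rough sketch, with a stub standing in for :obj:`pyspark.sql.DataFrameWriter` (``StubWriter`` and ``apply_options`` are illustrative names, not onetl's actual implementation):

```python
# Sketch of option application using a stub in place of a real
# pyspark.sql.DataFrameWriter (illustrative only).
class StubWriter:
    """Mimics the chained ``option(key, value)`` interface of DataFrameWriter."""

    def __init__(self):
        self.options = {}

    def option(self, key, value):
        self.options[key] = value
        return self  # DataFrameWriter returns itself, allowing call chaining


def apply_options(writer, options: dict):
    """Apply each key/value pair to the writer and return the resulting writer."""
    for key, value in options.items():
        writer = writer.option(key, value)
    return writer


writer = apply_options(StubWriter(), {"partitionBy": "month", "compression": "gzip"})
assert writer.options == {"partitionBy": "month", "compression": "gzip"}
```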