Options
Bases: FileDFWriteOptions, GenericOptions
Options for :obj:`FileDFWriter <onetl.file.file_df_writer.file_df_writer.FileDFWriter>`.
.. versionadded:: 0.9.0
Examples:
.. note::
You can pass any value `supported by Spark <https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html>`_,
even if it is not mentioned in this documentation. **Option names should be in** ``camelCase``!
The set of supported options depends on the Spark version.
.. code:: python
    from onetl.file import FileDFWriter

    options = FileDFWriter.Options(
        if_exists="replace_overlapping_partitions",
        partitionBy="month",
    )
if_exists = FileDFExistBehavior.APPEND
Behavior for an existing target directory.

If the target directory does not exist, it will be created. If it does exist, the behavior depends on the value.
.. versionchanged:: 0.13.0
Default value was changed from ``error`` to ``append``.
Possible values:

- ``error``
    If the folder already exists, raise an exception.

    Same as Spark's ``df.write.mode("error").save()``.

- ``skip_entire_directory``
    If the folder already exists, leave existing files intact and stop immediately without any errors.

    Same as Spark's ``df.write.mode("ignore").save()``.

- ``append`` (default)
    Appends data into the existing directory.

    .. dropdown:: Behavior in details

        * Directory does not exist
            Directory is created using all the provided options (``format``, ``partition_by``, etc).

        * Directory exists, does not contain partitions, but :obj:`~partition_by` is set
            Data is appended to the directory, but in a partitioned directory structure.

            .. warning::

                Existing files are still present in the root of the directory, but Spark will ignore those files while reading, unless ``recursive=True`` is used.

        * Directory exists and contains partitions, but :obj:`~partition_by` is not set
            Data is appended to the directory, but to the root of the directory instead of nested partition directories.

            .. warning::

                Spark will ignore such files while reading, unless ``recursive=True`` is used.

        * Directory exists and contains partitions, but with a different partitioning schema than :obj:`~partition_by`
            Data is appended to the directory with the new partitioning schema.

            .. warning::

                Spark cannot read a directory with multiple partitioning schemas, unless ``recursive=True`` is used to disable partition scanning.

        * Directory exists and is partitioned according to :obj:`~partition_by`, but the partition is present only in the dataframe
            A new partition directory is created.

        * Directory exists and is partitioned according to :obj:`~partition_by`, and the partition is present in both the dataframe and the directory
            New files are added to the existing partition directory; existing files are still present.

        * Directory exists and is partitioned according to :obj:`~partition_by`, but the partition is present only in the directory, not the dataframe
            The existing partition is left intact.

- ``replace_overlapping_partitions``
    If partitions from the dataframe already exist in the directory structure, they will be overwritten.

    Same as Spark's ``df.write.option("partitionOverwriteMode", "dynamic").mode("overwrite").save()``.

    .. danger::

        This mode makes sense **ONLY** if the directory is partitioned. **IF NOT, YOU'LL LOSE YOUR DATA!**

    .. dropdown:: Behavior in details

        * Directory does not exist
            Directory is created using all the provided options (``format``, ``partition_by``, etc).

        * Directory exists, does not contain partitions, but :obj:`~partition_by` is set
            Directory **will be deleted**, and will be created with partitions.

        * Directory exists and contains partitions, but :obj:`~partition_by` is not set
            Directory **will be deleted**, and will be created with partitions.

        * Directory exists and contains partitions, but with a different partitioning schema than :obj:`~partition_by`
            Data is appended to the directory with the new partitioning schema.

            .. warning::

                Spark cannot read a directory with multiple partitioning schemas, unless ``recursive=True`` is used to disable partition scanning.

        * Directory exists and is partitioned according to :obj:`~partition_by`, but the partition is present only in the dataframe
            A new partition directory is created.

        * Directory exists and is partitioned according to :obj:`~partition_by`, and the partition is present in both the dataframe and the directory
            The partition directory **will be deleted**, and a new one is created with files containing data from the dataframe.

        * Directory exists and is partitioned according to :obj:`~partition_by`, but the partition is present only in the directory, not the dataframe
            The existing partition is left intact.

- ``replace_entire_directory``
    Remove the existing directory and create a new one, overwriting all existing data. All existing partitions are dropped.

    Same as Spark's ``df.write.option("partitionOverwriteMode", "static").mode("overwrite").save()``.
.. note::
Unlike in pure Spark, the config option ``spark.sql.sources.partitionOverwriteMode``
does not affect the behavior of any ``mode``.
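For reference, a rough sketch of the plain Spark counterparts of each value, taken from the list above. The dataframe ``df`` and the output path are hypothetical placeholders, and the ``append`` line is only an approximation of onetl's behavior:

.. code:: python

    # error: raise if the directory already exists
    df.write.mode("error").save("/data/target")

    # skip_entire_directory: keep existing files, write nothing
    df.write.mode("ignore").save("/data/target")

    # append (default): roughly Spark's append mode
    df.write.mode("append").save("/data/target")

    # replace_overlapping_partitions: overwrite only partitions present in df
    df.write.option("partitionOverwriteMode", "dynamic").mode("overwrite").save(
        "/data/target"
    )

    # replace_entire_directory: drop the directory and all partitions, then recreate
    df.write.option("partitionOverwriteMode", "static").mode("overwrite").save(
        "/data/target"
    )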
partition_by = Field(default=None, alias='partitionBy')
List of columns to be used for data partitioning. ``None`` means partitioning is disabled.

Each partition is a folder which contains only files with a specific column value,
like ``some.csv/col1=value1``, ``some.csv/col1=value2``, and so on.

Multiple partition columns result in a nested folder structure, like ``some.csv/col1=val1/col2=val2``.

If the ``WHERE`` clause of a query contains an expression like ``partition = value``,
Spark will scan only files in that specific partition.

Examples: ``reg_id`` or ``["reg_id", "business_dt"]``
.. note::
Values should be scalars (integers, strings),
either static (``countryId``) or incrementing (dates, years), with a low
number of distinct values.

Columns like ``userId`` or ``datetime``/``timestamp`` should **NOT** be used for partitioning.
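A minimal sketch of partitioned output; the column names follow the examples above, and the resulting directory layout in the comment is illustrative:

.. code:: python

    from onetl.file import FileDFWriter

    # Partition output by two columns; each combination of values becomes
    # a nested directory, e.g. target/reg_id=1/business_dt=2024-01-01/part-*.csv
    options = FileDFWriter.Options(partitionBy=["reg_id", "business_dt"])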
apply_to_writer(writer)
Apply provided options to :obj:`pyspark.sql.DataFrameWriter`. |support_hooks|
Returns:

    :obj:`pyspark.sql.DataFrameWriter` with options applied.
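A short usage sketch, assuming an existing dataframe ``df``; the ``csv`` format and the output path are hypothetical placeholders:

.. code:: python

    from onetl.file import FileDFWriter

    options = FileDFWriter.Options(partitionBy="month")

    # apply the options to a raw Spark writer, then save as usual
    writer = options.apply_to_writer(df.write.format("csv"))
    writer.save("/data/target")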