FileDF Reader

Bases: FrozenModel

Reads files from a source path using the specified file connection and parameters, and returns a Spark DataFrame. |support_hooks|

.. warning::

This class does **not** support read strategies.

.. versionadded:: 0.9.0

Parameters:

  • connection (:obj:`BaseFileDFConnection <onetl.base.base_file_df_connection.BaseFileDFConnection>`) –

    File DataFrame connection. See the :ref:`file-df-connections` section.

  • format (:obj:`BaseReadableFileFormat <onetl.base.base_file_format.BaseReadableFileFormat>`) –

    File format to read.

  • source_path (PathLike | str, default: `None` ) –

    Directory path to read data from.

    Can be None, but only if file paths are passed directly to the :obj:`~run` method.

  • df_schema (:obj:`pyspark.sql.types.StructType`, default: `None` ) –

    Spark DataFrame schema.

  • options (:obj:`FileDFReaderOptions <onetl.file.file_df_reader.options.FileDFReaderOptions>`) –

    Common reading options.

Examples:

.. tabs::

    .. code-tab:: py Read CSV files from local filesystem

        from onetl.connection import SparkLocalFS
        from onetl.file import FileDFReader
        from onetl.file.format import CSV

        csv = CSV(delimiter=",")
        local_fs = SparkLocalFS(spark=spark)

        reader = FileDFReader(
            connection=local_fs,
            format=csv,
            source_path="/path/to/directory",
        )

    .. code-tab:: py All supported options

        from onetl.connection import SparkLocalFS
        from onetl.file import FileDFReader
        from onetl.file.format import CSV

        csv = CSV(delimiter=",")
        local_fs = SparkLocalFS(spark=spark)

        reader = FileDFReader(
            connection=local_fs,
            format=csv,
            source_path="/path/to/directory",
            options=FileDFReader.Options(recursive=False),
        )

run(files=None)

Reads files and returns them as a Spark DataFrame. |support_hooks|

.. versionadded:: 0.9.0

Parameters:

  • files (Iterator[str | PathLike] | None, default: `None` ) –

    List of files to read.

    If not set, all files from source_path are read.

Returns:

  • df ( :obj:`pyspark.sql.DataFrame` ) –

    Spark DataFrame

Examples:

Read CSV files from directory /path:

.. code:: python

    from onetl.connection import SparkLocalFS
    from onetl.file import FileDFReader
    from onetl.file.format import CSV

    csv = CSV(delimiter=",")
    local_fs = SparkLocalFS(spark=spark)

    reader = FileDFReader(
        connection=local_fs,
        format=csv,
        source_path="/path",
    )
    df = reader.run()

Read specific CSV files by passing their paths explicitly:

.. code:: python

    from onetl.connection import SparkLocalFS
    from onetl.file import FileDFReader
    from onetl.file.format import CSV

    csv = CSV(delimiter=",")
    local_fs = SparkLocalFS(spark=spark)

    reader = FileDFReader(
        connection=local_fs,
        format=csv,
    )

    df = reader.run(
        [
            "/path/file1.csv",
            "/path/nested/file2.csv",
        ]
    )

Read only specific CSV files in directory:

.. code:: python

    from onetl.connection import SparkLocalFS
    from onetl.file import FileDFReader
    from onetl.file.format import CSV

    csv = CSV(delimiter=",")
    local_fs = SparkLocalFS(spark=spark)

    reader = FileDFReader(
        connection=local_fs,
        format=csv,
        source_path="/path",
    )

    df = reader.run(
        [
            # file paths can be absolute or relative to source_path,
            # but must be located within it
            "/path/file1.csv",
            "/path/nested/file2.csv",
        ]
    )