FileDF Reader
Bases: FrozenModel
Allows you to read files from a source path with a specified file connection and parameters, and return a Spark DataFrame. |support_hooks|
.. warning::

    This class does **not** support read strategies.
.. versionadded:: 0.9.0
Parameters:

- **connection** (:obj:`BaseFileDFConnection <onetl.base.base_file_df_connection.BaseFileDFConnection>`) – File DataFrame connection. See the :ref:`file-df-connections` section.
- **format** (:obj:`BaseReadableFileFormat <onetl.base.base_file_format.BaseReadableFileFormat>`) – File format to read.
- **source_path** (``PathLike | str``, default: ``None``) – Directory path to read data from. Could be ``None``, but only if you pass file paths directly to the :obj:`~run` method.
- **df_schema** (:obj:`pyspark.sql.types.StructType`, default: ``None``) – Spark DataFrame schema.
- **options** (:obj:`FileDFReaderOptions <onetl.file.file_df_reader.options.FileDFReaderOptions>`) – Common reading options.
Examples:

.. tabs::

    .. code-tab:: py Read CSV files from local filesystem

        from onetl.connection import SparkLocalFS
        from onetl.file import FileDFReader
        from onetl.file.format import CSV

        csv = CSV(delimiter=",")
        local_fs = SparkLocalFS(spark=spark)

        reader = FileDFReader(
            connection=local_fs,
            format=csv,
            source_path="/path/to/directory",
        )

    .. code-tab:: py All supported options

        from onetl.connection import SparkLocalFS
        from onetl.file import FileDFReader
        from onetl.file.format import CSV

        csv = CSV(delimiter=",")
        local_fs = SparkLocalFS(spark=spark)

        reader = FileDFReader(
            connection=local_fs,
            format=csv,
            source_path="/path/to/directory",
            options=FileDFReader.Options(recursive=False),
        )
run(files=None)
Method for reading files as DataFrame. |support_hooks|
.. versionadded:: 0.9.0
Parameters:

- **files** (``Iterator[str | PathLike] | None``, default: ``None``) – File list to read. If not set, read files from ``source_path``.

Returns:

- **df** (:obj:`pyspark.sql.DataFrame`) – Spark DataFrame.
Examples:
Read CSV files from directory ``/path``:
.. code:: python

    from onetl.connection import SparkLocalFS
    from onetl.file import FileDFReader
    from onetl.file.format import CSV

    csv = CSV(delimiter=",")
    local_fs = SparkLocalFS(spark=spark)

    reader = FileDFReader(
        connection=local_fs,
        format=csv,
        source_path="/path",
    )

    df = reader.run()
Read specific CSV files by passing their paths explicitly:
.. code:: python

    from onetl.connection import SparkLocalFS
    from onetl.file import FileDFReader
    from onetl.file.format import CSV

    csv = CSV(delimiter=",")
    local_fs = SparkLocalFS(spark=spark)

    reader = FileDFReader(
        connection=local_fs,
        format=csv,
    )

    df = reader.run(
        [
            "/path/file1.csv",
            "/path/nested/file2.csv",
        ]
    )
Read only specific CSV files within a source directory:

.. code:: python

    from onetl.connection import SparkLocalFS
    from onetl.file import FileDFReader
    from onetl.file.format import CSV

    csv = CSV(delimiter=",")
    local_fs = SparkLocalFS(spark=spark)

    reader = FileDFReader(
        connection=local_fs,
        format=csv,
        source_path="/path",
    )

    df = reader.run(
        [
            # file paths could be relative to source_path or absolute
            "/path/file1.csv",
            "/path/nested/file2.csv",
        ]
    )