Expand partitioned dataset documentation
Description
Having recently used PartitionedDataSet for Spark + Parquet, it wasn't immediately clear from the documentation how to configure this. I'd like to see additional examples and more complete documentation of the initialisation arguments.
Context
This should speed up others configuring a partitioned dataset in similar circumstances.
Possible Implementation
The docs should include an example of a Spark + Parquet partitioned dataset, specifically noting that Parquet uses directories rather than files for its paths, and that these are not picked up with the default config. For example:
```yaml
example_partitioned_data:
  type: PartitionedDataSet
  path: /dbfs/path/to/folder/
  dataset:
    type: spark.SparkDataSet
    file_format: parquet
    save_args:
      mode: overwrite
  load_args:
    withdirs: true
  filename_suffix: '.pq'
```
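It might also help the docs to show how a node then consumes such a dataset: Kedro passes the node a dict mapping each partition ID to a zero-argument load callable. A minimal sketch (the function name and stand-in data below are invented for illustration, not Kedro source):

```python
# Hypothetical node consuming a PartitionedDataSet: Kedro injects a dict
# of {partition_id: load_callable}, so partitions can be loaded lazily.
def combine_partitions(partitions):
    """Concatenate all partitions into one list of records."""
    records = []
    for partition_id, load_partition in sorted(partitions.items()):
        records.extend(load_partition())  # load one partition on demand
    return records

# Stand-in for what the data catalog would inject at runtime:
fake_partitions = {
    "2023-01": lambda: [{"id": 1}],
    "2023-02": lambda: [{"id": 2}],
}
combined = combine_partitions(fake_partitions)
print(combined)  # [{'id': 1}, {'id': 2}]
```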
Additionally, this would benefit from extra documentation on filename_suffix; the present docs do not mention that the suffix is appended to every partition at write time.
This should be added to https://kedro.readthedocs.io/en/stable/data/kedro_io.html#partitioned-dataset
@merelcht I moved this to a new milestone since I think it is just a docs change, not related to the overall design.
@noklam in gh-2430
Details: https://kedro-org.slack.com/archives/C03RKP2LW64/p1677525924480159
It's not very clear that users can actually use load_args to support a different way of partitioning.
Context
Why is this change important to you? How would you use it? How can it benefit other users?
* Improve usage of `PartitionedDataSet`
* Iterating through a directory is quite common for deep learning pipelines, as you need to process data iteratively
* Many datasets come as folders of folders of files (e.g. image data)
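To make the `withdirs` point above concrete, here is a stdlib-only sketch that roughly mimics the discovery step PartitionedDataSet delegates to the filesystem's `find()` (the helper name is invented; the real implementation lives in fsspec). It shows why a Spark-written Parquet "file", which is actually a directory, is invisible unless directories are included:

```python
from pathlib import Path
import tempfile

def find_partitions(root, suffix="", withdirs=False):
    """Rough stand-in for fsspec find(): yield entries under root,
    optionally including directories, filtered by a filename suffix."""
    root = Path(root)
    for entry in sorted(root.rglob("*")):
        if entry.is_dir() and not withdirs:
            # Default behaviour skips directories, which hides
            # Spark's directory-per-table parquet output.
            continue
        if entry.name.endswith(suffix):
            yield str(entry.relative_to(root))

with tempfile.TemporaryDirectory() as tmp:
    # Spark writes each parquet "file" as a directory of part files:
    table = Path(tmp) / "2023-01.pq"
    table.mkdir()
    (table / "part-00000.parquet").write_text("data")

    without_dirs = list(find_partitions(tmp, suffix=".pq"))
    with_dirs = list(find_partitions(tmp, suffix=".pq", withdirs=True))
    print(without_dirs)  # [] - the parquet directory is skipped
    print(with_dirs)     # ['2023-01.pq']
```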
Closing this as a subset of general improvements in this page in #2941