Expand partitioned dataset documentation
Description
Having recently used PartitionedDataSet for Spark + Parquet, it wasn't immediately clear from the documentation how to configure this. I'd like to see additional examples and more complete documentation of the initialisation arguments.
Context
This should speed up others configuring a partitioned dataset in similar circumstances.
Possible Implementation
The docs should include an example of a Spark + Parquet partitioned dataset, specifically noting that Parquet uses directories rather than files for its paths, and that these are not picked up with the default config. For example:
```yaml
example_partitioned_data:
  type: PartitionedDataSet
  path: /dbfs/path/to/folder/
  dataset:
    type: spark.SparkDataSet
    file_format: parquet
    save_args:
      mode: overwrite
  load_args:
    withdirs: true
  filename_suffix: '.pq'
```
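It might also help the docs to show how a node then consumes such a dataset: Kedro passes the node a dict mapping each partition ID to a zero-argument load callable. A minimal sketch (the function name and stand-in data below are invented for illustration, not Kedro source):

```python
# Hypothetical node consuming a PartitionedDataSet: Kedro injects a dict
# of {partition_id: load_callable}, so partitions can be loaded lazily.
def combine_partitions(partitions):
    """Concatenate all partitions into one list of records."""
    records = []
    for partition_id, load_partition in sorted(partitions.items()):
        records.extend(load_partition())  # load one partition on demand
    return records

# Stand-in for what the data catalog would inject at runtime:
fake_partitions = {
    "2023-01": lambda: [{"id": 1}],
    "2023-02": lambda: [{"id": 2}],
}
combined = combine_partitions(fake_partitions)
print(combined)  # [{'id': 1}, {'id': 2}]
```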
Additionally, this would benefit from extra documentation on filename_suffix; the present docs do not mention that the suffix is appended to every partition at write time.
This should be added to https://kedro.readthedocs.io/en/stable/data/kedro_io.html#partitioned-dataset
@merelcht I moved this to a new milestone since I think it is just a docs change, not related to the overall design.
@noklam in gh-2430
Details: https://kedro-org.slack.com/archives/C03RKP2LW64/p1677525924480159
It's not very clear that users can actually use load_args to support a different way of partitioning.
Context
Why is this change important to you? How would you use it? How can it benefit other users?
* Improve usage of `PartitionedDataSet`
* Iterating through a directory is quite common for deep learning pipelines, as you need to process data iteratively
* Many datasets come as folders of folders of files (e.g. image data)
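To make the `withdirs` point above concrete, here is a stdlib-only sketch that roughly mimics the discovery step PartitionedDataSet delegates to the filesystem's `find()` (the helper name is invented; the real implementation lives in fsspec). It shows why a Spark-written Parquet "file", which is actually a directory, is invisible unless directories are included:

```python
from pathlib import Path
import tempfile

def find_partitions(root, suffix="", withdirs=False):
    """Rough stand-in for fsspec find(): yield entries under root,
    optionally including directories, filtered by a filename suffix."""
    root = Path(root)
    for entry in sorted(root.rglob("*")):
        if entry.is_dir() and not withdirs:
            # Default behaviour skips directories, which hides
            # Spark's directory-per-table parquet output.
            continue
        if entry.name.endswith(suffix):
            yield str(entry.relative_to(root))

with tempfile.TemporaryDirectory() as tmp:
    # Spark writes each parquet "file" as a directory of part files:
    table = Path(tmp) / "2023-01.pq"
    table.mkdir()
    (table / "part-00000.parquet").write_text("data")

    without_dirs = list(find_partitions(tmp, suffix=".pq"))
    with_dirs = list(find_partitions(tmp, suffix=".pq", withdirs=True))
    print(without_dirs)  # [] - the parquet directory is skipped
    print(with_dirs)     # ['2023-01.pq']
```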
Closing this as a subset of general improvements in this page in #2941