flyteidl icon indicating copy to clipboard operation
flyteidl copied to clipboard

add partition_columns to StructuredDatasetType

Open cosmicBboy opened this issue 2 years ago • 2 comments

Signed-off-by: Niels Bantilan [email protected]

Add partition_columns to StructuredDatasetType

Partially addresses https://github.com/flyteorg/flyte/issues/3219

TL;DR

This PR adds an additional property to the StructureDatasetType protobuf definition so that metadata about which columns in the dataset (some kind of DataFrame object) are used for partitioning the dataset into chunks, for example when a pandas.DataFrame is serialized as a parquet file.

Type

  • [ ] Bug Fix
  • [X] Feature
  • [ ] Plugin

Are all requirements met?

  • [X] Code completed
  • [ ] Smoke tested
  • [ ] Unit tests added
  • [ ] Code documentation added
  • [ ] Any pending items have an associated Issue

Complete description

This change is required to store additional metadata about which columns are used for partitioning. Currently this only meaningfully affects the serialization/deserialization of parquet files, but in the future we could support the partitioning of other serialization formats.

Tracking Issue

Partly addresses https://github.com/flyteorg/flyte/issues/3219

Follow-up issue

NA

cosmicBboy avatar Feb 10 '23 16:02 cosmicBboy

Codecov Report

Merging #364 (dee449a) into master (f3724b4) will not change coverage. The diff coverage is n/a.

@@           Coverage Diff           @@
##           master     #364   +/-   ##
=======================================
  Coverage   73.71%   73.71%           
=======================================
  Files          18       18           
  Lines        1377     1377           
=======================================
  Hits         1015     1015           
  Misses        311      311           
  Partials       51       51           
Flag Coverage Δ
unittests 73.71% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

codecov[bot] avatar Feb 10 '23 16:02 codecov[bot]

@eapolinario please confirm, did we decide this needed to be included in the serialized flyteidl type or could just be read from metadata within flytekit at runtime?

I may be missing something, but we need to include it in the type so that the structured dataset decoder has access to the metadata (unless we want to manually inspect the uri path for multiple directories)

cosmicBboy avatar Feb 10 '23 18:02 cosmicBboy