flyteidl
flyteidl copied to clipboard
add partition_columns to StructuredDatasetType
Signed-off-by: Niels Bantilan [email protected]
Add partition_columns
to StructuredDatasetType
Partially addresses https://github.com/flyteorg/flyte/issues/3219
TL;DR
This PR adds an additional property to the StructureDatasetType
protobuf definition so that metadata about which columns in the dataset (some kind of DataFrame object) are used for partitioning the dataset into chunks, for example when a pandas.DataFrame
is serialized as a parquet file.
Type
- [ ] Bug Fix
- [X] Feature
- [ ] Plugin
Are all requirements met?
- [X] Code completed
- [ ] Smoke tested
- [ ] Unit tests added
- [ ] Code documentation added
- [ ] Any pending items have an associated Issue
Complete description
This change is required to store additional metadata about which columns are used for partitioning. Currently this only meaningfully affects the serialization/deserialization of parquet files, but in the future we could support the partitioning of other serialization formats.
Tracking Issue
Partly addresses https://github.com/flyteorg/flyte/issues/3219
Follow-up issue
NA
Codecov Report
Merging #364 (dee449a) into master (f3724b4) will not change coverage. The diff coverage is
n/a
.
@@ Coverage Diff @@
## master #364 +/- ##
=======================================
Coverage 73.71% 73.71%
=======================================
Files 18 18
Lines 1377 1377
=======================================
Hits 1015 1015
Misses 311 311
Partials 51 51
Flag | Coverage Δ | |
---|---|---|
unittests | 73.71% <ø> (ø) |
Flags with carried forward coverage won't be shown. Click here to find out more.
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.
@eapolinario please confirm, did we decide this needed to be included in the serialized flyteidl type or could just be read from metadata within flytekit at runtime?
I may be missing something, but we need to include it in the type so that the structured dataset decoder has access to the metadata (unless we want to manually inspect the uri path for multiple directories)