delta-rs
delta-rs copied to clipboard
Add partition value to get_add_actions
Description
The get_add_actions method returns information about the files in a Delta table including the path, size, min/max column values, etc. Would be great if this returned information about the partition as well.
It does already. Are you not seeing that in some example?
https://github.com/delta-io/delta-rs/blob/0ef811063aa12ca84908e62d714ad6f5a69fa1c5/python/tests/test_table_read.py#L331-L338
@wjones127 - I was looking at this example in the docs: https://delta-io.github.io/delta-rs/usage/examining-table/#current-add-actions
Perhaps the column is just not shown if the table isn't partitioned?
Perhaps the column is just not shown if the table isn't partitioned?
Exactly. I suppose to make it clearer, we could have a null partition column shown if there aren't any.
Yeah, the column does not exist for non-partitioned tables. Currently I need to check if the column exists in the get_add_actions schema, for example.
@cached_property
def filters(self) -> DNFFilter:
transactions = self._filtered_transactions or self.transactions
dnf_filters: DNFFilter = []
if "partition_values" in transactions.schema.names:
partition_values = transactions["partition_values"].to_pylist()
dnf_filters = [
[(key, "=", str(value)) for key, value in partition.items()]
for partition in partition_values
]
return process_filters(dnf_filters)
Is that any worse than having to check whether it exists and is not null?
I think it'd be helpful to show it, even if it's null, just so ppl know it exists.
take
So @MrPowers, how would you see this to be presented to the consumer? At the moment there are 2 different ways the partition information is returned by get_add_actions
- if called with
flatten=truethen it returns a struct with the fields being the partition column names (and values being the arrays). I can't see a way we can construct this structure if the table is not partitioned. One solution would be to use the same return as for the option 2 below. Output example of this
tests/test_table_read.py::test_add_actions_table[True] pyarrow.RecordBatch
path: string not null
size_bytes: int64 not null
modification_time: timestamp[ms] not null
data_change: bool not null
partition.year: string not null
partition.month: string not null
partition.day: string not null
----
path: ["year=2020/month=1/day=1/part-00000-8eafa330-3be9-4a39-ad78-fd13c2027c7e.c000.snappy.parquet","year=2020/month=2/day=3/part-00000-94d16827-f2fd-42cd-a060-f67ccc63ced9.c000.snappy.parquet","year=2020/month=2/day=5/part-00000-89cdd4c8-2af7-4add-8ea3-3990b2f027b5.c000.snappy.parquet","year=2021/month=12/day=20/part-00000-9275fdf4-3961-4184-baa0-1c8a2bb98104.c000.snappy.parquet","year=2021/month=12/day=4/part-00000-6dc763c0-3e8b-4d52-b19e-1f92af3fbb25.c000.snappy.parquet","year=2021/month=4/day=5/part-00000-c5856301-3439-4032-a6fc-22b7bc92bebb.c000.snappy.parquet"]
size_bytes: [414,414,414,407,414,414]
modification_time: [2021-03-12 13:27:26.000,2021-03-12 13:27:26.000,2021-03-12 13:27:26.000,2021-03-12 13:27:26.000,2021-03-12 13:27:26.000,2021-03-12 13:27:26.000]
data_change: [true,true,true,true,true,true]
partition.year: ["2020","2020","2020","2021","2021","2021"]
partition.month: ["1","2","2","12","12","4"]
partition.day: ["1","3","5","20","4","5"]
- when
flatten=truethen the returned record batch will have a column namedpartition_valuescontaining an array of structs and each struct having a field per partition column. For a non table this could be a null (not a struct). Here is an example for a partioned table
tests/test_table_read.py::test_add_actions_table[True] pyarrow.RecordBatch
path: string not null
size_bytes: int64 not null
modification_time: timestamp[ms] not null
data_change: bool not null
partition.year: string not null
partition.month: string not null
partition.day: string not null
----
path: ["year=2020/month=1/day=1/part-00000-8eafa330-3be9-4a39-ad78-fd13c2027c7e.c000.snappy.parquet","year=2020/month=2/day=3/part-00000-94d16827-f2fd-42cd-a060-f67ccc63ced9.c000.snappy.parquet","year=2020/month=2/day=5/part-00000-89cdd4c8-2af7-4add-8ea3-3990b2f027b5.c000.snappy.parquet","year=2021/month=12/day=20/part-00000-9275fdf4-3961-4184-baa0-1c8a2bb98104.c000.snappy.parquet","year=2021/month=12/day=4/part-00000-6dc763c0-3e8b-4d52-b19e-1f92af3fbb25.c000.snappy.parquet","year=2021/month=4/day=5/part-00000-c5856301-3439-4032-a6fc-22b7bc92bebb.c000.snappy.parquet"]
size_bytes: [414,414,414,407,414,414]
modification_time: [2021-03-12 13:27:26.000,2021-03-12 13:27:26.000,2021-03-12 13:27:26.000,2021-03-12 13:27:26.000,2021-03-12 13:27:26.000,2021-03-12 13:27:26.000]
data_change: [true,true,true,true,true,true]
partition.year: ["2020","2020","2020","2021","2021","2021"]
partition.month: ["1","2","2","12","12","4"]
partition.day: ["1","3","5","20","4","5"]
Here is what I am proposing will look like (not the same table) - though that nul not null is a bit confusing to me
tests/test_table_read.py::test_add_actions_table_from_unpartitioned_table[False] pyarrow.RecordBatch
path: string not null
size_bytes: int64 not null
modification_time: timestamp[ms] not null
data_change: bool not null
partition_values: null not null
...
----
path: ["part-00000-c9b90f86-73e6-46c8-93ba-ff6bfaf892a1-c000.snappy.parquet","part-00000-04ec9591-0b73-459e-8d18-ba5711d6cbe1-c000.snappy.parquet"]
size_bytes: [440,440]
modification_time: [2021-03-06 15:16:07.000,2021-03-06 15:16:16.000]
data_change: [true,true]
partition_values: 2 nulls
...
There is also a way to Create a new [StructArray] of length len where all values are null so maybe we could consider using that with some dummy field name
@ldacey you seem to be using this, would in your opinion would be the best way to present this to consumer?
@r3stl355 - see this docs page for an example of what I'm referring to.
The following code snippet doesn't output a partition_values column if the Delta table is not partitioned:
dt = DeltaTable("../rust/tests/data/delta-0.8.0")
dt.get_add_actions(flatten=True).to_pandas()
It would be great if this output a partition_values column, even if the Delta table is not partitioned. Let me know if that helps clarify!
Let me re-phrase my questions @MrPowers
-
If the table is not partitioned, what should the
partition_valuescolumn contain (e.g.nulls)? -
Calling
get_add_actions(flatten=True).to_pandas()on a partitioned table will never have a field calledpartition_valuesbecause it will create a flattened column names (e.g.partition.year). Would this be too confusing to the user because they would see different column name depending on if the table is partitioned or not (this will not be happening ifflatten=Falseadding further confusion imo)
Perhaps we could just add another example to the docs to show what it looks like for a partitioned table?
I think that would address @MrPowers original concern that it wasn't clear it returned partition values. Trying to change the output schema seems like too much of a headache, IMO.
I agree. Adding a column with null values is not a problem but the fact that it will also change the expected column format could be confusing.
What say you @MrPowers, should we just add more info to the docs, e.g. add an example of output from using flatten=False and also a note that if the table is not partitioned then there will be no partition related information in the data returned by get_add_actions?