Description

The get_add_actions method returns information about the files in a Delta table including the path, size, min/max column values, etc. Would be great if this returned information about the partition as well.

Nov 17 '23 21:11 MrPowers

It does already. Are you not seeing that in some example?

https://github.com/delta-io/delta-rs/blob/0ef811063aa12ca84908e62d714ad6f5a69fa1c5/python/tests/test_table_read.py#L331-L338

Nov 17 '23 21:11 wjones127

@wjones127 - I was looking at this example in the docs: https://delta-io.github.io/delta-rs/usage/examining-table/#current-add-actions

Perhaps the column is just not shown if the table isn't partitioned?

Nov 17 '23 21:11 MrPowers

Perhaps the column is just not shown if the table isn't partitioned?

Exactly. I suppose to make it clearer, we could have a null partition column shown if there aren't any.

Nov 17 '23 22:11 wjones127

Yeah, the column does not exist for non-partitioned tables. Currently I need to check if the column exists in the get_add_actions schema, for example.

    @cached_property
    def filters(self) -> DNFFilter:
        transactions = self._filtered_transactions or self.transactions
        dnf_filters: DNFFilter = []

        if "partition_values" in transactions.schema.names:
            partition_values = transactions["partition_values"].to_pylist()

            dnf_filters = [
                [(key, "=", str(value)) for key, value in partition.items()]
                for partition in partition_values
            ]

        return process_filters(dnf_filters)

Nov 28 '23 16:11 ldacey

Is that any worse than having to check whether it exists and is not null?

Nov 28 '23 16:11 wjones127

I think it'd be helpful to show it, even if it's null, just so ppl know it exists.

Nov 28 '23 18:11 MrPowers

take

Jan 03 '24 22:01 r3stl355

So @MrPowers, how would you see this to be presented to the consumer? At the moment there are 2 different ways the partition information is returned by get_add_actions

if called with flatten=true then it returns a struct with the fields being the partition column names (and values being the arrays). I can't see a way we can construct this structure if the table is not partitioned. One solution would be to use the same return as for the option 2 below. Output example of this

tests/test_table_read.py::test_add_actions_table[True] pyarrow.RecordBatch
path: string not null
size_bytes: int64 not null
modification_time: timestamp[ms] not null
data_change: bool not null
partition.year: string not null
partition.month: string not null
partition.day: string not null
----
path: ["year=2020/month=1/day=1/part-00000-8eafa330-3be9-4a39-ad78-fd13c2027c7e.c000.snappy.parquet","year=2020/month=2/day=3/part-00000-94d16827-f2fd-42cd-a060-f67ccc63ced9.c000.snappy.parquet","year=2020/month=2/day=5/part-00000-89cdd4c8-2af7-4add-8ea3-3990b2f027b5.c000.snappy.parquet","year=2021/month=12/day=20/part-00000-9275fdf4-3961-4184-baa0-1c8a2bb98104.c000.snappy.parquet","year=2021/month=12/day=4/part-00000-6dc763c0-3e8b-4d52-b19e-1f92af3fbb25.c000.snappy.parquet","year=2021/month=4/day=5/part-00000-c5856301-3439-4032-a6fc-22b7bc92bebb.c000.snappy.parquet"]
size_bytes: [414,414,414,407,414,414]
modification_time: [2021-03-12 13:27:26.000,2021-03-12 13:27:26.000,2021-03-12 13:27:26.000,2021-03-12 13:27:26.000,2021-03-12 13:27:26.000,2021-03-12 13:27:26.000]
data_change: [true,true,true,true,true,true]
partition.year: ["2020","2020","2020","2021","2021","2021"]
partition.month: ["1","2","2","12","12","4"]
partition.day: ["1","3","5","20","4","5"]

when flatten=true then the returned record batch will have a column named partition_values containing an array of structs and each struct having a field per partition column. For a non table this could be a null (not a struct). Here is an example for a partioned table

tests/test_table_read.py::test_add_actions_table[True] pyarrow.RecordBatch
path: string not null
size_bytes: int64 not null
modification_time: timestamp[ms] not null
data_change: bool not null
partition.year: string not null
partition.month: string not null
partition.day: string not null
----
path: ["year=2020/month=1/day=1/part-00000-8eafa330-3be9-4a39-ad78-fd13c2027c7e.c000.snappy.parquet","year=2020/month=2/day=3/part-00000-94d16827-f2fd-42cd-a060-f67ccc63ced9.c000.snappy.parquet","year=2020/month=2/day=5/part-00000-89cdd4c8-2af7-4add-8ea3-3990b2f027b5.c000.snappy.parquet","year=2021/month=12/day=20/part-00000-9275fdf4-3961-4184-baa0-1c8a2bb98104.c000.snappy.parquet","year=2021/month=12/day=4/part-00000-6dc763c0-3e8b-4d52-b19e-1f92af3fbb25.c000.snappy.parquet","year=2021/month=4/day=5/part-00000-c5856301-3439-4032-a6fc-22b7bc92bebb.c000.snappy.parquet"]
size_bytes: [414,414,414,407,414,414]
modification_time: [2021-03-12 13:27:26.000,2021-03-12 13:27:26.000,2021-03-12 13:27:26.000,2021-03-12 13:27:26.000,2021-03-12 13:27:26.000,2021-03-12 13:27:26.000]
data_change: [true,true,true,true,true,true]
partition.year: ["2020","2020","2020","2021","2021","2021"]
partition.month: ["1","2","2","12","12","4"]
partition.day: ["1","3","5","20","4","5"]

Here is what I am proposing will look like (not the same table) - though that nul not null is a bit confusing to me

tests/test_table_read.py::test_add_actions_table_from_unpartitioned_table[False] pyarrow.RecordBatch
path: string not null
size_bytes: int64 not null
modification_time: timestamp[ms] not null
data_change: bool not null
partition_values: null not null
...
----
path: ["part-00000-c9b90f86-73e6-46c8-93ba-ff6bfaf892a1-c000.snappy.parquet","part-00000-04ec9591-0b73-459e-8d18-ba5711d6cbe1-c000.snappy.parquet"]
size_bytes: [440,440]
modification_time: [2021-03-06 15:16:07.000,2021-03-06 15:16:16.000]
data_change: [true,true]
partition_values: 2 nulls
...

There is also a way to Create a new [StructArray] of length len where all values are null so maybe we could consider using that with some dummy field name

@ldacey you seem to be using this, would in your opinion would be the best way to present this to consumer?

Jan 04 '24 22:01 r3stl355

@r3stl355 - see this docs page for an example of what I'm referring to.

The following code snippet doesn't output a partition_values column if the Delta table is not partitioned:

dt = DeltaTable("../rust/tests/data/delta-0.8.0")
dt.get_add_actions(flatten=True).to_pandas()

It would be great if this output a partition_values column, even if the Delta table is not partitioned. Let me know if that helps clarify!

Jan 04 '24 23:01 MrPowers

Let me re-phrase my questions @MrPowers

If the table is not partitioned, what should the partition_values column contain (e.g. nulls)?
Calling get_add_actions(flatten=True).to_pandas() on a partitioned table will never have a field called partition_values because it will create a flattened column names (e.g. partition.year). Would this be too confusing to the user because they would see different column name depending on if the table is partitioned or not (this will not be happening if flatten=False adding further confusion imo)

Jan 05 '24 09:01 r3stl355

Perhaps we could just add another example to the docs to show what it looks like for a partitioned table?

I think that would address @MrPowers original concern that it wasn't clear it returned partition values. Trying to change the output schema seems like too much of a headache, IMO.

Jan 05 '24 17:01 wjones127

I agree. Adding a column with null values is not a problem but the fact that it will also change the expected column format could be confusing.

What say you @MrPowers, should we just add more info to the docs, e.g. add an example of output from using flatten=False and also a note that if the table is not partitioned then there will be no partition related information in the data returned by get_add_actions?

Jan 05 '24 18:01 r3stl355

delta-rs
delta-rs copied to clipboard

Add partition value to get_add_actions

Description

delta-rs delta-rs copied to clipboard

Add partition value to get_add_actions

Description

delta-rs
delta-rs copied to clipboard