
Deleting with condition in partitioned tables is buggy

Open panbamid-r opened this issue 1 year ago • 1 comment

Apache Iceberg version

0.7.0 (latest release)

Please describe the bug 🐞

TL;DR: Deleting records based on a condition from a partitioned Iceberg table yields incorrect results and deletes more records than expected.

I encountered the following issue while experimenting with pyiceberg and trying to delete records from a partitioned table based on a condition. I am using the latest release of pyiceberg (0.7.0) with the Glue catalog.

I have a pandas dataframe with 100 records, populated as follows (a generation sketch is shown after the list):

  • id: randomly, between 1 and 10,000
  • created_at: date, randomly between 2024-01-01 and 2024-02-01
  • relevancy_score: randomly, between 0 and 1, following a beta distribution with alpha=2, beta=20.
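
For reference, here is a minimal sketch (not part of the original report) of how such a dataframe can be generated with numpy and pandas; the column names match the schema below, everything else is illustrative:

import numpy as np
import pandas as pd

rng = np.random.default_rng()
n = 100

df = pd.DataFrame(
    {
        # ids drawn uniformly between 1 and 10,000
        "id": rng.integers(1, 10_001, size=n),
        # dates drawn uniformly between 2024-01-01 and 2024-02-01
        "created_at": pd.to_datetime("2024-01-01")
        + pd.to_timedelta(rng.integers(0, 32, size=n), unit="D"),
        # relevancy scores drawn from a Beta(alpha=2, beta=20) distribution
        "relevancy_score": rng.beta(2, 20, size=n),
    }
)

# The Iceberg column is a DateType, so store plain dates rather than timestamps
df["created_at"] = df["created_at"].dt.date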

Having created the Iceberg table:

from pyiceberg.partitioning import PartitionField, PartitionSpec
from pyiceberg.schema import Schema
from pyiceberg.transforms import DayTransform
from pyiceberg.types import DateType, DoubleType, LongType, NestedField

# `catalog` is the Glue catalog, e.g. obtained via pyiceberg.catalog.load_catalog(...)

schema = Schema(
    NestedField(field_id=101, name="id", field_type=LongType(), required=True),
    NestedField(field_id=103, name="created_at", field_type=DateType(), required=False),
    NestedField(field_id=104, name="relevancy_score", field_type=DoubleType(), required=False),
)

partition_spec = PartitionSpec(
    PartitionField(
        source_id=103, field_id=2000, transform=DayTransform(), name="created_at_day"
    )
)

iceberg_table = catalog.create_table(
    identifier=f"{glue_database_name}.{table_name}",
    schema=schema,
    partition_spec=partition_spec,
)

I append my df to the table:

import pyarrow as pa

# Convert the pandas dataframe to Arrow using the table's schema
arrow_schema = iceberg_table.schema().as_arrow()
docs_table = pa.Table.from_pandas(df, schema=arrow_schema)

# Append the data to the Iceberg table
iceberg_table.append(docs_table)

and I get the following info:

from pyiceberg.expressions import GreaterThanOrEqual, LessThan

total_records = iceberg_table.scan().to_pandas().shape
print(f"Total records after append: {total_records}")

lower_before = iceberg_table.scan(row_filter=LessThan("relevancy_score", 0.1)).to_pandas().shape
print(f"Records with relevancy_score < 0.1 before deletion: {lower_before}")

greater_before = iceberg_table.scan(row_filter=GreaterThanOrEqual("relevancy_score", 0.1)).to_pandas().shape
print(f"Records with relevancy_score >= 0.1 before deletion: {greater_before}")

Total records after append: (100, 3)
Records with relevancy_score < 0.1 before deletion: (64, 3)
Records with relevancy_score >= 0.1 before deletion: (36, 3)

so then I run the deletion:

delete_condition = GreaterThanOrEqual("relevancy_score", 0.1)
iceberg_table.delete(delete_condition)

and the results are quite unexpected:

# Verify the total number of records in the table after deletion
remaining_records = iceberg_table.scan().to_pandas().shape
print(f"Total records after deletion: {remaining_records}")

# Verify the number of records that meet the relevancy_score condition after deletion
lower_after = iceberg_table.scan(row_filter=LessThan("relevancy_score", 0.1)).to_pandas().shape
print(f"Records with relevancy_score < 0.1 after deletion: {lower_after}")

greater_after = iceberg_table.scan(row_filter=GreaterThanOrEqual("relevancy_score", 0.1)).to_pandas().shape
print(f"Records with relevancy_score >= 0.1 after deletion: {greater_after}")

Total records after deletion: (26, 3)
Records with relevancy_score < 0.1 after deletion: (26, 3)
Records with relevancy_score >= 0.1 after deletion: (0, 3)

After checking the manifest files post-deletion (one way to inspect the remaining data files is sketched after this list), what seems to be happening is:

  • The snapshot correctly keeps the files/days where there are no records where relevancy_score >= 0.1
  • The snapshot correctly marks as deleted the files/days where there are no records where relevancy_score < 0.1
  • The snapshot seems to incorrectly ignore completely files/days where there are both records with relevancy_score >= 0.1 and relevancy_score < 0.1
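
One way to see which data files the post-delete snapshot still references is to list the files a full scan would read. A minimal sketch (not part of the original report), assuming the iceberg_table variable from above:

# Data files the current snapshot would scan, with their partition and row count
for task in iceberg_table.scan().plan_files():
    print(task.file.file_path, task.file.partition, task.file.record_count)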

For example, let's consider the following slice of the dataset:

id    created_at  relevancy_score
808   2024-01-02  0.18115229995825877
1424  2024-01-02  0.13995060752152988
8026  2024-01-02  0.05992063209461162

The .avro manifest files after the deletion do not reference any data files of the 2024-01-02 partition whatsoever, even though the record with id=8026 (relevancy_score < 0.1) should have been kept.
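
A hypothetical check (not in the original report) that makes this visible: scanning just the 2024-01-02 partition should still return the id=8026 row, but with the behaviour described above it comes back empty. The date literal is passed as an ISO string, which pyiceberg converts when binding against the DateType column:

from pyiceberg.expressions import EqualTo

print(iceberg_table.scan(row_filter=EqualTo("created_at", "2024-01-02")).to_pandas())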

I'm attaching an accompanying notebook which I used to reproduce the issue. I did not encounter a similar problem when working with a non-partitioned table.

panbamid-r avatar Aug 12 '24 09:08 panbamid-r

@panbamid-r Thanks for raising this and for the comprehensive example. I'm able to reproduce it locally; let me dig into this right away.

Fokko avatar Aug 12 '24 10:08 Fokko