iceberg-python

Metadata Log Entries metadata table

Open kevinjqliu opened this issue 1 year ago • 0 comments

Resolves #594 (and part of #511)

This PR creates a metadata table for "Metadata Log Entries", similar to its Spark equivalent (metadata_log_entries).

To query the metadata table, use

tbl.inspect.metadata_log_entries()

References

  • #524 (snapshots metadata table)
  • #602 (references metadata table)
  • #551 (entries metadata table)

The Spark metadata log entries table is implemented in MetadataLogEntriesTable.java.

The metadata log is modified during TableMetadata creation, when the current metadata log entry is appended (1, 2, 3). This leads to surprising behavior: the timestamp in the last row of the metadata log entries table depends on when the query ran.

For example,

import time

# Query the metadata table twice, 5 seconds apart
a = spark.sql(f"SELECT * FROM {identifier}.metadata_log_entries").toPandas()
time.sleep(5)
b = spark.sql(f"SELECT * FROM {identifier}.metadata_log_entries").toPandas()

(Pdb) display(a)
display (a):                 timestamp                                               file  latest_snapshot_id  latest_schema_id  latest_sequence_number
0 2024-04-28 17:21:31.336  s3://warehouse/default/table_metadata_log_entr...                 NaN               NaN                     NaN
1 2024-04-28 17:21:31.531  s3://warehouse/default/table_metadata_log_entr...        4.105762e+18               0.0                     0.0
2 2024-04-28 17:21:31.600  s3://warehouse/default/table_metadata_log_entr...        7.201925e+18               0.0                     0.0
3 2024-04-28 17:21:34.204  s3://warehouse/default/table_metadata_log_entr...        1.984627e+18               0.0                     0.0

(Pdb) display(b)
display (b):                 timestamp                                               file  latest_snapshot_id  latest_schema_id  latest_sequence_number
0 2024-04-28 17:21:31.336  s3://warehouse/default/table_metadata_log_entr...                 NaN               NaN                     NaN
1 2024-04-28 17:21:31.531  s3://warehouse/default/table_metadata_log_entr...        4.105762e+18               0.0                     0.0
2 2024-04-28 17:21:31.600  s3://warehouse/default/table_metadata_log_entr...        7.201925e+18               0.0                     0.0
3 2024-04-28 17:21:42.336  s3://warehouse/default/table_metadata_log_entr...        1.984627e+18               0.0                     0.0

# Note that the timestamps in the last row of a and b differ by more than 5 seconds (17:21:34.204 vs. 17:21:42.336)
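The effect can be illustrated with a toy model. This is only a sketch of the behavior described above, not the Java or PyIceberg implementation: the names `MetadataLogEntry` and `entries_with_current` are hypothetical, and the assumption is that the row for the current metadata file is stamped at read time while earlier rows come from the stored log.

```python
import time
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class MetadataLogEntry:
    """Hypothetical stand-in for one row of the metadata log."""
    timestamp_ms: int
    metadata_file: str


def entries_with_current(
    previous: Sequence[MetadataLogEntry],
    current_file: str,
    now_ms: Optional[int] = None,
) -> List[MetadataLogEntry]:
    """Return the stored log plus a row for the current metadata file.

    Assumption: the appended row is stamped with the read time, which is
    why its timestamp changes from one query to the next.
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return [*previous, MetadataLogEntry(now_ms, current_file)]


stored = [
    MetadataLogEntry(1_000, "v1.metadata.json"),
    MetadataLogEntry(2_000, "v2.metadata.json"),
]
a = entries_with_current(stored, "v3.metadata.json", now_ms=10_000)
b = entries_with_current(stored, "v3.metadata.json", now_ms=15_000)
# All rows except the last are identical; only the last timestamp moves.
```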

The snapshot-by-timestamp lookup (_snapshot_as_of_timestamp_ms) is modeled after snapshotIdAsOfTime in the Java implementation.
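A minimal sketch of an as-of lookup in that style is shown here. It is not the code in this PR: `SnapshotLogEntry` and `snapshot_id_as_of` are hypothetical names, the log is assumed to be ordered oldest to newest, and returning None when no snapshot qualifies is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class SnapshotLogEntry:
    """Hypothetical snapshot log row: which snapshot became current, and when."""
    snapshot_id: int
    timestamp_ms: int


def snapshot_id_as_of(
    log: Sequence[SnapshotLogEntry], timestamp_ms: int
) -> Optional[int]:
    """Return the id of the last snapshot that was current at timestamp_ms.

    Walks the log in order and keeps the latest entry whose timestamp is
    at or before the requested time. Returns None if there is none.
    """
    result: Optional[int] = None
    for entry in log:
        if entry.timestamp_ms <= timestamp_ms:
            result = entry.snapshot_id
    return result


log = [
    SnapshotLogEntry(snapshot_id=11, timestamp_ms=100),
    SnapshotLogEntry(snapshot_id=22, timestamp_ms=200),
    SnapshotLogEntry(snapshot_id=33, timestamp_ms=300),
]
```

With this log, a lookup at 250 falls between the second and third entries and so resolves to the second snapshot, while a lookup before the first entry finds nothing. That corresponds to the NaN values in the first row of the tables above.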

There is an issue when reading v1 metadata: sequence-number is None instead of 0. According to the Iceberg spec, when reading v1 metadata for v2, the Snapshot field sequence-number must default to 0 (source).
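The defaulting rule can be sketched as follows. This illustrates the spec requirement only, not the fix applied in PyIceberg: `read_sequence_number` is a hypothetical helper, and the snapshot is assumed to be available as a plain dict parsed from the metadata JSON.

```python
from typing import Any, Dict

# Default required by the spec when v1 metadata is read as v2.
DEFAULT_SEQUENCE_NUMBER = 0


def read_sequence_number(snapshot: Dict[str, Any]) -> int:
    """Return the snapshot's sequence number, defaulting to 0.

    v1 snapshots do not carry "sequence-number", so a missing or null
    value is treated as 0 rather than propagated as None.
    """
    value = snapshot.get("sequence-number")
    return DEFAULT_SEQUENCE_NUMBER if value is None else int(value)


v1_snapshot = {"snapshot-id": 1, "timestamp-ms": 100}
v2_snapshot = {"snapshot-id": 2, "timestamp-ms": 200, "sequence-number": 5}
```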

kevinjqliu • Apr 28 '24 16:04