iceberg-python

[feat] Ability to read/write table using `version-hint.txt`

Open kevinjqliu opened this issue 1 year ago • 6 comments

Feature Request / Improvement

Although not in the official spec, version-hint.txt can be useful to read an iceberg table without a catalog.

This is useful when considering an iceberg table as a collection of files (metadata and data files) in a "directory" (s3 path). This can also be useful when ingesting iceberg tables without a catalog. An iceberg table can thus be "packaged" as a directory.

Example Use Case

  • An Iceberg table is created by a service (with a catalog) at the path (s3://blah/warehouse/foo/bar/)
  • Another service reads the Iceberg table by just providing the path (s3://blah/warehouse/foo/bar/)

When reading, version-hint.txt identifies the current metadata JSON file, which is usually obtained by querying the catalog. When writing, version-hint.txt is updated together with the atomic commit to the catalog.

Additionally, StaticTable can use version-hint.txt to create an iceberg table from a path.
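A rough sketch of the read side, to make the idea concrete. This is not PyIceberg's implementation. It assumes a Hadoop-style layout on a local filesystem where the hint lives at `metadata/version-hint.text` (the file name used later in this thread) and contains either a bare version number or a metadata file name. The function name `resolve_metadata_from_hint` and the `demo` helper are hypothetical.

```python
import tempfile
from pathlib import Path


def resolve_metadata_from_hint(table_dir: str) -> Path:
    """Return the metadata JSON path that metadata/version-hint.text points at."""
    metadata_dir = Path(table_dir) / "metadata"
    hint = (metadata_dir / "version-hint.text").read_text(encoding="utf-8").strip()

    if hint.endswith(".metadata.json"):
        # Assumption: the hint may already hold a full metadata file name.
        return metadata_dir / hint
    if hint.isdigit():
        # Assumption: a bare version number N maps to vN.metadata.json.
        return metadata_dir / f"v{hint}.metadata.json"
    # Assumption: anything else is a file-name stem without the suffix.
    return metadata_dir / f"{hint}.metadata.json"


def demo() -> str:
    """Build a throwaway table directory and resolve its current metadata file."""
    with tempfile.TemporaryDirectory() as table_dir:
        metadata_dir = Path(table_dir) / "metadata"
        metadata_dir.mkdir()
        (metadata_dir / "v3.metadata.json").write_text("{}")
        (metadata_dir / "version-hint.text").write_text("3\n")
        return resolve_metadata_from_hint(table_dir).name
```

The resolved path could then be handed to whatever loads a table from a metadata file. Reading from S3 would need an object-store client instead of `pathlib`, which this sketch does not cover.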

Relevant Issues:

cc @djouallah

kevinjqliu avatar May 23 '24 18:05 kevinjqliu

We discussed this issue in the monthly sync and agreed that this is a useful feature. We'll first implement the read side in pyiceberg. The write side is complicated because it has to support multiple concurrent writers and atomic updates in blob stores such as S3.

I will raise this issue with the Java Iceberg implementation and see whether there is also support for including this in the Iceberg spec.

kevinjqliu avatar May 29 '24 01:05 kevinjqliu

DuckDB appears to depend on the version-hint.text file when scanning iceberg.

image

lamb-russell avatar Jul 24 '24 02:07 lamb-russell

@lamb-russell duckdb_iceberg can read the "metadata json file" directly.

See https://github.com/steven-luabase/duckdb-iceberg-demo/issues/1#issuecomment-2215482225

kevinjqliu avatar Jul 24 '24 02:07 kevinjqliu

It would be great if duckdb_iceberg could support reading directly from the catalog.

kevinjqliu avatar Jul 24 '24 02:07 kevinjqliu

it is quite ironic, it seems the only iceberg vendor that generates hint.text is snowflake !!! go figure

edit: no more :( snowflake stopped producing hint.text :(

djouallah avatar Jul 24 '24 02:07 djouallah

I think it is fine to add support for reading the version-hint.txt, but we should not produce it.

Fokko avatar Jul 24 '24 07:07 Fokko

@Fokko is this issue still open to work on? For context, we had to build a PyIceberg-based Hadoop Catalog with a subset of features for backwards compatibility when moving Bodo from Iceberg-Java to PyIceberg. See https://github.com/bodo-ai/Bodo/blob/main/bodo/io/iceberg/catalog/dir.py. It would be nice to move at least the read parts to the main repo.

srilman avatar Mar 11 '25 22:03 srilman

fwiw, I just gave up and am using duckdb to read iceberg tables; pyiceberg is clearly not interested in this scenario

djouallah avatar Mar 12 '25 04:03 djouallah

@srilman Yes, I still think it would be valuable

Fokko avatar Mar 12 '25 08:03 Fokko

I submitted a small PR to allow using version-hint.text: https://github.com/apache/iceberg-python/pull/1887

arnaudbriche avatar Apr 07 '25 14:04 arnaudbriche

I see this was merged but not included in the last release. Any idea when it will be included in a release?

RoboDonut avatar Jun 13 '25 19:06 RoboDonut

The last release (0.9.1) was a patch release, and we don't want to include new functionality in patch releases. This will be part of 0.10.0.

Fokko avatar Jun 13 '25 21:06 Fokko