[Bug]: Nessie Iceberg `loadTable` should not send `py-io-impl` as part of the config

Open kevinjqliu opened this issue 10 months ago • 7 comments

What happened

Context: https://github.com/apache/iceberg-python/issues/1589#issuecomment-2646078141

#9868 added py-io-impl=pyiceberg.io.fsspec.FsspecFileIO to icebergConfigDefaults so that the config endpoint (/config) sends this property. But this also makes loadTable send it. loadTable should not send this config as part of its table response, since it will override the client's FileIO configs.

How to reproduce it

Use pyiceberg to connect to a Nessie Iceberg REST catalog, then print the config returned in the load_table response.
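As a rough way to check the same thing without pyiceberg, the sketch below reads the `config` map straight from the Iceberg REST `loadTable` response using only the standard library. It is a stand-in for the pyiceberg step, not the reporter's exact procedure. The base URL, prefix, namespace, and table name are hypothetical placeholders, and it assumes an unauthenticated catalog and a single-level namespace.

```python
import json
import urllib.parse
import urllib.request


def extract_table_config(load_table_result: dict) -> dict:
    """Return the `config` map of a LoadTableResult body (empty if absent)."""
    return dict(load_table_result.get("config") or {})


def fetch_table_config(base_url: str, prefix: str, namespace: str, table: str) -> dict:
    """GET a table from an Iceberg REST catalog and return its `config` map.

    Assumptions: no authentication, single-level namespace.
    """
    # Percent-encode each path segment (a prefix may contain characters like "|").
    segments = [urllib.parse.quote(s, safe="") for s in (prefix, namespace, table)]
    url = f"{base_url}/v1/{segments[0]}/namespaces/{segments[1]}/tables/{segments[2]}"
    with urllib.request.urlopen(url) as resp:
        return extract_table_config(json.load(resp))


if __name__ == "__main__":
    # Hypothetical values; adjust to the local Nessie deployment.
    cfg = fetch_table_config("http://localhost:19120/iceberg", "main", "demo", "events")
    print(cfg.get("py-io-impl"))
```

If `py-io-impl` shows up in the printed map, the server is sending it as part of the table response.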

Nessie server type (docker/uber-jar/built from source) and version

N/A

Client type (Ex: UI/Spark/pynessie ...) and version

No response

Additional information

No response

kevinjqliu avatar Feb 09 '25 05:02 kevinjqliu

Yes, I agree.

@dimas-b my suggestion is to remove any "hardcoded" or default configurations from Nessie, leaving it to the admin (@guitcastro FYI) to configure those settings.

I think the initial configs have been added due to:

  • https://github.com/projectnessie/nessie/issues/9318

We then discussed the consequences of these default values here:

  • https://project-nessie.zulipchat.com/#narrow/channel/371187-general/topic/How.20to.20use.20PyarrowFileIO.20on.20pyiceberg.3F

This new feature partially solves the problem:

  • https://github.com/projectnessie/nessie/pull/10296

But in general I prefer Nessie to be as agnostic as possible about these configurations.

Unfortunately with the current implementation of pyiceberg (python) / iceberg (java) and in general on how the Catalog API has been designed, configs returned by the server always overwrite client's configs.
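To make the precedence concrete, here is a simplified model of that behaviour. It is not the actual pyiceberg or Iceberg Java code; the function name and the sample properties are made up for illustration.

```python
def effective_properties(client_props: dict, server_config: dict) -> dict:
    """Merge properties so that keys returned by the server win over client keys."""
    # Later entries in a dict merge override earlier ones.
    return {**client_props, **server_config}


client = {
    "py-io-impl": "pyiceberg.io.pyarrow.PyArrowFileIO",
    "s3.region": "us-east-1",
}
server = {"py-io-impl": "pyiceberg.io.fsspec.FsspecFileIO"}

merged = effective_properties(client, server)
print(merged["py-io-impl"])
```

Under this model the client's explicit `py-io-impl` is silently replaced, while keys the server does not send (here `s3.region`) are kept.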

bigluck avatar Feb 09 '25 10:02 bigluck

> Unfortunately with the current implementation of pyiceberg (python) / iceberg (java) and in general on how the Catalog API has been designed, configs returned by the server always overwrite client's configs.

Yea - it's not great that there's no distinction between "defaults" and "overrides" at that level.

What we could do in Nessie is to prefer the properties that have been explicitly set as table/view-properties. WDYT?

snazy avatar Feb 18 '25 11:02 snazy

> What we could do in Nessie is to prefer the properties that have been explicitly set as table/view-properties. WDYT?

It may not be the way to go unless I can override them or "blacklist" some of those configs.

Let's suppose an EMR cluster writes a table using Nessie/Iceberg. That table will have the usual Java-related properties. I, as a Python consumer that uses pyiceberg, can read that table without any interference, because my local pyiceberg will use arrow. That's great.

Then we have another producer that writes a table (on the same catalog) using s3fs. The producer could push its properties into the table metadata, and it could set py-io-impl to s3fs... In this case, my consumer (another Python microservice, but this time one that does not have s3fs, just arrow) will fail to read that table, because the metadata carries the info about which impl must be used, but I don't have that dep on my system.

The catalog may "override" that configuration, but still it will force all the consumers to use a specific implementation, which is wrong from my pov.

From an iceberg definition standpoint, I still believe this is an imperfect solution because the consumer must have the last word (always); I see why it is happening, 99% of the iceberg consumers use java, but if we want to have an open standard, we need to be open to different implementations.
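One way to picture "the consumer has the last word" is a client-side guard that only honours a requested implementation when it is actually importable. This is a hypothetical sketch, not something pyiceberg is known to do; `pick_io_impl` and its arguments are invented for illustration.

```python
import importlib


def pick_io_impl(requested: str, fallback: str) -> str:
    """Return `requested` (a dotted class path) if it can be imported, else `fallback`."""
    module_name, _, class_name = requested.rpartition(".")
    try:
        module = importlib.import_module(module_name)
    except (ImportError, ValueError):
        # Missing dependency, or a path with no module part.
        return fallback
    return requested if hasattr(module, class_name) else fallback
```

A consumer without s3fs installed would then fall back to its local default (for example the arrow-based impl) instead of failing when the table metadata asks for an fsspec-based one.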

IDK. Maybe the solution is to have an allowed and a disallowed list of properties to propagate to the clients/consumers, and to mask them at the catalog level.
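A minimal sketch of that masking idea, assuming the catalog filters the config map just before returning it. `mask_config` and its parameters are hypothetical names, not an existing Nessie setting.

```python
from typing import Iterable, Optional


def mask_config(
    config: dict,
    allowed: Optional[Iterable[str]] = None,
    disallowed: Iterable[str] = (),
) -> dict:
    """Keep only keys that are allowed (if an allow list is given) and not disallowed."""
    allow = None if allowed is None else set(allowed)
    deny = set(disallowed)
    return {
        key: value
        for key, value in config.items()
        if (allow is None or key in allow) and key not in deny
    }


table_config = {
    "py-io-impl": "pyiceberg.io.fsspec.FsspecFileIO",
    "s3.endpoint": "http://minio:9000",
}
print(mask_config(table_config, disallowed=["py-io-impl"]))
```

With `py-io-impl` on the disallowed list, the client never receives it and keeps whatever FileIO it configured locally.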

It could "save" Nessie, but yeah @kevinjqliu, this is an iceberg problem :)

Love :)

bigluck avatar Feb 19 '25 19:02 bigluck

Setting such a default leads to a really painful debugging experience. I noticed a significant performance difference (around 10x) between catalog implementations doing a simple table scan. It turned out to be entirely due to the use of different IO impls for S3: the one Nessie sets as the default is less performant.

Erigara avatar Jun 18 '25 13:06 Erigara

It is configurable AFAIK... something like `nessie.catalog.service.s3.default-options.table-config-overrides.py-io-impl=pyiceberg.io.pyarrow.PyArrowFileIO`

dimas-b avatar Jun 18 '25 14:06 dimas-b

It is, but it took some time to discover that it exists in the first place :)

Erigara avatar Jun 18 '25 15:06 Erigara

It's mentioned in the release notes for 0.102.3 :)

  • https://github.com/projectnessie/nessie/releases/tag/nessie-0.102.3
  • https://github.com/projectnessie/nessie/blob/main/CHANGELOG.md

dimas-b avatar Jun 18 '25 17:06 dimas-b