[Bug]: Nessie Iceberg `loadTable` should not send `py-io-impl` as part of the config
What happened
Context: https://github.com/apache/iceberg-python/issues/1589#issuecomment-2646078141
#9868 added the py-io-impl=pyiceberg.io.fsspec.FsspecFileIO config to icebergConfigDefaults so that the config endpoint (/config) sends this property. But this also makes loadTable send this config.
loadTable should not send this config as part of its table response, since it overrides the client's FileIO configs.
How to reproduce it
Use pyiceberg to connect to the Nessie Iceberg REST catalog, then print the load_table response config.
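A minimal sketch of the check, assuming the raw JSON body of a loadTable response is already at hand. The `config` field is part of the Iceberg REST `LoadTableResult`; the helper name `io_impl_from_load_table` and the sample payload are made up for illustration and are not actual Nessie output.

```python
from typing import Any, Mapping, Optional

PY_IO_IMPL = "py-io-impl"


def io_impl_from_load_table(response: Mapping[str, Any]) -> Optional[str]:
    """Return the py-io-impl value carried in a loadTable response's config, if any."""
    config = response.get("config") or {}
    return config.get(PY_IO_IMPL)


# Hypothetical payload shaped like a LoadTableResult (values are placeholders).
sample_response = {
    "metadata-location": "s3://example-bucket/db/tbl/metadata/v1.metadata.json",
    "metadata": {},
    "config": {"py-io-impl": "pyiceberg.io.fsspec.FsspecFileIO"},
}

# A non-None result means the server is pushing a FileIO choice to the client.
print(io_impl_from_load_table(sample_response))
```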
Nessie server type (docker/uber-jar/built from source) and version
N/A
Client type (Ex: UI/Spark/pynessie ...) and version
No response
Additional information
No response
Yes, I agree.
@dimas-b my suggestion is to remove any "hardcoded" or default configurations on Nessie, leaving it to the admin (@guitcastro FYI) to configure those settings.
I think the initial configs were added due to:
- https://github.com/projectnessie/nessie/issues/9318
We then discussed the consequences of these default values here:
- https://project-nessie.zulipchat.com/#narrow/channel/371187-general/topic/How.20to.20use.20PyarrowFileIO.20on.20pyiceberg.3F
This new feature partially solves the problem:
- https://github.com/projectnessie/nessie/pull/10296
But in general I prefer Nessie to be as agnostic as possible about these configurations.
Unfortunately with the current implementation of pyiceberg (python) / iceberg (java) and in general on how the Catalog API has been designed, configs returned by the server always overwrite client's configs.
Yea - it's not great that there's no distinction between "defaults" and "overrides" at that level.
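To make the precedence problem concrete, here is a simplified model of the merge, assuming the client overlays the server-returned config on top of its own catalog properties. The function name `effective_properties` is hypothetical; this is a sketch of the behavior described above, not pyiceberg's actual code.

```python
from typing import Dict


def effective_properties(client: Dict[str, str], server_config: Dict[str, str]) -> Dict[str, str]:
    """Merge properties so that keys returned by the server win over the client's."""
    # Later mappings take precedence in a dict merge, so there is no way
    # for the client to mark one of its own values as "final".
    return {**client, **server_config}


client_props = {
    "py-io-impl": "pyiceberg.io.pyarrow.PyArrowFileIO",
    "s3.region": "us-east-1",
}
load_table_config = {"py-io-impl": "pyiceberg.io.fsspec.FsspecFileIO"}

merged = effective_properties(client_props, load_table_config)
print(merged["py-io-impl"])  # the server's choice, not the client's
```

Keys the server does not send (here `s3.region`) survive untouched, which is why the problem only shows up for properties the catalog chooses to return.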
What we could do in Nessie is to prefer the properties that have been explicitly set as table/view-properties. WDYT?
It may not be the way to go unless I can override them or "blacklist" some of those configs.
Let's suppose an EMR cluster writes a table using Nessie/Iceberg. That table will have the usual Java-related properties. I, as a Python consumer that uses pyiceberg, can read that table without any interference, because my local pyiceberg will use arrow. That's great.
Then we have another producer that writes a table (on the same catalog) using s3fs. The producer could push its properties into the table metadata, and it could set the py-io-impl to s3fs...
In this case, my consumer (another Python microservice, but this time one that has only arrow and not s3fs) will fail to read that table, because the metadata will carry the information about which impl must be used, but I don't have that dep on my system.
The catalog may "override" that configuration, but still it will force all the consumers to use a specific implementation, which is wrong from my pov.
From an iceberg definition standpoint, I still believe this is an imperfect solution because the consumer must have the last word (always); I see why it is happening, 99% of the iceberg consumers use java, but if we want to have an open standard, we need to be open to different implementations.
IDK. Maybe the solution is to have an allowed and disallowed list of properties to propagate to the clients/consumers and for masking them at the catalog level.
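One way the allowed/disallowed idea could look, as a rough sketch only. The function `mask_config` and its parameters are hypothetical and do not correspond to any existing Nessie or Iceberg setting; the point is just that masking can happen before the config reaches the client.

```python
from typing import Dict, Iterable, Optional


def mask_config(
    config: Dict[str, str],
    allowed: Optional[Iterable[str]] = None,
    disallowed: Iterable[str] = (),
) -> Dict[str, str]:
    """Return only the properties that may be propagated to clients.

    If `allowed` is given, only those keys pass; `disallowed` keys are
    always dropped, even if they also appear in `allowed`.
    """
    allow = None if allowed is None else set(allowed)
    deny = set(disallowed)
    return {
        key: value
        for key, value in config.items()
        if key not in deny and (allow is None or key in allow)
    }


table_config = {
    "py-io-impl": "pyiceberg.io.fsspec.FsspecFileIO",
    "s3.region": "us-east-1",
}

# Drop the FileIO choice so each consumer keeps the last word on it.
print(mask_config(table_config, disallowed=["py-io-impl"]))
```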
It could "save" Nessie, but yeah @kevinjqliu, this is an iceberg problem :)
Love :)
Setting such a default leads to a really painful debugging experience.
I've noticed a significant performance difference between catalog implementations when doing a simple table scan (like a 10x difference).
It turned out to be exclusively due to the use of different io impls for s3.
The one Nessie sets as the default is the less performant one.
It is configurable AFAIK... something like nessie.catalog.service.s3.default-options.table-config-overrides.py-io-impl=pyiceberg.io.pyarrow.PyArrowFileIO
It is, but it took some time to discover that it exists in the first place :)
It's mentioned in the release notes for 0.102.3 :)
https://github.com/projectnessie/nessie/releases/tag/nessie-0.102.3 https://github.com/projectnessie/nessie/blob/main/CHANGELOG.md