iceberg-rust
iceberg-rust copied to clipboard
Make current-snapshot-id optional while maintaining backwards compatibility
Resolves https://github.com/apache/iceberg-rust/issues/352
Problem: Previous versions of Java (<1.4.0) implementations incorrectly assume the optional attribute current-snapshot-id to be a required attribute in TableMetadata.
Solution: Use legacy-current-snapshot-id environment variable to force iceberg-rust to create and load a table with a metadata file compatible with the older versions.
Testing: Added new unit tests.
Future work:
- https://github.com/apache/iceberg-rust/issues/375: Implement a config file for iceberg-rust and provide users an additional option to set
legacy-current-snapshot-idas a configuration value.
@Fokko, these changes make current-snapshot-id optional and uses a flag to maintain backwards compatibility. Let me know what you think.
Please note that, with the flag on we can create and load a table with -1 as the current-snapshot-id. This is different from pyiceberg which I believe only supports creating a table.
Also, do we have a preference on where we want to document the "backwards compatibility" section?
Sorry for the late reply as I was touching grass.
We're trying to solve two problems:
- Don't produce
-1since it is erroneous unless there is a snapshot with the ID-1. You can still create a table with-1as the current snapshot ID if you set the flag. If you use Java<1.4.0within your organization, this might be required to be able to read the tables. - When deserializing, and we encounter a -1 as the current snapshot ID, we should convert it into a None.
cc @Xuanwo @Fokko @sdd @marvinlanhenke What do you think?
One problem with doing it through an env var is that it applies to every table you hit in your service. I think it would be better if it was a config param so that you can configure it per table.
@liurenjie1024 @sdd, sure, I can work on https://github.com/apache/iceberg-rust/issues/375 and use a config file to set the value. Let me know if you have a different approach in mind, thanks!
@liurenjie1024 @sdd, sure, I can work on #375 and use a config file to set the value. Let me know if you have a different approach in mind, thanks!
@s-akhtar-baig I think the first step maybe to add a config for the serializer/deseriazer of table metadata to control this behavior. Config file is just one approach to init the config, we should decouple these two things.
@liurenjie1024, I have pushed some changes to have config values for TableMetadata in one place. Please let me know if the direction is right and/or if you had a different idea in mind. I will handle reviews on the tests once I have your feedback on the config changes.
For now, I am using the environment to load these values but future work involves loading from a config file on top of that.
@liurenjie1024, I have pushed some changes to have config values for TableMetadata in one place. Please let me know if the direction is right and/or if you had a different idea in mind. I will handle reviews on the tests once I have your feedback on the config changes.
For now, I am using the environment to load these values but future work involves loading from a config file on top of that.
Hi, @s-akhtar-baig Thanks for the contribution. I have little concern about current approach because it seems not extensible to me. How about this:
pub struct TableMetadataParser {
use_legacy_id: bool,
}
impl TableMetadataParser {
pub async fn write(&self, output_file: OutputFile, table_metadata: &TableMetadata) {
....
}
pub async fn read(&self, input_file: InputFile) -> Result<TableMetadata> {
....
}
}
Then instead of directly ser/de table metadata, we will control behavior using TableMetadataParser, what do you think?
@liurenjie1024, I see. It makes sense to me and I will make changes accordingly. Thank you for adding your feedback and providing sample code!