iceberg-rust icon indicating copy to clipboard operation
iceberg-rust copied to clipboard

Make current-snapshot-id optional while maintaining backwards compatibility

Open s-akhtar-baig opened this issue 1 year ago • 10 comments

Resolves https://github.com/apache/iceberg-rust/issues/352

Problem: Previous versions of Java (<1.4.0) implementations incorrectly assume the optional attribute current-snapshot-id to be a required attribute in TableMetadata.

Solution: Use legacy-current-snapshot-id environment variable to force iceberg-rust to create and load a table with a metadata file compatible with the older versions.

Testing: Added new unit tests.

Future work:

  • https://github.com/apache/iceberg-rust/issues/375: Implement a config file for iceberg-rust and provide users an additional option to set legacy-current-snapshot-id as a configuration value.

s-akhtar-baig avatar May 15 '24 18:05 s-akhtar-baig

@Fokko, these changes make current-snapshot-id optional and uses a flag to maintain backwards compatibility. Let me know what you think.

Please note that, with the flag on we can create and load a table with -1 as the current-snapshot-id. This is different from pyiceberg which I believe only supports creating a table.

s-akhtar-baig avatar May 15 '24 18:05 s-akhtar-baig

Also, do we have a preference on where we want to document the "backwards compatibility" section?

s-akhtar-baig avatar May 15 '24 18:05 s-akhtar-baig

Sorry for the late reply as I was touching grass.

We're trying to solve two problems:

  • Don't produce -1 since it is erroneous unless there is a snapshot with the ID -1. You can still create a table with -1 as the current snapshot ID if you set the flag. If you use Java <1.4.0 within your organization, this might be required to be able to read the tables.
  • When deserializing, and we encounter a -1 as the current snapshot ID, we should convert it into a None.

Fokko avatar May 23 '24 07:05 Fokko

cc @Xuanwo @Fokko @sdd @marvinlanhenke What do you think?

liurenjie1024 avatar May 27 '24 14:05 liurenjie1024

One problem with doing it through an env var is that it applies to every table you hit in your service. I think it would be better if it was a config param so that you can configure it per table.

sdd avatar May 27 '24 18:05 sdd

@liurenjie1024 @sdd, sure, I can work on https://github.com/apache/iceberg-rust/issues/375 and use a config file to set the value. Let me know if you have a different approach in mind, thanks!

s-akhtar-baig avatar May 31 '24 16:05 s-akhtar-baig

@liurenjie1024 @sdd, sure, I can work on #375 and use a config file to set the value. Let me know if you have a different approach in mind, thanks!

@s-akhtar-baig I think the first step maybe to add a config for the serializer/deseriazer of table metadata to control this behavior. Config file is just one approach to init the config, we should decouple these two things.

liurenjie1024 avatar Jun 01 '24 03:06 liurenjie1024

@liurenjie1024, I have pushed some changes to have config values for TableMetadata in one place. Please let me know if the direction is right and/or if you had a different idea in mind. I will handle reviews on the tests once I have your feedback on the config changes.

For now, I am using the environment to load these values but future work involves loading from a config file on top of that.

s-akhtar-baig avatar Jun 19 '24 17:06 s-akhtar-baig

@liurenjie1024, I have pushed some changes to have config values for TableMetadata in one place. Please let me know if the direction is right and/or if you had a different idea in mind. I will handle reviews on the tests once I have your feedback on the config changes.

For now, I am using the environment to load these values but future work involves loading from a config file on top of that.

Hi, @s-akhtar-baig Thanks for the contribution. I have little concern about current approach because it seems not extensible to me. How about this:


pub struct TableMetadataParser {
   use_legacy_id: bool,
}

impl TableMetadataParser {
  pub async fn write(&self, output_file: OutputFile, table_metadata: &TableMetadata) {
   ....
 }

pub async fn read(&self, input_file: InputFile) -> Result<TableMetadata> {
  ....
}
}

Then instead of directly ser/de table metadata, we will control behavior using TableMetadataParser, what do you think?

liurenjie1024 avatar Jun 26 '24 08:06 liurenjie1024

@liurenjie1024, I see. It makes sense to me and I will make changes accordingly. Thank you for adding your feedback and providing sample code!

s-akhtar-baig avatar Jun 26 '24 19:06 s-akhtar-baig