AWS: Add table level S3 tags
This change adds table-level S3 tags to objects written through S3FileIO. Users can pass these opt-in catalog properties to tag the objects in S3:
Spark SQL launch command:
spark-sql --conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.my_catalog.warehouse=s3://<bucket>/s3-tagging \
--conf spark.sql.catalog.my_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog \
--conf spark.sql.catalog.my_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
--conf spark.sql.catalog.my_catalog.s3.write.table-name-tag-enabled=true \
--conf spark.sql.catalog.my_catalog.s3.write.namespace-name-tag-enabled=true
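As a rough illustration of how the two opt-in properties relate to the tags on each object, here is a minimal Python sketch. It is not the Iceberg implementation (which is Java); the function name `build_table_tags` and the "absent means disabled" default are assumptions made for the example. Only the property names and tag keys are taken from this PR.

```python
# Illustrative sketch only; not the actual S3FileIO code.
# Property names and tag keys come from this PR; everything else is hypothetical.
TABLE_TAG_PROP = "s3.write.table-name-tag-enabled"
NAMESPACE_TAG_PROP = "s3.write.namespace-name-tag-enabled"


def _enabled(properties, key):
    # Assumption: a property that is absent counts as disabled (opt-in).
    return str(properties.get(key, "false")).strip().lower() == "true"


def build_table_tags(properties, namespace, table_name):
    """Return the tags an opt-in writer would attach to each S3 object."""
    tags = {}
    if _enabled(properties, TABLE_TAG_PROP):
        tags["iceberg.table-name"] = table_name
    if _enabled(properties, NAMESPACE_TAG_PROP):
        tags["iceberg.namespace-name"] = namespace
    return tags
```

With both properties set to `true`, the sketch yields one tag per property; with neither set, it yields no tags, which matches the opt-in behaviour described above.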
Tags added in S3:
aws s3api get-object-tagging --bucket <bucket> --key s3-tagging/data/00000-0-a81d170b-9d71-4b45-886a-390088b2513b-00001.parquet
{
    "TagSet": [
        {
            "Key": "iceberg.table-name",
            "Value": "test_table_tag"
        },
        {
            "Key": "iceberg.namespace-name",
            "Value": "test_db"
        }
    ]
}
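For anyone who wants to check this output in a script rather than by eye, the following is a small Python sketch that parses the JSON printed by `aws s3api get-object-tagging` and reports any expected tag that is missing or has a different value. The helper names `tags_from_response` and `missing_tags` are hypothetical. It only inspects text that has already been captured and makes no AWS calls itself.

```python
import json


def tags_from_response(response_text):
    """Flatten the TagSet list of a get-object-tagging response into a dict."""
    doc = json.loads(response_text)
    return {tag["Key"]: tag["Value"] for tag in doc.get("TagSet", [])}


def missing_tags(response_text, expected):
    """Return the expected tags that are absent or carry a different value."""
    actual = tags_from_response(response_text)
    return {key: value for key, value in expected.items() if actual.get(key) != value}
```

An empty result from `missing_tags` means every expected tag was found with the expected value. A real integration test would still need to write a file through S3FileIO and fetch its tags from S3.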
cc: @jackye1995 @arminnajafi @singhpk234 @amogh-jahagirdar @xiaoxuandev @yyanyy
I think it would be good to discuss what the use case is for this. I can understand trying to annotate some additional information but I'm not sure that this tracks very well with the evolution of tables. If you rename a table now all of the objects are out of sync. This also isn't particularly helpful for recovering table data because you actually don't know based on the file what manifests they belong to and that of course can change as well.
Because Iceberg metadata tracks everything from the metadata file all the way down to the physical file, I can't think of a reason to do this.
I think it would be good to discuss what the use case is for this. I can understand trying to annotate some additional information but I'm not sure that this tracks very well with the evolution of tables. If you rename a table now all of the objects are out of sync.
Yes, only the table UUID tag would stay in sync in this scenario. Deferring to @jackye1995 on the use case.
Thanks for updating this. Could you also add an integration test to verify that files written to a table are properly tagged in S3?
cc @singhpk234 @amogh-jahagirdar for any comments
One piece is missing from this functionality: if the user renames the table, we should also re-tag the objects. But that could be a follow-up PR, and we can first merge the core feature in this PR.
Thanks for finishing the work, @rajarshisarkar, and thanks @amogh-jahagirdar and @singhpk234 for the reviews!