iceberg AWS: Add table level S3 tags

This change adds table-level S3 tags to objects written through S3FileIO. Users can opt in to tagging by passing the catalog properties shown below:

Spark SQL launch command:

spark-sql --conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \
    --conf spark.sql.catalog.my_catalog.warehouse=s3://<bucket>/s3-tagging \
    --conf spark.sql.catalog.my_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog \
    --conf spark.sql.catalog.my_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
    --conf spark.sql.catalog.my_catalog.s3.write.table-name-tag-enabled=true \
    --conf spark.sql.catalog.my_catalog.s3.write.namespace-name-tag-enabled=true

Tags added in S3:

aws s3api get-object-tagging --bucket <bucket> --key s3-tagging/data/00000-0-a81d170b-9d71-4b45-886a-390088b2513b-00001.parquet
{
    "TagSet": [
        {
            "Key": "iceberg.table-name",
            "Value": "test_table_tag"
        },
        {
            "Key": "iceberg.namespace-name",
            "Value": "test_db"
        }
    ]
}
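To make the mapping from catalog properties to tags concrete, here is a minimal Python sketch. The property names and tag keys are copied from the example above; the function `build_table_tags` is a hypothetical illustration and is not the S3FileIO implementation, which lives in the Java code of this PR.

```python
# Property names and tag keys come from the example in this description.
# build_table_tags is a hypothetical helper, not Iceberg's actual code.
TABLE_TAG_PROP = "s3.write.table-name-tag-enabled"
NAMESPACE_TAG_PROP = "s3.write.namespace-name-tag-enabled"


def build_table_tags(namespace, table, props):
    """Return an S3-style TagSet containing only the tags that were opted into."""

    def enabled(key):
        # Catalog properties are strings; a missing property means disabled.
        return props.get(key, "false").strip().lower() == "true"

    tags = []
    if enabled(TABLE_TAG_PROP):
        tags.append({"Key": "iceberg.table-name", "Value": table})
    if enabled(NAMESPACE_TAG_PROP):
        tags.append({"Key": "iceberg.namespace-name", "Value": namespace})
    return tags


if __name__ == "__main__":
    props = {TABLE_TAG_PROP: "true", NAMESPACE_TAG_PROP: "true"}
    print(build_table_tags("test_db", "test_table_tag", props))
```

With both properties set to `true`, the result has the same shape as the `TagSet` shown above; with neither set, no tags are produced.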

cc: @jackye1995 @arminnajafi @singhpk234 @amogh-jahagirdar @xiaoxuandev @yyanyy

Mar 25 '22 10:03 rajarshisarkar

I think it would be good to discuss what the use case is for this. I can understand trying to annotate some additional information, but I'm not sure that this tracks very well with the evolution of tables. If you rename a table, all of the objects are now out of sync. This also isn't particularly helpful for recovering table data, because you can't tell from a file which manifests it belongs to, and that of course can change as well.

Because Iceberg metadata tracks everything from the metadata file all the way down to the physical file, I can't think of a reason to do this.

Mar 27 '22 17:03 danielcweeks

> I think it would be good to discuss what the use case is for this. I can understand trying to annotate some additional information but I'm not sure that this tracks very well with the evolution of tables. If you rename a table now all of the objects are out of sync.

Yes, only the table UUID tag would stay in sync in this scenario. Deferring to @jackye1995 on the use case.

Apr 04 '22 11:04 rajarshisarkar

Thanks for updating this. Could you also add an integration test to verify that files written to a table are properly tagged in S3?
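As a rough illustration of the check such a test could make, the Python sketch below compares a tagging response against the expected tags. It assumes the response has the `TagSet` shape shown in the PR description (the shape `get-object-tagging` returns); the helper name `tag_mismatches` is hypothetical, and the actual integration test would be written in Java against a real bucket.

```python
def tag_mismatches(tagging_response, expected):
    """Return {key: (expected_value, actual_value)} for every tag that differs.

    tagging_response is a dict shaped like the get-object-tagging output;
    an absent tag is reported with an actual value of None.
    """
    actual = {t["Key"]: t["Value"] for t in tagging_response.get("TagSet", [])}
    return {
        key: (value, actual.get(key))
        for key, value in expected.items()
        if actual.get(key) != value
    }


# Sample response copied from the PR description.
response = {
    "TagSet": [
        {"Key": "iceberg.table-name", "Value": "test_table_tag"},
        {"Key": "iceberg.namespace-name", "Value": "test_db"},
    ]
}
expected = {
    "iceberg.table-name": "test_table_tag",
    "iceberg.namespace-name": "test_db",
}

# An empty dict means every expected tag is present with the right value.
assert tag_mismatches(response, expected) == {}
```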

Sep 23 '22 16:09 jackye1995

cc @singhpk234 @amogh-jahagirdar for any comments

Sep 28 '22 05:09 jackye1995

There is one piece missing from this functionality: if the user renames the table, we should also re-tag the existing objects. But that could be a follow-up PR, and we can merge the core feature in this PR first.
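A possible shape for that follow-up, sketched in Python under stated assumptions: S3's PutObjectTagging replaces an object's whole tag set, so a re-tag has to carry unrelated tags over unchanged. The helper `retag_for_rename` is hypothetical and only computes the new tag set; listing the table's objects and writing the tags back is left out.

```python
def retag_for_rename(existing_tagset, new_namespace, new_table):
    """Return a new TagSet with the Iceberg name tags updated for a rename.

    Only tags already present are rewritten, so a tag the user never
    opted into is not introduced. All other tags are copied unchanged.
    """
    updates = {
        "iceberg.table-name": new_table,
        "iceberg.namespace-name": new_namespace,
    }
    result = []
    for tag in existing_tagset:
        key = tag["Key"]
        if key in updates:
            result.append({"Key": key, "Value": updates[key]})
        else:
            result.append(dict(tag))  # copy so the input is not mutated
    return result


if __name__ == "__main__":
    old = [
        {"Key": "iceberg.table-name", "Value": "test_table_tag"},
        {"Key": "iceberg.namespace-name", "Value": "test_db"},
        {"Key": "team", "Value": "analytics"},  # hypothetical unrelated tag
    ]
    print(retag_for_rename(old, "test_db", "renamed_table"))
```

Rewriting every object of a large table this way means one tagging call per object, which is one reason to treat it as separate work.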

Sep 28 '22 15:09 jackye1995

Thanks for finishing the work, @rajarshisarkar, and thanks to @amogh-jahagirdar and @singhpk234 for the reviews!

Sep 28 '22 15:09 jackye1995
