
AWS: Add table level S3 tags

Open rajarshisarkar opened this issue 3 years ago • 3 comments

This change adds table-level S3 tags to objects written through S3FileIO. Users can pass these opt-in catalog properties to tag the objects in S3:

Spark SQL launch command:

spark-sql --conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \
    --conf spark.sql.catalog.my_catalog.warehouse=s3://<bucket>/s3-tagging \
    --conf spark.sql.catalog.my_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog \
    --conf spark.sql.catalog.my_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
    --conf spark.sql.catalog.my_catalog.s3.write.table-name-tag-enabled=true \
    --conf spark.sql.catalog.my_catalog.s3.write.namespace-name-tag-enabled=true

Tags added in S3:

aws s3api get-object-tagging --bucket <bucket> --key s3-tagging/data/00000-0-a81d170b-9d71-4b45-886a-390088b2513b-00001.parquet
{
    "TagSet": [
        {
            "Key": "iceberg.table-name",
            "Value": "test_table_tag"
        },
        {
            "Key": "iceberg.namespace-name",
            "Value": "test_db"
        }
    ]
}
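As a minimal sketch of how this output could be checked in a script, the snippet below parses the JSON shown above and compares the two tags against the expected namespace and table. The function names `tags_from_response` and `has_table_tags` are hypothetical and are not part of Iceberg or the AWS CLI. The sample text is copied from the output above.

```python
import json

# Tag keys as they appear in the get-object-tagging output above.
TABLE_TAG = "iceberg.table-name"
NAMESPACE_TAG = "iceberg.namespace-name"

# Sample response copied from the output above.
SAMPLE = """
{
    "TagSet": [
        {"Key": "iceberg.table-name", "Value": "test_table_tag"},
        {"Key": "iceberg.namespace-name", "Value": "test_db"}
    ]
}
"""


def tags_from_response(response_text):
    """Turn get-object-tagging JSON output into a plain {key: value} dict."""
    tag_set = json.loads(response_text).get("TagSet", [])
    return {tag["Key"]: tag["Value"] for tag in tag_set}


def has_table_tags(response_text, namespace, table):
    """Return True if both Iceberg tags are present with the expected values."""
    tags = tags_from_response(response_text)
    return tags.get(TABLE_TAG) == table and tags.get(NAMESPACE_TAG) == namespace


if __name__ == "__main__":
    print(has_table_tags(SAMPLE, "test_db", "test_table_tag"))
```

In practice the response text would come from running the `aws s3api get-object-tagging` command for each data file instead of the hard-coded sample.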

cc: @jackye1995 @arminnajafi @singhpk234 @amogh-jahagirdar @xiaoxuandev @yyanyy

rajarshisarkar avatar Mar 25 '22 10:03 rajarshisarkar

I think it would be good to discuss what the use case is for this. I can understand wanting to annotate objects with additional information, but I'm not sure this tracks well with the evolution of tables. If you rename a table, all of the objects are now out of sync. This also isn't particularly helpful for recovering table data, because you can't tell from a file which manifests it belongs to, and that can of course change as well.

Because Iceberg metadata tracks everything from the metadata file all the way down to the physical file, I can't think of a reason to do this.

danielcweeks avatar Mar 27 '22 17:03 danielcweeks

I think it would be good to discuss what the use case is for this. I can understand wanting to annotate objects with additional information, but I'm not sure this tracks well with the evolution of tables. If you rename a table, all of the objects are now out of sync.

Yes, only the table UUID tag would stay in sync in this scenario. @jackye1995, could you comment on the use case?

rajarshisarkar avatar Apr 04 '22 11:04 rajarshisarkar

Thanks for updating this. Could you also add an integration test to verify that files written to a table are properly tagged in S3?

jackye1995 avatar Sep 23 '22 16:09 jackye1995

cc @singhpk234 @amogh-jahagirdar for any comments

jackye1995 avatar Sep 28 '22 05:09 jackye1995

There is one piece missing from this functionality: if the user renames the table, we should also re-tag the objects. But this could be a follow-up PR, and we can first merge the core feature in this PR.
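A minimal sketch of what the re-tag step would need to compute for each object, assuming the tag keys shown in the PR description. The function `retag_for_rename` is hypothetical and is not part of this PR. It returns the complete tag set because S3's PutObjectTagging replaces all tags on an object, so unrelated tags must be carried over.

```python
def retag_for_rename(tag_set, new_namespace, new_table):
    """Return a new TagSet with the Iceberg name tags updated.

    Only tags already present are rewritten, so a table written with just
    one of the opt-in properties enabled does not gain the other tag.
    """
    updates = {
        "iceberg.namespace-name": new_namespace,
        "iceberg.table-name": new_table,
    }
    return [
        {"Key": tag["Key"], "Value": updates.get(tag["Key"], tag["Value"])}
        for tag in tag_set
    ]


if __name__ == "__main__":
    old_tags = [
        {"Key": "iceberg.table-name", "Value": "test_table_tag"},
        {"Key": "iceberg.namespace-name", "Value": "test_db"},
    ]
    # "renamed_table" is an illustrative name, not taken from the PR.
    print(retag_for_rename(old_tags, "test_db", "renamed_table"))
```

Applying the result would still require one tagging request per object in the table, which is part of why this is better suited to a follow-up.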

jackye1995 avatar Sep 28 '22 15:09 jackye1995

Thanks for finishing the work, @rajarshisarkar! And thanks @amogh-jahagirdar and @singhpk234 for the reviews!

jackye1995 avatar Sep 28 '22 15:09 jackye1995