Update Iceberg metadata in case of DR
Query engine
Spark
Question
Let's say we have a DR situation where we would like to bring up the Iceberg metadata and data that were copied to the DR location. Since S3 bucket names share a global namespace, the DR location will have different bucket names. How can we rewrite the metadata so that it points to the correct DR S3 location? Is there a utility or Spark procedure for this?
Just saw this thread: https://github.com/apache/iceberg/issues/1617
Can we use the migrate_table procedure for this, specifying the S3 path that points to the destination location?
@asheeshgarg, would S3 access points for Iceberg work for your use case?
@singhpk234 S3 access points are still region-specific. Access point ARNs use the format arn:aws:s3:region:account-id:accesspoint/resource
When we enable
--conf spark.sql.catalog.test.s3.access-points.my-bucket1=arn:aws:s3::123456789012:accesspoint:mfzwi23gnjvgw.mrap
--conf spark.sql.catalog.test.s3.access-points.my-bucket2=arn:aws:s3::123456789012:accesspoint:mfzwi23gnjvgw.mrap
Does it use my-bucket1 when writing paths into the metadata, so that we can map it to a specific bucket in case of DR?
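As a side note on the two ARN shapes in this comment: the regional format quoted above carries a region field, while the ARN in the config lines leaves that field empty. A minimal sketch, assuming the usual colon-separated ARN layout, that tells the two apart. The helper names and the regional example ARN are hypothetical, and treating an empty region as "multi-region" is a heuristic for illustration only.

```python
def access_point_region(arn: str) -> str:
    """Return the region field of an S3 ARN ('' when the field is empty)."""
    parts = arn.split(":")
    # Assumed layout: arn:partition:service:region:account-id:resource...
    if len(parts) < 6 or parts[0] != "arn" or parts[2] != "s3":
        raise ValueError(f"not an S3 ARN: {arn!r}")
    return parts[3]


def looks_multi_region(arn: str) -> bool:
    """Heuristic: an empty region field suggests a multi-region access point."""
    return access_point_region(arn) == ""


# Hypothetical regional access point ARN in the format quoted above.
regional = "arn:aws:s3:us-east-1:123456789012:accesspoint/my-ap"
# ARN taken from the config lines in this comment.
multi = "arn:aws:s3::123456789012:accesspoint:mfzwi23gnjvgw.mrap"

print(access_point_region(regional), looks_multi_region(regional))
print(repr(access_point_region(multi)), looks_multi_region(multi))
```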
@asheeshgarg The metadata files will still point to my-bucket1 (the actual S3 path), but when Iceberg makes an S3 request (GET or PUT), the my-bucket1 bucket is replaced by the access point. The access point then routes the request to the best available low-latency bucket behind it, with replication across the configured buckets handled on the S3 side.
@singhpk234 so just to understand it correctly, we will define two buckets for cross-region access:
--conf spark.sql.catalog.test.s3.access-points.my-bucket1=arn:aws-region1
--conf spark.sql.catalog.test.s3.access-points.my-bucket2=arn:aws:s3-region2
and Iceberg takes care of replicating it across regions.
In region1, the metadata will have mybucket1 as the actual S3 pointer.
In region2, the metadata will have mybucket2 as the actual S3 pointer.
Then we just need to start the metastore in the new region and it will work. Is this understanding correct?
Yes, if you map both buckets (located in different regions) to a multi-region access point.
You can refer to this Slack thread as well, where this idea originated: https://apache-iceberg.slack.com/archives/C025PH0G1D4/p1645066803099319
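To make the mapping concrete, here is a minimal sketch that builds the `--conf` flags so that both buckets point at the same multi-region access point. The property pattern `spark.sql.catalog.<catalog>.s3.access-points.<bucket>` and the ARN are taken from the config lines earlier in this thread; the helper name `access_point_confs` is hypothetical.

```python
def access_point_confs(catalog: str, buckets: list[str], arn: str) -> list[str]:
    """Build spark-submit flags mapping every bucket to one access point ARN."""
    flags = []
    for bucket in buckets:
        key = f"spark.sql.catalog.{catalog}.s3.access-points.{bucket}"
        flags += ["--conf", f"{key}={arn}"]
    return flags


# ARN and catalog name as used earlier in the thread.
mrap_arn = "arn:aws:s3::123456789012:accesspoint:mfzwi23gnjvgw.mrap"
flags = access_point_confs("test", ["my-bucket1", "my-bucket2"], mrap_arn)
print(" ".join(flags))
```

The resulting list can be appended to a spark-submit or spark-sql command line in either region.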
"In region1, the metadata will have mybucket1 as the actual S3 pointer. In region2, the metadata will have mybucket2 as the actual S3 pointer."
No. Let's say your table path is under mybucket1; then both mybucket1 in region1 and mybucket2 in region2 will contain mybucket1 paths inside the metadata files. It is only at the time of the S3 (GET / PUT) call that we replace the mybucket1 reference with the multi-region access point.
If you use a multi-region access point pointing to mybucket1 and mybucket2, it acts as a proxy and a single global hostname for the two, and internally routes each request to the location with the lowest latency.
More about access points here: https://aws.amazon.com/s3/features/multi-region-access-points/
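The substitution described above can be illustrated with a small sketch. This is not Iceberg's actual implementation, only a model of the behavior as described in this thread: the path stored in metadata never changes, and only the bucket used for the request is swapped when a mapping exists. The function name `request_target` and the example table path are hypothetical.

```python
from urllib.parse import urlparse


def request_target(s3_uri: str, access_points: dict[str, str]) -> tuple[str, str]:
    """Return (bucket-or-access-point, key) to use for the S3 call.

    The URI stored in the metadata is left untouched; only the bucket used
    for the request is replaced when it has an access point mapping.
    """
    parsed = urlparse(s3_uri)
    if parsed.scheme not in ("s3", "s3a", "s3n"):
        raise ValueError(f"not an S3 URI: {s3_uri!r}")
    bucket = parsed.netloc
    key = parsed.path.lstrip("/")
    return access_points.get(bucket, bucket), key


mrap_arn = "arn:aws:s3::123456789012:accesspoint:mfzwi23gnjvgw.mrap"
mapping = {"mybucket1": mrap_arn}

# Hypothetical path as it would appear inside a metadata file.
stored = "s3://mybucket1/warehouse/db/tbl/metadata/v1.metadata.json"
target, key = request_target(stored, mapping)
print(target)  # the access point ARN, not mybucket1
print(key)     # warehouse/db/tbl/metadata/v1.metadata.json
```

Because the stored path is the same in both regions, the same mapping works unchanged wherever the catalog is brought up.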