Update Iceberg metadata in case of DR

Open asheeshgarg opened this issue 3 years ago • 7 comments

Query engine

Spark

Question

Let's say we have a DR situation where we would like to bring up the Iceberg metadata and data that were copied to a DR location. Since S3 bucket names share a global namespace, the buckets in the DR location will have different names. How do we rewrite the metadata so that it points to the correct DR S3 location? Is there a utility or Spark procedure for this?

asheeshgarg avatar Sep 16 '22 19:09 asheeshgarg

Just saw this thread: https://github.com/apache/iceberg/issues/1617

asheeshgarg avatar Sep 16 '22 19:09 asheeshgarg

Can we use the migrate_table procedure for this, specifying the S3 path that points to the destination location?

asheeshgarg avatar Sep 19 '22 15:09 asheeshgarg

@asheeshgarg, would S3 access points for Iceberg work for your use case?

singhpk234 avatar Sep 19 '22 16:09 singhpk234

@singhpk234 S3 access points are still region-specific. Access point ARNs use the format arn:aws:s3:region:account-id:accesspoint/resource. When we enable

--conf spark.sql.catalog.test.s3.access-points.my-bucket1=arn:aws:s3::123456789012:accesspoint:mfzwi23gnjvgw.mrap
--conf spark.sql.catalog.test.s3.access-points.my-bucket2=arn:aws:s3::123456789012:accesspoint:mfzwi23gnjvgw.mrap

does Iceberg use my-bucket1 when writing paths into the metadata, so that we can map it to a specific bucket in case of DR?

asheeshgarg avatar Sep 19 '22 17:09 asheeshgarg

@asheeshgarg

The metadata files will still point to my-bucket1 (the actual S3 path), but when Iceberg makes S3 requests (GET and PUT), the my-bucket1 reference is replaced by the access point. The access point then takes care of replication across the configured buckets and chooses the best available low-latency bucket behind it.
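To make the substitution concrete, here is a minimal Python sketch of the idea described above. It is not Iceberg's actual implementation; the function name `resolve_request_target`, the `ACCESS_POINTS` mapping, and the example table path are hypothetical and only illustrate that the stored location keeps the original bucket while the request target is swapped at call time.

```python
from urllib.parse import urlparse

# Hypothetical mapping, mirroring the s3.access-points.<bucket> catalog
# properties quoted earlier in this thread.
ACCESS_POINTS = {
    "my-bucket1": "arn:aws:s3::123456789012:accesspoint:mfzwi23gnjvgw.mrap",
    "my-bucket2": "arn:aws:s3::123456789012:accesspoint:mfzwi23gnjvgw.mrap",
}


def resolve_request_target(location, access_points):
    """Return the (bucket_or_arn, key) pair a request would use.

    The location stored in metadata is left untouched; only the bucket
    used for the request is swapped when a mapping exists.
    """
    parsed = urlparse(location)
    bucket = parsed.netloc
    key = parsed.path.lstrip("/")
    # Fall back to the original bucket when no access point is configured.
    return access_points.get(bucket, bucket), key


# Example path (assumed for illustration): metadata still references my-bucket1.
stored = "s3://my-bucket1/warehouse/db/tbl/metadata/v1.metadata.json"
target, key = resolve_request_target(stored, ACCESS_POINTS)
print(target, key)
```

Because the lookup is keyed on the bucket name found in the metadata, the metadata files themselves never need to be rewritten in this model.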

singhpk234 avatar Sep 19 '22 17:09 singhpk234

@singhpk234 So, just to understand it correctly: we define two buckets for cross-region use,

--conf spark.sql.catalog.test.s3.access-points.my-bucket1=arn:aws-region1
--conf spark.sql.catalog.test.s3.access-points.my-bucket2=arn:aws:s3-region2

and Iceberg takes care of replicating across regions. Region1 metadata will be replaced by the actual S3 pointer to mybucket1 in the metadata, Region2 metadata will be replaced by the actual S3 pointer to mybucket2 in the metadata, and we just need to start the metastore in the new region and it will work. Is this understanding correct?

asheeshgarg avatar Sep 19 '22 18:09 asheeshgarg

Yes, if you map both buckets (located in different regions) to a multi-region access point.

You can also refer to this Slack thread, where the idea originated: https://apache-iceberg.slack.com/archives/C025PH0G1D4/p1645066803099319

"Region1 metadata will be replaced by the actual S3 pointer to mybucket1 in the metadata, Region2 metadata will be replaced by the actual S3 pointer to mybucket2 in the metadata"

No. Let's say your table path is under mybucket1: then both mybucket1 in region1 and mybucket2 in region2 will contain mybucket1 paths inside the metadata files. It is only at the time of the S3 call (GET / PUT) that we replace the mybucket1 reference with the multi-region access point.

Now, if you use a multi-region access point pointing to mybucket1 and mybucket2, it acts as a proxy and a single global hostname for the two, and internally routes each request to the location with the lowest latency.
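As a rough illustration of that setup, the following Python sketch builds the spark-submit arguments that map both buckets to one multi-region access point. The helper `access_point_conf` is hypothetical; the catalog name `test` and the ARN are the example values quoted earlier in this thread, and the property key follows the `s3.access-points.<bucket>` pattern used there.

```python
def access_point_conf(catalog, buckets, mrap_arn):
    """Build spark-submit --conf arguments mapping each bucket to one access point."""
    args = []
    for bucket in buckets:
        key = f"spark.sql.catalog.{catalog}.s3.access-points.{bucket}"
        args += ["--conf", f"{key}={mrap_arn}"]
    return args


# Example values taken from the thread; replace with your own catalog and ARN.
conf_args = access_point_conf(
    "test",
    ["my-bucket1", "my-bucket2"],
    "arn:aws:s3::123456789012:accesspoint:mfzwi23gnjvgw.mrap",
)
print(" ".join(conf_args))
```

Both buckets resolve to the same ARN here, which is what lets a single set of metadata paths serve either region.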

More about access points here: https://aws.amazon.com/s3/features/multi-region-access-points/

singhpk234 avatar Sep 20 '22 07:09 singhpk234

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

github-actions[bot] avatar Mar 20 '23 00:03 github-actions[bot]

This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'

github-actions[bot] avatar Apr 06 '23 00:04 github-actions[bot]