Support relative paths in Table Metadata
Background
Iceberg specification captures file references for the following:
- Table metadata references `location`, which determines the base location of the table.
- Snapshot references `manifests` and `manifest-list` to determine the manifests that make up the snapshot.
- Manifest list references `manifest_path`, which identifies the location of a manifest file.
- Manifest references `file_path`, which identifies a data file.
- Position based delete file references `file_path`, which identifies a data file on which a position based delete is to be applied.
All the file references are absolute paths right now.
Challenge
Table copy to another location could arise in the following use cases:
- Replication: Copy the table state and history of state changes to another data center or availability zone.
- Backup: Copy the table state and history of state changes to an archive storage for backup and recovery purposes.
Absolute file references must be rewritten to reflect the new target location of the table before the table can be consumed.
Solution Option
We could support relative paths as follows:
- Table metadata `location` shall always reference the absolute location of the table.
- All other path references shall support both relative and absolute references.
  - In case of a relative reference, it shall be relative to the `location` from the table metadata.
  - In case of an absolute reference, it is used directly.
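The resolution rule above can be sketched as follows; the `PathResolver` helper is hypothetical and not part of the Iceberg API, it only illustrates the proposed behavior:

```java
import java.net.URI;

public class PathResolver {
    // Hypothetical helper illustrating the proposed rule: an absolute
    // reference is used directly; a relative reference is resolved against
    // the `location` from the table metadata.
    public static String resolve(String tableLocation, String reference) {
        if (URI.create(reference).isAbsolute()) {
            return reference; // absolute reference: used directly
        }
        String base = tableLocation.endsWith("/") ? tableLocation : tableLocation + "/";
        return base + reference; // relative reference: resolved against the table location
    }
}
```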
We might further consider splitting the Table metadata into two pieces:
- Definition: This identifies elements of the metadata that do not usually change:
  - `format-version`
  - `table-uuid`
  - `location`
- Transactional: This identifies elements of the metadata that are expected to change with every transaction:
  - `sequence-number`
  - `current-snapshot-id`
  - `schema`
  - etc.
This will ensure the following:
- Initial replication requires only a revision to the `location` attribute in the table metadata.
- All subsequent replications do not require any manipulation of any file reference and can happen incrementally.
I remember we discussed this use case some time ago in a sync-up meeting, and the related issue was #1531. The general feedback was that it is better to expose this as a Spark procedure and rewrite the manifests with new file path URIs to reflect the new location after replication or backup. This avoids the trouble of redefining the Iceberg spec. Any thoughts?
I don't think it is a good idea in general to use relative paths. We recently had an issue where using an HDFS location without authority caused a user's data to be deleted by the RemoveOrphanFiles action because the resolution of the table root changed. The main problem is that places in Iceberg would need to have some idea of "equivalent" paths and path resolution. Full URIs are much easier to work with and more reliable.
But there is still a way to do both. Catalogs and tables can inject their own FileIO implementation, which is what is used to open files. That can do any resolution that you want based on environment. So you could use an implementation that allows you to override a portion of the file URI and read it from a different underlying location. I think that works better overall because there are no mistakes about equivalent URIs, but you can still read a table copy without rewriting the metadata.
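The prefix-override idea above could look like the following sketch. The `PrefixRewriter` class is hypothetical; it shows only the path-rewriting core that such a custom `FileIO` implementation would apply before opening files:

```java
public class PrefixRewriter {
    private final String fromPrefix;
    private final String toPrefix;

    public PrefixRewriter(String fromPrefix, String toPrefix) {
        this.fromPrefix = fromPrefix;
        this.toPrefix = toPrefix;
    }

    // Rewrite a configured portion of the file URI so a table copy can be
    // read from a different underlying location without rewriting metadata.
    public String rewrite(String path) {
        if (path.startsWith(fromPrefix)) {
            return toPrefix + path.substring(fromPrefix.length());
        }
        return path; // paths outside the configured prefix are left untouched
    }
}
```

A `FileIO` implementation wrapping this would call `rewrite` on every incoming path before delegating to the underlying storage client.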
@jackye1995 and @rdblue Thanks for your feedback.
Our current thought process is as follows:
- Only the table path shall be an absolute path, with all other paths in the metadata files being relative paths.
- The relative paths should not be visible outside of the metadata component, i.e. a relative path should be translated to an absolute path using the table path before other areas of Iceberg consume it, so only the metadata files shall store relative paths.
- The relative path references shall be confined to be within the table path, i.e. `../` shall not be allowed.
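The `../` restriction could be enforced with a check like this hypothetical sketch, which rejects any reference whose `..` segments would climb above the table path:

```java
public class RelativePathValidator {
    // Returns true only if the reference stays within the table path,
    // i.e. no ".." segment ever escapes above the table location.
    public static boolean isConfined(String reference) {
        if (reference.startsWith("/") || reference.contains("://")) {
            return false; // absolute references are not relative to the table
        }
        int depth = 0;
        for (String segment : reference.split("/")) {
            if (segment.equals("..")) {
                depth--;
            } else if (!segment.isEmpty() && !segment.equals(".")) {
                depth++;
            }
            if (depth < 0) {
                return false; // escaped the table path
            }
        }
        return true;
    }
}
```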
I don't think the scenario called out about a wrong HDFS location being cleaned will be worsened or improved as a result of this change. If the absolute path of the table is incorrectly interpreted to be something else then problems shall still happen.
We will explore the option of FileIO in parallel while we discuss further the complications of using relative paths.
Relative paths would help us too.
We certainly need to have support for a federated namespace with a virtual file system. We're moving to HDFS federation and moving our tables to be located in a virtual filesystem (e.g. `gridfs://cluster1/table`) so that we can move the data around without changing all of the metadata. The challenge here is that we need to make sure that the manifests don't contain the physical HDFS path, which is likely to break.
For data access control, it seems like a really good invariant that all data is under the table's path.
I'll add this as a discussion topic for the next Iceberg sync.
We've put together a design doc for this proposal. We thought through a few scenarios and added an implementation. We will be glad to consider other ideas as well.
Thanks for the design doc! I wonder if this can be done through changes in Catalog and FileIO without touching the Iceberg specification, which can be much simpler to ship without conflicting with already quite a few changes for the format v2 specifications.
The key difficulty I see is that table property is a part of the table metadata file. Therefore we cannot know if a table should use any special file path override without actually reading that file. If we have a way to know it without using FileIO, then we can decide for a specific table, all files should be read through a region-specific client for replication, or a different URI scheme for backup and HDFS federation.
So here is my alternative proposal:
- add an interface `TablePropertyIO` that is dedicated to reading and writing table properties.
- update catalog implementations (mostly in `BaseMetastoreCatalog`) to use `TablePropertyIO` when necessary, by adding it as a part of the `TableOperations` interface, similar to `FileIO io()`.
- add a new table property `table-io-impl` that can be customized to do things like replication, backup, etc.
- implement those specific table `FileIO` classes.
Could this proposal satisfy your requirements?
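A rough sketch of what the proposed `TablePropertyIO` interface might look like; both the interface and the in-memory implementation below are hypothetical illustrations, not existing Iceberg APIs:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical shape of the proposed interface; not an existing Iceberg API.
interface TablePropertyIO {
    Map<String, String> readProperties(String tableIdentifier);
    void writeProperties(String tableIdentifier, Map<String, String> properties);
}

// Trivial in-memory implementation, only for illustration.
class InMemoryTablePropertyIO implements TablePropertyIO {
    private final Map<String, Map<String, String>> store = new HashMap<>();

    @Override
    public Map<String, String> readProperties(String tableIdentifier) {
        return store.getOrDefault(tableIdentifier, new HashMap<>());
    }

    @Override
    public void writeProperties(String tableIdentifier, Map<String, String> properties) {
        store.put(tableIdentifier, new HashMap<>(properties));
    }
}
```

The point of the proposal is that a catalog could consult such properties (e.g. a `table-io-impl` key) before opening any metadata file with `FileIO`.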
Thanks for your comments @jackye1995. We discussed your proposal and here are our thoughts.
- Making no changes to the file paths and relying on `FileIO` implementations to internally interpret them differently can cause confusion, since the same path could mean totally different locations based on the underlying implementation.
With our approach,
- The path we see in metadata files is the path that gets acted upon. We feel this is cleaner and causes less confusion.
- Further, when we use relative paths, they will always be relative to the table. This invariant makes it easier to interpret the paths.
For the reasons above, we prefer the approach laid out in the design doc. We understand the concerns around Iceberg spec with v2 format already changing a lot. We are open to ideas on shipping this with the least interference (even if it means we wait till v2 gets out).
Please let us know your thoughts.
I read the doc through the Proposal and Approach sections and I am quite confused about what your proposal is. Could you clarify it a bit? It may be obvious from the following example sections what the changes are and why they are needed, but I think that information should be in the proposal/approach.
Thanks for the initial feedback @rdblue. I rephrased the proposal and approach sections. Could you please take another look at the document?
Thanks, Anurag
@flyrain, could you take a look at what can be done here?
@flyrain, in my latest comment to Ryan's comment in the design doc, I proposed starting with "supporting relative paths in metadata" which does not need changes to the spec. We can then look at refactoring the fields of metadata.json file. This will make it easier to implement and review the change. What are your thoughts?
Thanks @anuragmantri for filing the PR and updating the design doc. It is great that there is no spec change in the latest design.
I looked through the design doc. It seems to be much better than the last version, so thank you for the update.
Overall, I think it is still incorrect on a few points and is quite a bit longer than it will need to be in the end. The main confusion seems to be where the table location comes from. Iceberg table metadata tracks a location string that is written into all metadata.json files. This is the table's location. Iceberg does not require any location other than this one and there isn't a way to pass a different location in the TableOperations API; the only way to pass a table location is through TableMetadata.
There are also a few table properties that can affect locations:
- Metadata can be redirected by setting `write.metadata.path`, which defaults to the `metadata` folder under the table location.
- The default location provider writes data underneath `write.folder-storage.path`, which defaults to the `data` folder under the table location.
- The object storage location provider shards data underneath `write.object-storage.path`.
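The defaults described above can be summarized in a small sketch; the helper class and method names are hypothetical, while the property keys are the ones listed in the comment:

```java
import java.util.Map;

public class TableLocations {
    // Metadata defaults to the "metadata" folder under the table location
    // unless write.metadata.path is set.
    public static String metadataRoot(String tableLocation, Map<String, String> props) {
        return props.getOrDefault("write.metadata.path", tableLocation + "/metadata");
    }

    // Data defaults to the "data" folder under the table location
    // unless write.folder-storage.path is set.
    public static String dataRoot(String tableLocation, Map<String, String> props) {
        return props.getOrDefault("write.folder-storage.path", tableLocation + "/data");
    }
}
```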
There are two broad classes of tables:
- Hadoop tables identify the table itself by location and enforce that the location in `TableMetadata` is identical to the location used to identify the table when it is created. For Hadoop tables, it makes sense that the table location used for relative paths is the location that gets passed to create `HadoopTableOperations`. But then there could be a problem that table metadata actually points to a different location unless it is modified.
- Metastore tables do not have a location that is external to table metadata. However, when a metastore table is tracked by Hive, there is a location that Hive tracks and we set that to the table's location.
For table metadata files, there is only an external location for Hadoop tables. Otherwise, the table location must come from the table's metadata.json file. That makes it difficult to make metadata_location and previous_metadata_location relative paths. I would probably not attempt to make these paths relative, but I'm open to ideas.
We could update the spec to allow relative metadata.json paths for Hadoop tables, but will need to state how to handle a different location in the metadata file. (Ignore?)
Metadata locations aside, I think that it is a good plan to make the relative paths transparent. Most of the library should continue to operate on absolute paths and relative paths should be made while writing and made absolute while reading. That makes it so the changes are fairly easily tested. If you do this, then I think there is a lot less to cover in the design doc.
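The transparent scheme described above might look like this sketch (the class and method names are hypothetical): paths are made relative while writing metadata and made absolute again while reading, so the rest of the library only ever sees absolute paths:

```java
public class RelativePaths {
    // On write: strip the table location prefix if the path is under it.
    public static String toRelative(String tableLocation, String absolutePath) {
        String base = tableLocation.endsWith("/") ? tableLocation : tableLocation + "/";
        return absolutePath.startsWith(base)
            ? absolutePath.substring(base.length())
            : absolutePath; // paths outside the table location stay absolute
    }

    // On read: prepend the table location to any relative path.
    public static String toAbsolute(String tableLocation, String path) {
        if (path.contains("://") || path.startsWith("/")) {
            return path; // already absolute
        }
        String base = tableLocation.endsWith("/") ? tableLocation : tableLocation + "/";
        return base + path;
    }
}
```

A round trip through `toRelative` and `toAbsolute` with the same table location returns the original path, which is what makes the conversion easy to test.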
@flyrain and @anuragmantri, see my comment above. Thanks for working on this!
@rdblue: We would like to use relative path for replication of the Iceberg tables between on-prem Hive instances, or moving on-prem Iceberg tables to the cloud and sometimes even moving data from the cloud to on-prem.
While we can do this by rewriting the manifest and snapshot files, it would be much easier to synchronize only the files themselves and update the metadata_location / previous_metadata_location for the target table and be done with it.
Because of the use-cases above it would be useful to have relative paths in the Iceberg metadata for Metastore tables as well, and the absolute path could be generated using the Metastore table location. Maybe even the metadata_location / previous_metadata_location could be generated this way, so the tables on both end of the replication could have the same data and metadata too.
@pvary, we can explore that, but it is fairly easy to write a new metadata file and use that for a table. At least that's a small operation, compared to rewriting all metadata and position delete files. My main point is that we need to have a well-documented plan for tracking the table location.
Thanks @rdblue for explaining how locations are tracked in Iceberg tables. Based on your and @pvary's comments, I updated the design doc to include:

- For `HadoopTables`:
  - The location passed to `HadoopTableOperations` will be used as the base table location.
  - If the metadata.json file contains a different `location`, we will ignore it. (Do you see any issues?)
- For metastore tables:
  - For `HiveTable`:
    - We can use the location property tracked in the metastore as the base table location, OR
    - We could just use the `updateLocation()` API to create a new metadata.json file and use the `location` attribute in that as the base table location.
  - For other metastore tables that don't track table location, we can use the `location` attribute of the metadata.json file as the base table location.
Path conversion is transparent to APIs. I removed sections of the doc that refer to APIs that do not change.
What are your thoughts?
> For table metadata files, there is only an external location for Hadoop tables. Otherwise, the table location must come from the table's metadata.json file. That makes it difficult to make metadata_location and previous_metadata_location relative paths. I would probably not attempt to make these paths relative, but I'm open to ideas.
Hi @rdblue, we might leave metadata_location and previous_metadata_location as-is; they are table properties in a metastore like HMS and are not affected when we just move files from the source table to the target. They are absolute paths, but they still point to the right place without any change.
> @pvary, we can explore that, but it is fairly easy to write a new metadata file and use that for a table. At least that's a small operation, compared to rewriting all metadata and position delete files.
@rdblue: I am not sure that I understand what you are proposing here. I might be mistaken, but AFAIK we store the locations of the files in the snapshot file, in the manifest files, and in v2 in the position delete files as well. If we want to create a usable full copy of the table then currently we need to do the following:
- Copy the data files
- Recreate, or parse and replace, the path in all of the files mentioned above (`snapshot`, `manifest`, and `position delete` files)
Is there an easier way to create a full working replica of an Iceberg table where we do not use any files/data from the original table and the 2 tables (original and the new) can live independently after the creation of the replica?
> My main point is that we need to have a well-documented plan for tracking the table location.
Totally agree with this 👍
Thanks, Peter
@pvary, I'm talking only about the metadata.json files. I agree that using relative paths for manifest lists, manifests, and data files is a good idea. But we keep the table location in the root metadata file and so it is difficult to make its location relative.
I think we could do this for Hadoop tables by ignoring the location in metadata.json because it must match the table location. But Hadoop tables are mostly for testing and I don't recommend using them in practice. The bigger issue is how to track the table location for Hive or other metastore tables. In that case, I don't think it is unreasonable to use updateLocation to change the location of the table, but it is definitely inconvenient.
> Is there an easier way to create a full working replica of an Iceberg table where we do not use any files/data from the original table and the 2 tables (original and the new) can live independently after the creation of the replica?
@pvary, ideally, table replication doesn't involve data file rewrites or metadata (manifest-list, manifest, metadata.json) rewrites. The process would be as simple as the user copying all files needed, then changing the target table properties to get the new status. It isn't the case in reality, though.
In this issue thread, we were talking about two ways to replicate a table: 1. relative paths, 2. rebuilding the metadata files. Neither of them requires data file rewrites. However, the relative-path approach requires minimal metadata file rewriting, probably only metadata.json per our discussion, while the metadata-rebuild approach involves rewriting all three types of metadata files: metadata.json, manifest-list, and manifest. Each type of file stores table information that cannot be recreated only by looking at the data files, for example, the partition spec in metadata.json, its id in the manifest file, and the snapshot-related metadata.
To your question, both source and target tables should be able to live independently after the replication. That's relatively easy to achieve. The hard part is to enable incremental sync-up between them and bidirectional replication, which are quite common DR (disaster recovery) use cases.
Thanks for the detailed answer @flyrain! This really helps to have a clear understanding of the tasks at hand.
I would like to share my thoughts about 2 points:
> The hard part is to enable incremental sync-up between them and bidirectional replication, which are quite common DR (disaster recovery) use cases.
I think that if we chose the relative path approach then replication becomes quite straightforward, since we do not have to handle the mix of different paths (source absolute path for the new metadata files, destination absolute path for the old metadata files). We just need to copy the metadata files for one directional replication. The bidirectional replication is a different kettle of fish because of the commit resolution complexity, but I think it is also easier since we do not have to care about manifests and manifest-lists
> However, the relative-path approach requires the minimal metadata file rewrite, probably only metadata.json per our discussion.
What do we want to change in the json? Is it only the path of the table, or we have to rewrite something else as well? Could we use something like the LocationProvider (which generates new datafile locations) to make the path resolution pluggable and store only the config in the table?
Thanks, Peter
Just trying to catch up with the thread, and reread the design doc.
For the design doc, I think we should not make HiveTable a separate case to discuss; all tables that are retrieved through a Catalog implementation should operate the same way if we introduce relative paths. We should not try to break the abstraction layer and treat Hive tables separately.
To have a brief summary (and also potentially answer Peter's question): from my understanding we are leading towards a solution where the following items keep track of the true table roots and should be absolute, while all other paths can be relative:
- `location`
- `write.metadata.path`
- `write.folder-storage.path`
- `write.object-storage.path`
Then these 4 properties are used to get 2 absolute root paths:
- root path to write data: the `LocationProvider` would read some fields in this list to determine the base root path, and generate absolute paths to write data files.
- root path to write metadata: similarly, `TableOperations.metadataFileLocation` reads some fields above to determine the base root path, and generates paths for metadata files.
These probably should be exposed as new methods to make sure the same location derivation logic is used everywhere.
Once the content of a file is written, the paths are written to one metadata layer above with/without absolute path based on a flag proposed in the design doc, and the writer of this path needs to know what are the true root paths used in the 2 places above. On the read side, the absolute path is created based on the 2 absolute root paths. Those operations are done inside each specific IO method by passing the relative/absolute path, the correct absolute root path, and the boolean flag, to make sure operations are transparent above the IO level.
All updates to the 4 locations on the top can be done in a UpdateProperties + UpdateLocation transaction, and when the metadata path is updated, the metadata_location in the catalog will be automatically updated with the new file root path to write and then read the latest metadata json file that is now in another root path. Therefore, we need to make sure that when this value is updated, the metadata json file writer uses the value in the update instead of the existing value. This should make it possible to switch a reader of an Iceberg table from one replica to another explicitly, or to implement trial-and-error logic that switches to the replica after certain retries. I am not sure if there is a use case such as choosing a random replica to read files, but I think that should be solved by directly registering all replicas as different tables and round-robining on the reader side.
I think if we go with this route then it sounds good to me, and we can try to hash out some more details in PRs. Please let me know if this summary is accurate, or if I have any misunderstanding.
Thanks @jackye1995 for highlighting the properties which we need to handle!
I did not know about write.object-storage.path before. Do I understand correctly that this is a LocationProvider-specific parameter for ObjectStoreLocationProvider, which is more or less the same as write.folder-storage.path for the DefaultLocationProvider? If so, it highlights how tricky it could be to make sure that every LocationProvider-specific parameter is handled correctly when we are updating the locations in the metadata json. Maybe we should delegate this path generation to the LocationProvider altogether, or something similar.
> All updates to the 4 locations on the top can be done in a UpdateProperties + UpdateLocation transaction
The process you have described above sounds useful to me if we have a place where both of the replicas are accessible, like a central node where we can initiate the copy of the data and then call UpdateProperties + UpdateLocation, so the replica could have its own metadata json which holds the correct info.
In our cases for the on-prem to cloud migration it is possible that the file copy is done by a different method and then we need to read and update the metadata json file, in which the locations are not accessible. So it would be good if we can run UpdateProperties + UpdateLocation on an "invalid" table.
> If so, it highlights how tricky it could be to make sure that every LocationProvider-specific parameter is handled correctly when we are updating the locations in the metadata json
Yes, that is why I was proposing that if we decide to go with relative paths, LocationProvider should expose a method that describes the true file path root it is using, maybe something like `LocationProvider.root()`, and any plugin implementation can override this method if necessary. But LocationProvider only governs the paths for data files, and metadata file paths are governed by `TableOperations.metadataFileLocation(String fileName)`, so we should also have a new method like `TableOperations.metadataRoot()` to describe the true root. Those 2 roots should be sufficient for any reader and writer to use if they stick with the Iceberg core IO library.
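The two proposed root accessors might be sketched like this; the method names are taken from the comment above and are hypothetical, not existing Iceberg APIs:

```java
// Hypothetical shapes of the proposed methods; not existing Iceberg APIs.
public class Roots {
    interface LocationProvider {
        String newDataLocation(String filename);

        String root(); // proposed: expose the true data file root
    }

    // A default-style provider rooted at <tableLocation>/data.
    static LocationProvider defaultProvider(String tableLocation) {
        String dataRoot = tableLocation + "/data";
        return new LocationProvider() {
            @Override
            public String newDataLocation(String filename) {
                return dataRoot + "/" + filename;
            }

            @Override
            public String root() {
                return dataRoot;
            }
        };
    }
}
```

Writers could then call `root()` to decide whether a generated path can be stored relative to that root.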
On the Trino side it's quite a bit more complicated because some paths are hard-coded and directly generated based on Hive conventions, but I think that is an issue to fix on the Trino side: https://github.com/trinodb/trino/blob/master/plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergPageSink.java#L303-L308
But just keep in mind that there is always a risk for a compute engine to not stick with those things, and it will be hard to enforce.
@pvary, yes, the relative path approach looks more elegant than the metadata rebuild one. As I mentioned, it is close to the ideal case, which doesn't rewrite the metadata files.
> What do we want to change in the json? Is it only the path of the table, or do we have to rewrite something else as well? Could we use something like the LocationProvider (which generates new datafile locations) to make the path resolution pluggable and store only the config in the table?
We need to change the `location` in metadata.json, or a combination of the three table properties mentioned by @rdblue and @jackye1995.
Thanks for @jackye1995's proposal. It makes sense to have a centralized place to get the roots for data and metadata since there are multiple ways to specify their locations. We can also hide the logic of how to prioritize them. cc @anuragmantri.
Why just get the root for the data/metadata from the LocationProvider?
Why not expose these methods instead:
```java
// initialize the provider with the root paths and the `useRelativePath` config,
// or have different implementations like:
// - AbsoluteLocationProvider
// - RelativeLocationProvider
LocationProvider(Properties tableProperties);

// generate the absolute path for the metadata files
Path metaDataLocation(String file);

// generate the absolute path for the data files
Path dataLocation(String file);

// generate the value we put into the avro files - absolute or relative path
String metaDataKey(Path metadataFile);

// generate the value we put into the avro files - absolute or relative path
String dataKey(Path dataFile);
```
Catching up on the comments from past couple of days. Thanks everyone for providing inputs.
@flyrain - Although it will increase the scope of this design to include bi-directional replication, we should consider covering that here.
Overall, I agree with @jackye1995's proposal. Thanks for the detailed explanation.
> Then these 4 properties are used to get 2 absolute root paths:
>
> - root path to write data: the LocationProvider would read some fields in this list to determine the base root path, and generate absolute paths to write data files.
> - root path to write metadata: similarly, the TableOperations.metadataFileLocation reads some fields above to determine the base root path, and generates paths for metadata files.
In the initial design doc, my thinking was that the writes mentioned above don't need to change at all. However, reading the comments from @pvary above, we may indeed have to store a root location somewhere for cases where we want to read inaccessible paths. Whether we want to have an explicit method to set this root before accessing a table, or have a mechanism for replicated tables to register themselves as replicas and then use retry logic, can be discussed.
My takeaways from this are:
- Metadata writers need to know: a) the relative path boolean, b) the true root of the table, c) the metadata path, if set. Additionally, they will also convert paths to relative in metadata files.
- Data writers need to know: a) the data path, if set, b) the true root of the table.

An update transaction will change all of the above except the relative path boolean. Please correct me if my understanding is incorrect.
> Why not expose these methods instead:
>
> ```java
> // initialize the provider with the root paths and the `useRelativePath` config,
> // or have different implementations like:
> // - AbsoluteLocationProvider
> // - RelativeLocationProvider
> LocationProvider(Properties tableProperties);
>
> // generate the absolute path for the metadata files
> Path metaDataLocation(String file);
>
> // generate the absolute path for the data files
> Path dataLocation(String file);
>
> // generate the value we put into the avro files - absolute or relative path
> String metaDataKey(Path metadataFile);
>
> // generate the value we put into the avro files - absolute or relative path
> String dataKey(Path dataFile);
> ```
@pvary - This may make the code change larger. Other than that, I don't see why this cannot be done unless I'm missing something.