Accessing Iceberg tables without a catalog
Query engine
Spark
Question
I have a scenario where I have lost access to the catalog data, but I still have access to the Iceberg table metadata and data files. Is there a way to access the tables without the catalog?
Do I need to create external tables in some other catalog and point them to the warehouse location?
If I understand correctly, you have lost your catalog data (e.g. the data in HMS or in your dynamodb table). Is that correct?
There's a RegisterTableProcedure that can be used to register an existing table into a catalog.
I'm not sure if the procedure is available in Iceberg 0.14.0, though you might be able to use the action directly from spark code (see the associated unit tests for the "action", which actually comes from BaseMetastoreCatalog). But you could otherwise try running one of the import procedures.
I think that's what you're looking for.
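For reference, a minimal sketch of what that procedure call looks like once the feature is available in your Iceberg version (the catalog name, table identifier, and metadata file path below are placeholders):
# Hypothetical example: register an existing table's metadata file into a catalog
# named my_catalog. 'db.tbl' and the S3 path are placeholders for your own values.
spark.sql("""
  CALL my_catalog.system.register_table(
    table => 'db.tbl',
    metadata_file => 's3a://my-bucket/warehouse/db/tbl/metadata/v3.metadata.json'
  )
""")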
@kbendick yeah, I lost access to HMS. But I have access to the warehouse folder in S3, with the data and metadata folders for all the tables generated by Iceberg. I'm just looking at ways to recreate the tables, as well as how to read the data correctly using the information available in the warehouse directory.
I have enabled the Spark extensions and am able to call the stored procedure. So what I am trying to call now is spark.sql("CALL register_table()") with parameters for the table and the metadata file. The table is something I have created in my local catalog, and I'm giving it the metadata directory of the original warehouse.
spark-shell --conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.local.type=hadoop \
  --conf spark.sql.catalog.local.warehouse=$PWD/warehouse \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
Are these the right steps?
I created a local catalog, but when trying to register a table I get a "not supported" error. I am on Spark 3.3 and Iceberg 0.14.
spark.sql("CALL local.system.register_table('table','s3a://test/')").show()
java.lang.UnsupportedOperationException: Registering tables is not supported
  at org.apache.iceberg.catalog.Catalog.registerTable(Catalog.java:363)
  at org.apache.iceberg.CachingCatalog.registerTable(CachingCatalog.java:187)
  at org.apache.iceberg.spark.procedures.RegisterTableProcedure.call(RegisterTableProcedure.java:84)
The registerTable() functionality from https://github.com/apache/iceberg/pull/5037 didn't make it into 0.14.0. However, we do publish nightly snapshot versions off of master (0.15.0-SNAPSHOT) so as a workaround you could try and use that version. Otherwise you might have to wait for the next version to be released with this functionality.
I'm not able to find the repository location for the nightly build. Could you please point me to it?
You need to use the snapshot repository mentioned in https://infra.apache.org/repository-faq.html. The artifacts themselves are under https://repository.apache.org/content/groups/snapshots/org/apache/iceberg/
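As a rough sketch, a PySpark session could pull a nightly build from that repository like this (the artifact coordinates below are an assumption and should be verified against the snapshot repository):
from pyspark.sql import SparkSession

# Sketch only: resolve the Iceberg Spark runtime from the ASF snapshot repository.
# The artifact name and snapshot version are assumptions; check the repo listing.
spark = (
    SparkSession.builder
    .config("spark.jars.repositories",
            "https://repository.apache.org/content/groups/snapshots")
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:0.15.0-SNAPSHOT")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .getOrCreate()
)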
Thanks, I was able to test it out. One issue: if I move the data backed by the S3 bucket to a DR location and try to recreate the table, the S3 buckets have different names. Are there any tools to fix the location references used in the metadata with the new locations?
Hi, I would like to do (kind of) the same. Catalogs are very hard to maintain in my use case.
Is it possible to open Iceberg tables like simple Parquet stores in PySpark?
spark.read.format("iceberg").load(iceberg_path)
Currently you would always have to go through a catalog
@nastra Will this change in the near future?
My aim is to provide people with a large sorted dataset that they can simply download, read and query, while avoiding unnecessary sort shuffles when reading and joining on it.
Having to set up Spark catalogs makes this impossible, or at least very hard.
@nastra How would I register an Iceberg dataset to the in-memory PySpark catalog?
@Hoeze you would have to create a catalog and then register the tables within that catalog (similar to https://github.com/apache/iceberg/issues/5512#issuecomment-1217163086). However, note that the registerTable() functionality from https://github.com/apache/iceberg/pull/5037 has not been part of an official release yet.
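A possible end-to-end sketch in PySpark, assuming an Iceberg build whose configured catalog supports registerTable() (all catalog names, table names, and paths below are placeholders):
from pyspark.sql import SparkSession

# Define a Hadoop catalog named "local" in the session config, register the
# existing table's metadata file into it, then query the table by name.
spark = (
    SparkSession.builder
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")
    .config("spark.sql.catalog.local.warehouse", "/tmp/warehouse")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .getOrCreate()
)

spark.sql("""
  CALL local.system.register_table(
    table => 'db.tbl',
    metadata_file => 's3a://my-bucket/warehouse/db/tbl/metadata/v3.metadata.json'
  )
""")
spark.table("local.db.tbl").show()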
@Hoeze For your use case is it sufficient to have a "temp view" on your iceberg table that's just available in the spark session that registers it?
Assuming your Iceberg metadata is intact with a metadata/v#.metadata.json file under the base Iceberg table path, you should be able to do:
spark.read.format("iceberg").load(iceberg_path).createOrReplaceTempView("my_view")
spark.sql("select * from my_view")
The iceberg_path should be the parent directory of metadata/.
@dennishuo I am not sure if I understand your suggestion. Isn't your code snippet equal to mine?
# yours
spark.read.format("iceberg").load(iceberg_path).createOrReplaceTempView("my_view")
df1 = spark.table("my_view")
# mine
df2 = spark.read.format("iceberg").load(iceberg_path)
Now, if df1 == df2 it would be absolutely sufficient.
As I said, I just want to send an iceberg store to people and let them load it in PySpark in a single line of code.
@Hoeze Ah I assumed you were asking about how to access the dataframe from SparkSQL since your later question was how to register to the in-memory PySpark catalog.
Did you have trouble getting the basic spark.read.format("iceberg").load(iceberg_path) to work? That command should work fine to read individual Iceberg tables as dataframes, the same way you would read a directory full of Parquet files as a Parquet dataframe.
@dennishuo Thanks for this option: spark.read.format("iceberg").load(iceberg_path).createOrReplaceTempView("my_view")
- I tried this yesterday for my use case but got a "version hint file missing" issue. We are using S3 as storage.
- When you write data back in Iceberg format from the view, will the metadata evolve appropriately?
@asheeshgarg Right, unfortunately, as I understand it, mutations on the existing iceberg table would require catalog integration, so the low-level dataframe load approach would just be for reads.
When I was using this myself, the missing version-hint error appeared to just be a "warning", and I was still successfully able to use the dataframe by ignoring the error message.
Under the hood, the version-hint.text file (note that the spelling is indeed .text, not .txt: https://github.com/apache/iceberg/blob/dc5f5c38f871f119b79ba167f8c075fc825797b8/core/src/main/java/org/apache/iceberg/hadoop/Util.java#L44) is used by the default HadoopCatalog as a pointer to the "latest/official" version of the table metadata. When the file is missing, Spark/Hadoop fall back to listing all the *.metadata.json files. You can see where the warning for the missing version hint is caught, and how it falls through to attempting the listing, here: https://github.com/apache/iceberg/blob/dbb8a404f6632a55acb36e949f0e7b84b643cede/core/src/main/java/org/apache/iceberg/hadoop/HadoopTableOperations.java#L325
As long as your v*.metadata.json filenames follow that naming convention of being monotonically increasing and fitting in an int, the file-listing approach technically works in the absence of concurrent write attempts from other engines. If you have tons (i.e., many thousands) of versioned metadata files in the metadata directory, this will be slow.
If you do need to worry about transactionality with lots of writers trying to "commit" new metadata.json files, you at the very least need those writers to correctly populate version-hint.text to serve as an "atomic commit" of the correct write.
Ideally, you'd use another Catalog implementation -- one of the main reasons for having separate Catalog implementations is precisely to overcome the shortcomings of the default HadoopCatalog-based approach.
What system were you using to write the Iceberg tables in the first place?
This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.
This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'
It's been almost a year; can't we read Iceberg tables from an S3 path (without a catalog) yet?
@jarias1 You can read a table directly from the metadata: https://iceberg.apache.org/javadoc/1.5.0/org/apache/iceberg/StaticTableOperations.html
This allows read-only access. For writes, a catalog is needed to handle concurrent operations.
spark.read.format("iceberg").load(iceberg_path) didn't work for me where iceberg_path is the parent folder of /metadata and /data
spark.read.format("iceberg").load("<iceberg path>/metadata/v012345.metadata.json") worked where we point to a specific version by reading the version-hint.text