Accessing Iceberg tables without catalog

Open asheeshgarg opened this issue 3 years ago • 7 comments

Query engine

Spark

Question

I have a scenario where I have lost access to the catalog data, but I still have access to the Iceberg table metadata and data files. Is there a way to access the tables without a catalog?

Do I need to create external tables in some other catalog and point them to the warehouse location?

asheeshgarg avatar Aug 12 '22 17:08 asheeshgarg

If I understand correctly, you have lost your catalog data (e.g. the data in HMS or in your dynamodb table). Is that correct?

There's a RegisterTableProcedure that can be used to register an existing table into a catalog.

I'm not sure if the procedure is available in Iceberg 0.14.0, though you might be able to use the action directly from spark code (see the associated unit tests for the "action", which actually comes from BaseMetastoreCatalog). But you could otherwise try running one of the import procedures.

I think that's what you're looking for.
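
For reference, once a build with that procedure is available, the call through Spark SQL looks roughly like this (a sketch; the catalog name, table identifier, and metadata file path are placeholders, not values from this issue):

spark.sql("""
  CALL my_catalog.system.register_table(
    table => 'db.my_table',
    metadata_file => 's3://bucket/path/to/table/metadata/v3.metadata.json'
  )
""")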

kbendick avatar Aug 15 '22 17:08 kbendick

@kbendick yeah, I lost access to HMS, but I have access to the warehouse folder in S3 with the data and metadata folders for all the tables generated by Iceberg. I'm just looking at ways to recreate the tables, and how to read the data correctly, using the information available in the warehouse directory.

I have enabled the Spark extensions and am able to call the stored procedure. So what I am trying to call now is spark.sql("CALL register_table()") with parameters for the table and the metadata file. The table is something I have created in my local catalog, and I am giving it the metadata directory of the original warehouse.

spark-shell \
  --conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.local.type=hadoop \
  --conf spark.sql.catalog.local.warehouse=$PWD/warehouse \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions

Are these the right steps?

asheeshgarg avatar Aug 15 '22 18:08 asheeshgarg

I created a local catalog, but when trying to register a table I get a "not supported" error. I am on Spark 3.3 and Iceberg 0.14.

spark.sql("CALL local.system.register_table('table','s3a://test/')").show()

java.lang.UnsupportedOperationException: Registering tables is not supported
  at org.apache.iceberg.catalog.Catalog.registerTable(Catalog.java:363)
  at org.apache.iceberg.CachingCatalog.registerTable(CachingCatalog.java:187)
  at org.apache.iceberg.spark.procedures.RegisterTableProcedure.call(RegisterTableProcedure.java:84)

asheeshgarg avatar Aug 16 '22 21:08 asheeshgarg

The registerTable() functionality from https://github.com/apache/iceberg/pull/5037 didn't make it into 0.14.0. However, we do publish nightly snapshot versions off of master (0.15.0-SNAPSHOT) so as a workaround you could try and use that version. Otherwise you might have to wait for the next version to be released with this functionality.

nastra avatar Aug 19 '22 09:08 nastra

I'm not able to find the artifact repository location for the nightly build. Could you please point me to it?

asheeshgarg avatar Aug 19 '22 13:08 asheeshgarg

You need to use the snapshot repository mentioned in https://infra.apache.org/repository-faq.html. The artifacts themselves are under https://repository.apache.org/content/groups/snapshots/org/apache/iceberg/
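
To pull a snapshot build into a Spark session, the snapshot repository can be added alongside the runtime artifact. A sketch in PySpark (the artifact coordinate below assumes Spark 3.3 / Scala 2.12 and the snapshot version mentioned earlier; adjust to your environment):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Extra Maven repository used to resolve -SNAPSHOT artifacts
    .config("spark.jars.repositories",
            "https://repository.apache.org/content/groups/snapshots")
    # Example coordinate for the Iceberg Spark runtime snapshot
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:0.15.0-SNAPSHOT")
    .getOrCreate()
)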

nastra avatar Aug 19 '22 14:08 nastra

Thanks, I was able to test it out. One issue: if I move the data backed by an S3 bucket to a DR location and try to recreate the table, the S3 buckets have different names. Are there any tools to fix the location references used in the metadata so they point to the new locations?

asheeshgarg avatar Sep 15 '22 20:09 asheeshgarg

Hi, I would like to do (kind of) the same. Catalogs are very hard to maintain in my use case.

Is it possible to open Iceberg tables like simple Parquet stores in PySpark? spark.read.format("iceberg").load(iceberg_path)

Hoeze avatar Sep 28 '22 09:09 Hoeze

Hi, I would like to do (kind of) the same. Catalogs are very hard to maintain in my use case.

Is it possible to open Iceberg tables like simple Parquet stores in PySpark? spark.read.format("iceberg").load(iceberg_path)

Currently you would always have to go through a catalog

nastra avatar Sep 28 '22 10:09 nastra

Currently you would always have to go through a catalog

@nastra Will this change in the near future?

Hoeze avatar Sep 28 '22 13:09 Hoeze

My aim is to provide people with a large sorted dataset that they can simply download, read and query, while avoiding unnecessary sort shuffles when reading + joining on it.

Having to set up Spark catalogs makes this impossible, or at least very hard.

Hoeze avatar Sep 28 '22 13:09 Hoeze

@nastra How would I register an Iceberg dataset to the in-memory PySpark catalog?

Hoeze avatar Sep 28 '22 14:09 Hoeze

@nastra How would I register an Iceberg dataset to the in-memory PySpark catalog?

@Hoeze you would have to create a catalog and then register the tables within that catalog (similar to https://github.com/apache/iceberg/issues/5512#issuecomment-1217163086). However, note that the registerTable() functionality from https://github.com/apache/iceberg/pull/5037 has not been part of an official release yet.
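
A minimal sketch of that in PySpark, assuming an Iceberg build that includes the registerTable() support (the catalog name, warehouse path, table identifier, and metadata file path are placeholders, and behavior with a Hadoop catalog may vary by Iceberg version):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    # A local Hadoop catalog backed by a warehouse directory
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")
    .config("spark.sql.catalog.local.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)

# Register the existing table by pointing at one of its metadata JSON files,
# then query it by name through the catalog.
spark.sql("""
  CALL local.system.register_table(
    table => 'db.my_table',
    metadata_file => 's3a://bucket/path/to/table/metadata/v3.metadata.json'
  )
""")
spark.sql("SELECT * FROM local.db.my_table").show()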

nastra avatar Sep 28 '22 15:09 nastra

@Hoeze For your use case is it sufficient to have a "temp view" on your iceberg table that's just available in the spark session that registers it?

Assuming your Iceberg metadata is intact with a metadata/v#.metadata.json file under the base Iceberg table path, you should be able to do:

spark.read.format("iceberg").load(iceberg_path).createOrReplaceTempView("my_view")
spark.sql("select * from my_view")

The iceberg_path should be the parent directory of metadata/.

dennishuo avatar Oct 08 '22 02:10 dennishuo

@dennishuo I am not sure if I understand your suggestion. Isn't your code snippet equal to mine?

# yours
spark.read.format("iceberg").load(iceberg_path).createOrReplaceTempView("my_view")
df1 = spark.table("my_view")
# mine
df2 = spark.read.format("iceberg").load(iceberg_path)

Now, if df1 == df2 it would be absolutely sufficient.

As I said, I just want to send an iceberg store to people and let them load it in PySpark in a single line of code.

Hoeze avatar Oct 11 '22 10:10 Hoeze

@Hoeze Ah I assumed you were asking about how to access the dataframe from SparkSQL since your later question was how to register to the in-memory PySpark catalog.

Did you have trouble getting the basic spark.read.format("iceberg").load(iceberg_path) to work? That command should work fine to read individual Iceberg tables as dataframes, the same way you would read a directory full of Parquet files as a Parquet dataframe.

dennishuo avatar Oct 11 '22 16:10 dennishuo

@dennishuo Thanks for this option: spark.read.format("iceberg").load(iceberg_path).createOrReplaceTempView("my_view")

  1. I tried this yesterday for my use case, but I'm getting a "version hint file missing" issue. We are using S3 as storage.
  2. When you write the data back in Iceberg format from the view, does it work such that the metadata evolves appropriately?

asheeshgarg avatar Oct 11 '22 16:10 asheeshgarg

@asheeshgarg Right, unfortunately, as I understand it, mutations on the existing iceberg table would require catalog integration, so the low-level dataframe load approach would just be for reads.

When I was using this myself, the missing version-hint error appeared to just be a "warning", and I was still successfully able to use the dataframe by ignoring the error message.

Under the hood, version-hint.text (note that the spelling is indeed .text, not .txt: https://github.com/apache/iceberg/blob/dc5f5c38f871f119b79ba167f8c075fc825797b8/core/src/main/java/org/apache/iceberg/hadoop/Util.java#L44) is used by the default "HadoopCatalog" as a pointer to the "latest/official version" of the table metadata. When the file is missing, Spark/Hadoop falls back to "listing" all the *.metadata.json files. You can see where the "warning" for a missing version-hint is caught and how it falls through to attempting a listing here: https://github.com/apache/iceberg/blob/dbb8a404f6632a55acb36e949f0e7b84b643cede/core/src/main/java/org/apache/iceberg/hadoop/HadoopTableOperations.java#L325

As long as your v*.metadata.json filenames follow that naming convention of being monotonically increasing and fit in an int, the file-listing approach technically works in the absence of concurrent attempted writes from other engines. If you have tons (i.e., many thousands) of versioned metadata files in the metadata directories, this will be slow.

If you do need to worry about transactionality with lots of writers trying to "commit" new metadata.json files, you at the very least need those writers to correctly populate version-hint.text to serve as an "atomic commit" of the correct write.
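
To illustrate that last point, here is a minimal sketch of repopulating version-hint.text from the versioned metadata filenames (a hypothetical helper using local-filesystem paths for simplicity; with S3 you would list and write objects through your storage client instead):

import os
import re

def rewrite_version_hint(metadata_dir: str) -> int:
    # Find the highest N among v<N>.metadata.json files in the metadata directory
    versions = [
        int(m.group(1))
        for name in os.listdir(metadata_dir)
        if (m := re.match(r"v(\d+)\.metadata\.json$", name))
    ]
    if not versions:
        raise FileNotFoundError("no v*.metadata.json files found")
    latest = max(versions)
    # Point version-hint.text at that latest version
    with open(os.path.join(metadata_dir, "version-hint.text"), "w") as f:
        f.write(str(latest))
    return latest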

Ideally, you'd use another Catalog implementation -- one of the main reasons for having separate Catalog implementations is precisely to overcome the shortcomings of the default HadoopCatalog-based approach.

What system were you using to write the Iceberg tables in the first place?

dennishuo avatar Oct 11 '22 18:10 dennishuo

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

github-actions[bot] avatar Apr 10 '23 00:04 github-actions[bot]

This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'

github-actions[bot] avatar Apr 28 '23 00:04 github-actions[bot]

It's been almost a year; can't we read Iceberg tables from an S3 path (without a catalog) yet?

jarias1 avatar Mar 11 '24 20:03 jarias1

@jarias1 You can read a table directly from the metadata: https://iceberg.apache.org/javadoc/1.5.0/org/apache/iceberg/StaticTableOperations.html

This allows for read-only access. For writes, a catalog is needed to handle concurrent operations.
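
On the Python side, PyIceberg offers a similar read-only entry point. A sketch, assuming a recent PyIceberg release and credentials for the object store (the metadata path is a placeholder):

from pyiceberg.table import StaticTable

# Load the table straight from a metadata file, with no catalog involved (read-only)
table = StaticTable.from_metadata("s3://bucket/path/to/table/metadata/v3.metadata.json")
arrow_table = table.scan().to_arrow()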

Fokko avatar Mar 12 '24 20:03 Fokko

spark.read.format("iceberg").load(iceberg_path) didn't work for me, where iceberg_path is the parent folder of /metadata and /data. spark.read.format("iceberg").load("<iceberg path>/metadata/v012345.metadata.json") worked, where we point to a specific version by reading version-hint.text.
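
A sketch of that workaround (the bucket and table prefix are placeholders): read the version number from version-hint.text, then load the matching metadata file directly:

base = "s3a://bucket/path/to/table"
# version-hint.text holds the latest version number written by the Hadoop table code
version = spark.sparkContext.textFile(f"{base}/metadata/version-hint.text").first().strip()
df = spark.read.format("iceberg").load(f"{base}/metadata/v{version}.metadata.json")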

forestfang-stripe avatar May 14 '24 19:05 forestfang-stripe