[docs] Getting started with Iceberg & Spark
As a newcomer to the Apache Iceberg universe, I am eager to try out the functionality exposed by the framework.
It is not quite straightforward to set up an Iceberg environment on Spark. After downloading the Spark 3.1.2 distribution, I configured spark-defaults.conf:
spark.jars.packages org.apache.iceberg:iceberg-spark3-runtime:0.12.1
spark.sql.extensions org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.demo org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.demo.catalog-impl org.apache.iceberg.jdbc.JdbcCatalog
spark.sql.catalog.demo.uri jdbc:postgresql://postgres:5432/demo_catalog
spark.sql.catalog.demo.jdbc.user admin
spark.sql.catalog.demo.jdbc.password password
spark.sql.catalog.demo.io-impl org.apache.iceberg.hadoop.HadoopFileIO
spark.sql.catalog.demo.warehouse /home/iceberg/warehouse
spark.sql.defaultCatalog demo
Afterwards I set up Postgres to run in a Docker container:
docker run --name iceberg-spark-postgres -e POSTGRES_USER=admin -e POSTGRES_PASSWORD=password -e POSTGRES_DB=demo_catalog -p 5432:5432 -d postgres
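For reference, the same container can be expressed as a docker-compose service (a minimal sketch; the compose version and service name are my own choices, the credentials match the `docker run` command above):

```yaml
version: "3"
services:
  postgres:
    image: postgres
    container_name: iceberg-spark-postgres
    environment:
      POSTGRES_USER: admin
      POSTGRES_PASSWORD: password
      POSTGRES_DB: demo_catalog
    ports:
      - "5432:5432"
```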
While trying out the scenarios described on https://iceberg.apache.org/#maintenance/, the code snippets mention:
Table table = ...
Getting the Iceberg table from a Spark catalog is not that straightforward. After digging through the Iceberg source code, I stitched together this snippet for obtaining the table:
import org.apache.spark.sql.connector.catalog.Identifier

// Grab the current catalog (configured as "demo" in spark-defaults.conf)
// and cast it to Iceberg's SparkCatalog implementation.
val sparkCatalog = spark.sessionState.catalogManager.currentCatalog.asInstanceOf[org.apache.iceberg.spark.SparkCatalog]

// Load the Spark table, then unwrap the underlying Iceberg table from it.
val sparkTableTest1 = sparkCatalog.loadTable(Identifier.of(Array[String](""), "test1"))
val icebergTableTest1 = sparkTableTest1.table
What I'd like to have (as a newbie) on Iceberg is a Docker image / docker-compose setup to get started with Spark. Having everything packaged together and ready to use makes it much easier for a newcomer to get started.
For the code samples, I'd also very much appreciate having the SparkCatalog in Java/Scala/Python examples for a series of general usage scenarios that are not covered by SQL commands for Iceberg.
We actually have a helper for getting the underlying table; see https://github.com/apache/iceberg/blob/master/spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/Spark3Util.java#L644
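A sketch of using that helper from the spark-shell, assuming the linked method is `Spark3Util.loadIcebergTable` (as on current master; it is not in the 0.12.1 runtime) and a hypothetical table `db.test1` in the `demo` catalog:

```scala
// Requires an Iceberg Spark runtime that ships Spark3Util.loadIcebergTable.
import org.apache.iceberg.Table
import org.apache.iceberg.spark.Spark3Util

// "demo.db.test1" is a hypothetical catalog.namespace.table identifier;
// the helper resolves it and returns the underlying Iceberg Table directly.
val table: Table = Spark3Util.loadIcebergTable(spark, "demo.db.test1")
println(table.currentSnapshot)
```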
But I believe we would probably recommend using the procedure API (https://iceberg.apache.org/spark-procedures/#_top) from Spark.
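For example, the maintenance operations can be run as SQL procedures instead of going through the Java Table API (a sketch assuming the `demo` catalog from the config above and a hypothetical table `db.test1`; `expire_snapshots` and `rewrite_data_files` are documented Iceberg procedures):

```scala
// Expire snapshots older than a given timestamp for the table.
spark.sql("CALL demo.system.expire_snapshots(table => 'db.test1', older_than => TIMESTAMP '2021-12-01 00:00:00')")

// Compact small data files in the table.
spark.sql("CALL demo.system.rewrite_data_files(table => 'db.test1')")
```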
That said, it would be great if we could have some better samples and a more complete Docker image.
@RussellSpitzer indeed. I came across the loadTable method of Spark3Util, but since I am currently trying things out on Spark 3.1.2 (AFAIK Iceberg 0.12.1 doesn't yet fully work with Spark 3.2), this method wasn't available in the spark-shell.
In any case, simple examples that just work (which obviously come with a maintenance cost) would be a definite win for growing the community around Iceberg.
Thank you for the feedback.
> In any case, simple examples that just work (this comes obviously with a maintenance cost) would be a definite win to grow the community around Iceberg.
Cannot agree more. The community is working on the Docker image; it will be released pretty soon. Cc @samredai and @kbendick for more details.
This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.
This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'.