<docs> Getting started with Iceberg & Spark

Open findinpath opened this issue 3 years ago • 5 comments

As a newcomer to the Apache Iceberg universe, I am eager to try out the functionality the framework exposes.

Setting up an Iceberg environment on Spark is not quite straightforward. After downloading the Spark 3.1.2 distribution, I configured spark-defaults.conf:

spark.jars.packages                    org.apache.iceberg:iceberg-spark3-runtime:0.12.1
spark.sql.extensions                   org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.demo                 org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.demo.catalog-impl    org.apache.iceberg.jdbc.JdbcCatalog
spark.sql.catalog.demo.uri             jdbc:postgresql://postgres:5432/demo_catalog
spark.sql.catalog.demo.jdbc.user       admin
spark.sql.catalog.demo.jdbc.password   password
spark.sql.catalog.demo.io-impl         org.apache.iceberg.hadoop.HadoopFileIO
spark.sql.catalog.demo.warehouse       /home/iceberg/warehouse
spark.sql.defaultCatalog               demo
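
For a standalone application, roughly the same catalog settings could be applied programmatically. This is only a sketch: the object name `IcebergDemo`, the app name, and the `local[*]` master are illustrative assumptions, and it assumes the Iceberg Spark runtime and the PostgreSQL JDBC driver are already on the classpath (so `spark.jars.packages` is left out).

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical standalone entry point mirroring the spark-defaults.conf entries above.
object IcebergDemo {
  // Same keys and values as in spark-defaults.conf, minus spark.jars.packages.
  val icebergConf: Map[String, String] = Map(
    "spark.sql.extensions" -> "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    "spark.sql.catalog.demo" -> "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.demo.catalog-impl" -> "org.apache.iceberg.jdbc.JdbcCatalog",
    "spark.sql.catalog.demo.uri" -> "jdbc:postgresql://postgres:5432/demo_catalog",
    "spark.sql.catalog.demo.jdbc.user" -> "admin",
    "spark.sql.catalog.demo.jdbc.password" -> "password",
    "spark.sql.catalog.demo.io-impl" -> "org.apache.iceberg.hadoop.HadoopFileIO",
    "spark.sql.catalog.demo.warehouse" -> "/home/iceberg/warehouse",
    "spark.sql.defaultCatalog" -> "demo"
  )

  def main(args: Array[String]): Unit = {
    // Apply every entry to the builder before the session is created.
    val builder = SparkSession.builder().appName("iceberg-demo").master("local[*]")
    val spark = icebergConf
      .foldLeft(builder) { case (b, (key, value)) => b.config(key, value) }
      .getOrCreate()

    // Simple smoke test against the default catalog (demo).
    spark.sql("SHOW NAMESPACES").show()
    spark.stop()
  }
}
```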

Afterwards I set up Postgres to run in a Docker container:

docker run --name iceberg-spark-postgres -e POSTGRES_USER=admin -e POSTGRES_PASSWORD=password -e POSTGRES_DB=demo_catalog -p 5432:5432 -d postgres

While trying out the scenarios described on the page https://iceberg.apache.org/#maintenance/, I noticed that the code snippets all start from:

Table table = ...

Getting the Iceberg table for a Spark catalog is not that straightforward. After digging through the Iceberg source code, I stitched together this snippet for obtaining the table:

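import org.apache.spark.sql.connector.catalog.Identifier

// Cast the session's current catalog (here the default catalog, demo) to Iceberg's SparkCatalog
val sparkCatalog = spark.sessionState.catalogManager.currentCatalog.asInstanceOf[org.apache.iceberg.spark.SparkCatalog]

// Load the Spark-facing wrapper for table test1; the namespace is passed as a single empty string
val sparkTableTest1 = sparkCatalog.loadTable(Identifier.of(Array[String](""), "test1"))

// Unwrap the underlying org.apache.iceberg.Table
val icebergTableTest1 = sparkTableTest1.table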
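
As a sketch, the lookup above could be wrapped in a small helper and then fed into the snapshot-expiration scenario from the maintenance page. The helper name `loadIcebergTable` and the namespace `db` are illustrative assumptions, `demo` is the catalog name from the configuration above, and the code is meant to be pasted into a spark-shell where `spark` is already defined.

```scala
import java.util.concurrent.TimeUnit

import org.apache.iceberg.Table
import org.apache.iceberg.spark.SparkCatalog
import org.apache.iceberg.spark.source.SparkTable
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.connector.catalog.Identifier

// Hypothetical helper: resolve an Iceberg table through a named Spark catalog.
def loadIcebergTable(
    spark: SparkSession,
    catalog: String,
    namespace: Array[String],
    name: String): Table = {
  val sparkCatalog =
    spark.sessionState.catalogManager.catalog(catalog).asInstanceOf[SparkCatalog]
  // Explicit cast, so this compiles whether loadTable is declared to return
  // SparkTable or Spark's generic connector Table.
  val sparkTable =
    sparkCatalog.loadTable(Identifier.of(namespace, name)).asInstanceOf[SparkTable]
  sparkTable.table()
}

// Assumed table location: catalog "demo", namespace "db", table "test1".
val table = loadIcebergTable(spark, "demo", Array("db"), "test1")

// Expire snapshots older than one day, as in the maintenance docs.
val tsToExpire = System.currentTimeMillis() - TimeUnit.DAYS.toMillis(1)
table.expireSnapshots().expireOlderThan(tsToExpire).commit()
```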

What I'd like to have as a newcomer to Iceberg is a Docker image or Docker Compose setup for getting started with Spark. Having everything packaged together and ready to use makes it much easier to get started.

For the code samples, I'd also very much appreciate SparkCatalog examples in Java/Scala/Python for a series of general usage scenarios that are not covered by Iceberg's SQL commands.

findinpath, Jan 21 '22 21:01

We actually have a helper for getting the underlying table; see https://github.com/apache/iceberg/blob/master/spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/Spark3Util.java#L644
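
A minimal sketch of using such a helper from the spark-shell, assuming it is `Spark3Util.loadIcebergTable(spark, name)` and that the Iceberg runtime in use is recent enough to include it. The table name `demo.db.test1` is an illustrative assumption.

```scala
import org.apache.iceberg.Table
import org.apache.iceberg.spark.Spark3Util

// `spark` is the session provided by spark-shell.
// The name is resolved through Spark's catalogs, so it can be catalog-qualified.
val table: Table = Spark3Util.loadIcebergTable(spark, "demo.db.test1")

// The result is a plain Iceberg Table, so the Java API from the docs applies.
println(table.location())
println(table.schema())
```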

But I believe we would probably recommend using the procedure API from Spark: https://iceberg.apache.org/spark-procedures/#_top
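
For illustration, calling stored procedures from the spark-shell might look like the sketch below. It assumes the `demo` catalog and the Iceberg SQL extensions from the configuration in the original post; the table name `db.test1` and the timestamp are placeholders.

```scala
// `spark` is the session provided by spark-shell; procedures live under <catalog>.system.

// Expire snapshots older than the given timestamp (named arguments).
spark.sql(
  """CALL demo.system.expire_snapshots(
    |  table => 'db.test1',
    |  older_than => TIMESTAMP '2022-01-20 00:00:00.000'
    |)""".stripMargin).show()

// Rewrite manifests for the same table (positional argument).
spark.sql("CALL demo.system.rewrite_manifests('db.test1')").show()
```

This route avoids obtaining a `Table` instance at all, which is why it sidesteps the catalog lookup discussed above.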

That said, it would be great if we could have some better samples and a more complete Docker image.

RussellSpitzer, Jan 21 '22 21:01

@RussellSpitzer indeed. I came across the loadTable method of Spark3Util, but since I am currently trying things out on Spark 3.1.2 (AFAIK Iceberg 0.12.1 doesn't fully work with Spark 3.2 at the moment), I didn't have this method available in the spark-shell.

In any case, simple examples that just work (this comes obviously with a maintenance cost) would be a definite win to grow the community around Iceberg.

Thank you for the feedback.

findinpath, Jan 21 '22 21:01

> In any case, simple examples that just work (this comes obviously with a maintenance cost) would be a definite win to grow the community around Iceberg.

Couldn't agree more. The community is working on the Docker image, and it will be released pretty soon. Cc @samredai and @kbendick for more details.

flyrain, Jan 23 '22 23:01

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

github-actions[bot], Aug 10 '22 00:08

This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'

github-actions[bot], Aug 24 '22 00:08