
[ENH] - Delta Image

Open lymedo opened this issue 2 years ago • 7 comments

What docker image(s) is this feature applicable to?

all-spark-notebook

What changes are you proposing?

Can we get an image with Delta included?

https://docs.delta.io/latest/quick-start.html

How does this affect the user?

It would provide additional functionality.

Anything else?

No response

lymedo avatar Jul 09 '22 11:07 lymedo

@lymedo could you draft a PR for this feature? I've seen a few people here wanting this package, so it might make sense to add it for everyone. We will have to see how much the image size increases. We even have a recipe for adding this package: https://jupyter-docker-stacks.readthedocs.io/en/latest/using/recipes.html#enable-delta-lake-in-spark-notebooks

Also, I would suggest adding this package to pyspark-notebook instead; it makes more sense there (and it will also end up in all-spark-notebook).

mathbunnyru avatar Jul 09 '22 12:07 mathbunnyru

@mathbunnyru I'm afraid I'm a newbie to Docker, so I'm probably not the right person to draft the PR.

Thanks for the link to the recipe. I'll have a go at building that image locally.

lymedo avatar Jul 09 '22 19:07 lymedo

Delta Lake doesn't support Spark 3.3.0 yet.

The best you can do for now is to use scipy-notebook and then run !pip install delta-spark
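
If you go that route, a minimal sketch of a derived image might look like the following. This is an assumption-laden sketch, not an official recipe: the pyspark package that delta-spark pulls in needs a Java runtime, which scipy-notebook does not ship, so a JRE is installed too (the exact openjdk package name depends on the Ubuntu release of the base image).

```dockerfile
# Hedged sketch: delta-spark on top of scipy-notebook.
FROM jupyter/scipy-notebook

# pyspark (pulled in as a dependency of delta-spark) needs a Java runtime.
USER root
RUN apt-get update && \
    apt-get install --yes --no-install-recommends openjdk-17-jre-headless && \
    apt-get clean && rm -rf /var/lib/apt/lists/*

USER ${NB_UID}
RUN pip install --quiet --no-cache-dir delta-spark
```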

bjornjorgensen avatar Jul 09 '22 20:07 bjornjorgensen

I'd picked up on the Spark version constraint for Delta, so I built the image using the following:

Dockerfile

FROM jupyter/pyspark-notebook:spark-3.2.1

ARG DELTA_CORE_VERSION="1.2.1"
RUN pip install --quiet --no-cache-dir delta-spark==${DELTA_CORE_VERSION} && \
    fix-permissions "${HOME}" && \
    fix-permissions "${CONDA_DIR}"

USER root

# Register the Delta extension and catalog globally for all Spark sessions.
RUN echo 'spark.sql.extensions io.delta.sql.DeltaSparkSessionExtension' >> "${SPARK_HOME}/conf/spark-defaults.conf" && \
    echo 'spark.sql.catalog.spark_catalog org.apache.spark.sql.delta.catalog.DeltaCatalog' >> "${SPARK_HOME}/conf/spark-defaults.conf"

USER ${NB_UID}

# Start a throwaway session once so the Delta jars are downloaded and cached in the image.
RUN echo "from pyspark.sql import SparkSession" > /tmp/init-delta.py && \
    echo "from delta import *" >> /tmp/init-delta.py && \
    echo "spark = configure_spark_with_delta_pip(SparkSession.builder).getOrCreate()" >> /tmp/init-delta.py && \
    python /tmp/init-delta.py && \
    rm /tmp/init-delta.py

Notebook

from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

# Build a Delta-enabled session (the .config() calls duplicate what is
# already in spark-defaults.conf, but are harmless).
builder = SparkSession.builder.appName("Delta") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")

spark = configure_spark_with_delta_pip(builder).getOrCreate()
spark.getActiveSession()  # sanity check: returns the active session

# Write a small Delta table, then read it back.
data = spark.range(0, 5)
data.write.format("delta").save("test-table")
df = spark.read.format("delta").load("test-table")
df.show()

lymedo avatar Jul 10 '22 11:07 lymedo

I think Spark 3.3 support is coming quite soon. https://github.com/delta-io/delta/pull/1257

I will keep this issue open and tag it as upstream. Once delta-spark works with the new Spark version, I will try to add it to the image, and if the image size increase is not significant, I will merge it.

mathbunnyru avatar Jul 11 '22 17:07 mathbunnyru

The PR was merged, waiting for a release.

mathbunnyru avatar Aug 08 '22 13:08 mathbunnyru

Databricks 11.1 already supports Spark 3.3.0 and Delta Lake 2.0.0.

Bidek56 avatar Aug 08 '22 21:08 Bidek56

DL 2.1 has been released. Do we need a PR to include it, or is it handled upstream? Thx

Bidek56 avatar Sep 09 '22 14:09 Bidek56
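
In the meantime, pinning matching versions in a derived image should work. A hedged sketch follows; the spark-3.3.0 tag and the exact delta-spark pin are assumptions, so check which tags and releases are actually published before using it.

```dockerfile
# Hedged sketch: pin a delta-spark release built against the image's Spark version.
# Delta Lake 2.1.x targets Spark 3.3.x; adjust both pins together.
FROM jupyter/pyspark-notebook:spark-3.3.0

RUN pip install --quiet --no-cache-dir delta-spark==2.1.0
```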

I have some doubts about whether the JupyterLab docker images should add Delta Lake, given that Delta Lake has its own Dockerfile and is only a storage layer. Wouldn't the best solution here be a docker-compose file that combines a JupyterLab image with a Delta Lake image?

bjornjorgensen avatar Sep 11 '22 12:09 bjornjorgensen

  1. Not sure what the purpose of the Delta Dockerfile is. It doesn't even install the Delta library.
  2. I think we need the Delta library as part of the docker stack to read and write Delta Lake tables, no?

Bidek56 avatar Sep 11 '22 13:09 Bidek56

  1. No. It's just like S3. https://docs.delta.io/latest/quick-start.html

bjornjorgensen avatar Sep 11 '22 14:09 bjornjorgensen

Maybe I am missing something, but S3 is a storage engine like Hadoop, whereas Delta.io is a storage format (an extension of Parquet) that can be written to S3, Hadoop, Azure Blob, etc. No?

P.s. I use delta.io on a daily basis.

Bidek56 avatar Sep 11 '22 16:09 Bidek56

Since we have a recipe for Delta.io already, I would probably agree that we can close this issue.

Bidek56 avatar Sep 12 '22 16:09 Bidek56

Initially, I thought many people would like to see this package already included in the image. But it seems it won't be easy to add, because many people want to use the latest Spark versions (at least we get such PRs quite often), and Delta is a bit slow to support them.

I'm gonna close this issue. If we get many requests to add this package again, we should think about it.

mathbunnyru avatar Sep 12 '22 17:09 mathbunnyru