docker-stacks
[ENH] - Delta Image
What docker image(s) is this feature applicable to?
all-spark-notebook
What changes are you proposing?
Can we get an image with Delta included?
https://docs.delta.io/latest/quick-start.html
How does this affect the user?
It would let users read and write Delta Lake tables out of the box, without installing anything extra.
Anything else?
No response
@lymedo could you draft a PR for this feature? I've seen a few people here wanting this package, so it might make sense to add it for everyone. We'll also see how much the image size increases. We even have a recipe for adding this package: https://jupyter-docker-stacks.readthedocs.io/en/latest/using/recipes.html#enable-delta-lake-in-spark-notebooks
Also, I would suggest adding this package to pyspark-notebook; it makes more sense there (and it will also end up in all-spark-notebook).
@mathbunnyru I'm afraid I'm a newbie to Docker, so I'm probably not the right person to draft the PR.
Thanks for the link to the recipe. I'll have a go at building that image locally.
Delta Lake doesn't support Spark 3.3.0 yet.
The best thing you can do is to use scipy-notebook and then run !pip install delta-spark
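For example, in a running scipy-notebook container, something like this should work (a minimal sketch; the delta-spark package pulls in a compatible pyspark as a dependency):

# Notebook cell 1: install Delta's Python bindings (also installs a matching pyspark)
!pip install delta-spark

# Notebook cell 2: start a Delta-enabled Spark session
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = SparkSession.builder \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
spark = configure_spark_with_delta_pip(builder).getOrCreate()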
I'd picked up on the Spark version constraint for Delta, so I built the image using the following:
Dockerfile
FROM jupyter/pyspark-notebook:spark-3.2.1

ARG DELTA_CORE_VERSION="1.2.1"
RUN pip install --quiet --no-cache-dir delta-spark==${DELTA_CORE_VERSION} && \
    fix-permissions "${HOME}" && \
    fix-permissions "${CONDA_DIR}"

USER root
RUN echo 'spark.sql.extensions io.delta.sql.DeltaSparkSessionExtension' >> "${SPARK_HOME}/conf/spark-defaults.conf" && \
    echo 'spark.sql.catalog.spark_catalog org.apache.spark.sql.delta.catalog.DeltaCatalog' >> "${SPARK_HOME}/conf/spark-defaults.conf"

USER ${NB_UID}
# Run a throwaway Delta session once at build time so the Delta jars are
# downloaded into the image rather than on first use
RUN echo "from pyspark.sql import SparkSession" > /tmp/init-delta.py && \
    echo "from delta import *" >> /tmp/init-delta.py && \
    echo "spark = configure_spark_with_delta_pip(SparkSession.builder).getOrCreate()" >> /tmp/init-delta.py && \
    python /tmp/init-delta.py && \
    rm /tmp/init-delta.py
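If you want to try it locally, a build-and-run along these lines should work (the image tag is just an example):

docker build --tag pyspark-delta-notebook .
docker run -it --rm -p 8888:8888 pyspark-delta-notebook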
Notebook
from delta import *
from pyspark.sql import SparkSession

builder = SparkSession.builder.appName("Delta") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
spark = configure_spark_with_delta_pip(builder).getOrCreate()
spark.getActiveSession()

# Write a small Delta table, then read it back
data = spark.range(0, 5)
data.write.format("delta").save("test-table")

df = spark.read.format("delta").load("test-table")
df.show()
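If everything is wired up correctly, df.show() should print the five rows written above, something like this (row order is not guaranteed):

+---+
| id|
+---+
|  0|
|  1|
|  2|
|  3|
|  4|
+---+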
I think Spark 3.3 support is coming quite soon. https://github.com/delta-io/delta/pull/1257
I will keep this issue open and tag it as upstream.
Once delta-spark works with the latest Spark, I will try to add it to the image, and if the image size increase is not significant, I will merge it.
The PR was merged, waiting for a release.
Databricks Runtime 11.1 already supports Spark 3.3.0 and Delta Lake 2.0.0.
Delta Lake 2.1 has been released. Do we need a PR to include it, or is it handled upstream? Thanks.
I have some doubts about why the JupyterLab Docker images should add Delta Lake, given that Delta Lake has its own Dockerfile and is only a storage layer. Wouldn't the best solution here be a docker-compose file that combines a JupyterLab image with a Delta Lake image?
- Not sure what the purpose of the Delta Dockerfile is. It doesn't even install the Delta library.
- I think we need the Delta library as part of the Docker stack to read and write Delta Lake tables, no?
- No. It's just like S3. https://docs.delta.io/latest/quick-start.html
Maybe I am missing something, but S3 is a storage engine, like Hadoop, whereas Delta.io is a storage format, an extension of Parquet, that can be written to S3, Hadoop, Azure Blob, etc. No?
P.S. I use delta.io on a daily basis.
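To make the distinction concrete: the write in the notebook example above targets a path, and that path can point at any supported storage backend; Delta is the format, the backend is the storage. A sketch (the bucket name is hypothetical, and S3 access additionally requires the hadoop-aws/S3A setup):

# Same Delta format, different storage backend (hypothetical bucket; needs S3A configured)
data.write.format("delta").save("s3a://my-bucket/test-table")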
Since we have a recipe for Delta.io already, I would probably agree that we can close this issue.
Initially, I thought many people would like to see this package included in the image by default. But it seems it won't be easy to add, because many people want to use the latest Spark versions (at least we get such PRs quite often), and Delta is a bit slow to support the latest versions.
I'm gonna close this issue. If we get many requests to add this package again, we'll reconsider.