Publish a docker image for Delta Lake

Open zsxwing opened this issue 3 years ago • 10 comments

Currently it's not convenient to try out Delta Lake. People need to install Spark first and then follow multiple steps in https://docs.delta.io/latest/quick-start.html#set-up-apache-spark-with-delta-lake

It would be great if we can publish a docker image for Delta Lake so that people can try it out using a simple docker command.

For such a Docker image, we can maintain a Dockerfile in the GitHub repo and publish a new image to https://hub.docker.com/u/deltaio with each release.

zsxwing avatar Feb 01 '22 23:02 zsxwing
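
As a rough illustration of the per-release step proposed above, the sketch below lists the Docker commands a maintainer might run by hand. The repository name `deltaio/delta-lake`, the `DELTA_VERSION` build argument, the helper `release_commands`, and the example version are all hypothetical and are not confirmed anywhere in this thread; only `docker build` (with `-t` and `--build-arg`) and `docker push` are standard Docker CLI.

```python
import shlex

# Hypothetical repository name under the deltaio Docker Hub account.
IMAGE = "deltaio/delta-lake"


def release_commands(version, context="."):
    """Return the docker commands a maintainer might run for one release."""
    versioned = f"{IMAGE}:{version}"
    latest = f"{IMAGE}:latest"
    return [
        # Build once and tag the result with both the version and "latest".
        # DELTA_VERSION is an assumed build argument the Dockerfile would declare.
        ["docker", "build", "--build-arg", f"DELTA_VERSION={version}",
         "-t", versioned, "-t", latest, context],
        ["docker", "push", versioned],
        ["docker", "push", latest],
    ]


if __name__ == "__main__":
    # Print the commands instead of running them, so the sketch has no side effects.
    for cmd in release_commands("1.1.0"):  # example version only
        print(shlex.join(cmd))
```

Printing rather than executing keeps the release manual: the maintainer can review each command before running it.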

Do we have to write a dockerfile?

RishiKumarRay avatar Feb 02 '22 03:02 RishiKumarRay

@RishiKumarRay I think it's better to maintain a dockerfile in GitHub and we build and publish a new image when making a new release. Do you have any other suggestion?

zsxwing avatar Feb 02 '22 05:02 zsxwing

@zsxwing Yeah, I think it's great to maintain a Dockerfile. I can see a Dockerfile here

RishiKumarRay avatar Feb 02 '22 06:02 RishiKumarRay

so we need a github actions to publish images

RishiKumarRay avatar Feb 02 '22 07:02 RishiKumarRay

@zsxwing what is the intended entrypoint for the docker image? the various different shell options?

stikkireddy avatar Feb 02 '22 15:02 stikkireddy

> so we need a github actions to publish images

No. We don't need GitHub Actions. Currently, our release process is manual. We will build the image from the Dockerfile manually when making a new release.

> what is the intended entrypoint for the docker image? the various different shell options?

Yep. I think the default can be pyspark. We can also provide options for spark-shell, or even a jupyter notebook.

zsxwing avatar Feb 02 '22 17:02 zsxwing
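
One way to picture the entrypoint idea above (pyspark by default, with spark-shell or a Jupyter notebook as options) is a small dispatcher script. This is only a sketch under assumptions: the mode names, the `MODES` table, and `resolve_command` are hypothetical, it assumes `pyspark`, `spark-shell`, and `jupyter` are on the PATH inside the image, and it leaves out the Delta-specific Spark options described in the quick start.

```python
import sys

# Hypothetical mode names mapped to the command each would start.
# Assumes pyspark, spark-shell, and jupyter are on PATH inside the image.
MODES = {
    "pyspark": ["pyspark"],
    "spark-shell": ["spark-shell"],
    "jupyter": ["jupyter", "notebook", "--ip=0.0.0.0", "--no-browser"],
}
DEFAULT_MODE = "pyspark"


def resolve_command(args):
    """Pick the command for the requested mode, passing extra args through."""
    if not args:
        return list(MODES[DEFAULT_MODE])
    mode, extra = args[0], list(args[1:])
    if mode not in MODES:
        raise ValueError(f"unknown mode {mode!r}; expected one of {sorted(MODES)}")
    return MODES[mode] + extra


if __name__ == "__main__":
    try:
        command = resolve_command(sys.argv[1:])
    except ValueError as err:
        sys.exit(str(err))
    # Dry run: show what would be started. A real entrypoint would instead
    # hand control to the command, for example with os.execvp(command[0], command).
    print(" ".join(command))
```

With a script like this as the image entrypoint, running the container with no arguments would start pyspark, while passing `spark-shell` or `jupyter` as the first argument would select the other options.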

@zsxwing @RishiKumarRay - any updates on this issue?

I'm looking for a simple Java (non-Spark) Docker container that lets anyone reproduce the "Zappy" fake example in the Delta docs and in this Databricks blog - in a simple "Hello World" manner without installing anything locally.

It's not easy for someone coming into the Delta ecosystem to get up to speed - having a reproducible example managed as a Dockerfile in this repo would be a phenomenal value add.

I see the PR by @stikkireddy - I don't think this meets the use case, because he's using Spark in his Dockerfile - and it's just a wrapper around Jupyter notebooks (i.e. a simplified Databricks).

Rather, there should be a vanilla Java Dockerfile that has nothing to do with Apache Spark, to uphold the theme of the Standalone writer - let non-Spark clients become first-class Delta citizens. It's difficult for devs to do this if the barrier to entry is high; a Dockerfile would reduce the setup time to 5 mins.

It should be trivial to create such a Dockerfile for the maintainers of this repo - your help is appreciated!

mdrakiburrahman avatar Apr 20 '22 10:04 mdrakiburrahman

We will run Spark jobs based on Delta Lake on top of K8s and also need such a Dockerfile.

allenhaozi avatar Apr 29 '22 06:04 allenhaozi

@zsxwing is this what you were looking for? https://github.com/spaceandtimelabs/docker-spark-deltalake

avantgardnerio avatar Jun 20 '22 18:06 avantgardnerio

@zsxwing - looks like there is already a PR open for this issue. What are your recommended next steps? If a new contributor would like to work on this issue, should they just open another PR?

MrPowers avatar Sep 12 '22 16:09 MrPowers

Feel free to leave a comment in the open PR and pick up whatever is left.

zsxwing avatar Sep 20 '22 17:09 zsxwing

I'll comment directly in PR #922, but for a Spark-based Docker image, I'm wondering if it would be helpful to leverage the proposed Spark Docker image per SPIP: Support Docker Official Image for Spark.

dennyglee avatar Sep 27 '22 16:09 dennyglee

https://github.com/delta-io/delta-docs/tree/main/static/quickstart_docker

alberttwong avatar May 26 '23 23:05 alberttwong

Thanks @alberttwong - completely spaced out on this!

Yes, we have the Docker code as you noted, and we have also pushed the Delta Docker image to DockerHub at http://go.delta.io/dockerhub.

I'm thinking that we need to create a separate Docker repo (e.g., https://github.com/delta-io/docker) so we can automate the Docker builds. That said, since the Docker image has been created, I think it may be appropriate to close this issue. Will leave it open for a short while for more comments before closing it, eh?!

dennyglee avatar May 27 '23 00:05 dennyglee