mack
Docker image for contributors
A ready-to-use Docker image for project contributors as an alternative to a (local) Python environment via Poetry. The image would probably need to include things like:
- base image (e.g. ubuntu)
- (py)spark
- delta
- python libs
- environment vars
- etc.
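To make the idea above concrete, a minimal Dockerfile sketch covering those pieces could look like the following. All version pins, paths, and the final test command are assumptions for illustration, not a proposal for the actual image:

```dockerfile
# Hypothetical sketch of a contributor image: base OS, Java (needed by Spark),
# PySpark + Delta, Python tooling, and environment variables.
FROM ubuntu:22.04

# Spark needs a JVM; the Java and Python versions here are illustrative.
RUN apt-get update && \
    apt-get install -y openjdk-11-jre-headless python3 python3-pip && \
    rm -rf /var/lib/apt/lists/*

# PySpark 3.3.x pairs with delta-spark 2.2.x; pins are examples only.
RUN pip3 install pyspark==3.3.2 delta-spark==2.2.0 poetry

# Environment variables Spark/Delta jobs typically expect.
ENV JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64 \
    PYSPARK_PYTHON=python3

WORKDIR /workspace
COPY . .
RUN poetry install

# Default to running the test suite; contributors could override this
# with e.g. `docker run -it <image> bash` for interactive development.
CMD ["poetry", "run", "pytest"]
```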
I just tried to find examples in other OSS repos but didn't find any in the short research time. So maybe this is not useful to most contributors, or there are other reasons not to have it; nothing comes to mind at the moment.
@Triamus - thanks for adding this.
Anyone in the community can feel free to grab this.
@MrPowers I have built this kind of docker images in the past. Mind if I take a stab at this?
@souvik-databricks you probably know this but a few things I already researched.
- the Delta Lake project seems to be working on a Docker quickstart, but I couldn't find it in the official docs; however, it's on GitHub at delta-io/delta-docs: quickstart_docker
- Apache Spark has non-Docker-reviewed official images at GitHub apache/spark-docker: Official Dockerfile for Apache Spark
- there is an open Jira issue, SPARK-40513 - SPIP: Support Docker Official Image for Spark, with a corresponding Google doc (SPIP: Support Docker Official Image for Spark) for an "official", i.e. Docker-certified, image

I would think that ideally any image builds on top of those efforts, but I don't know the timeline. In Jira they speak of Spark 3.4.
@souvik-databricks I'd be happy to test things out if needed.
@souvik-databricks - yea, sure, go for it!
I think there are some Delta Lake docker images around. Let me take a look.
Actually, looks like @Triamus has already provided the link, here it is: GitHub/delta-io/delta-docs: quickstart_docker
@MrPowers I have a local branch ready to go for Docker and docker-compose support for mack if you want it. It runs the unit tests inside the container and also has instructions for dropping into the container for development. The container has Spark (`spark-3.3.2`), Delta (`delta-core_2.12:2.2.0`), etc., everything needed to develop and test.
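A compose setup along those lines might be sketched as follows. The service name, mount point, and command are assumptions for illustration, not the contents of the actual branch:

```yaml
# Hypothetical docker-compose.yml for running mack's tests in a container.
version: "3.8"
services:
  mack-dev:
    build: .                  # assumes a Dockerfile at the repo root
    volumes:
      - .:/workspace          # mount the repo so edits are visible live
    working_dir: /workspace
    command: poetry run pytest
```

With a file like this, `docker compose run mack-dev` would run the test suite, and `docker compose run mack-dev bash` would drop into the container for development.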
@danielbeach - yea, that sounds great. Any chance you could send a PR? I'll be happy to test, document in the README, and market. Thank you!
@MrPowers I tried to push a PR, but need access.
@danielbeach - sent you an invite to collab on the repo ;)
In the opening of the issue, I mentioned that I didn't find nice OSS examples of creating a reproducible local dev setup for contributors to a project. By coincidence, I saw a talk from PyData Global 2022, recently uploaded to YouTube, on exactly that topic from one of the core Airflow devs. It turns out that Airflow has invested a lot in what they call a `breeze` environment to cover everything from local dev and test to deployment. It is certainly overengineering for mack at this point, but it has some nice insights and potential ideas one can draw inspiration from. I leave the talk and the Airflow Breeze docs here for future reference.
- Jarek Potiuk - Managing Python Dependencies at Scale | PyData Global 2022 on YouTube
- Airflow Breeze Docs
From the docs:
> Airflow Breeze is an easy-to-use development and test environment using Docker Compose. The environment is available for local use and is also used in Airflow's CI tests. We call it Airflow Breeze as It's a Breeze to contribute to Airflow. The advantages and disadvantages of using the Breeze environment vs. other ways of testing Airflow are described in CONTRIBUTING.rst.