
[Feature Request]: Host docker images with the `conda` package manager for Beam's Python SDK.

Open alxmrs opened this issue 1 year ago • 22 comments

What would you like to happen?

Acquiring scientific dependencies in the Python ecosystem is challenging. pip and apt-get alone are not sufficient, for various reasons, the most significant being community adoption: the scientific Python community has standardized on one package manager, Anaconda. Within that ecosystem, most scientific software is built and distributed via conda-forge.

Given this, I propose the following: the Apache Beam project should build a new set of Docker images that include a conda-managed Python environment. The Dockerfile for the containers could look like this:

ARG py_version=3.8
FROM apache/beam_python${py_version}_sdk:2.40.0 as beam_sdk
FROM continuumio/miniconda3:4.12.0
ARG py_version

# Update miniconda
RUN conda update conda -y

# Install desired python version
RUN conda install python=${py_version} -y

# Install SDK.
RUN pip install --no-cache-dir apache-beam[gcp]==2.40.0

# Verify that the image does not have conflicting dependencies.
RUN pip check

# Copy files from official SDK image, including script/dependencies.
COPY --from=beam_sdk /opt/apache/beam /opt/apache/beam

# Set the entrypoint to Apache Beam SDK launcher.
ENTRYPOINT ["/opt/apache/beam/boot"]

From such an image, Python SDK users would gain immense flexibility in adding dependencies to their Beam runtime environment (especially on Dataflow, and likely on all remote Beam runners). For example, adding a genuinely difficult-to-install dependency would be as easy as adding

conda install <package-name> -c conda-forge -y

to a setup.py file (following the CUSTOM_COMMANDS pattern).
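For reference, here is a minimal sketch of that pattern, modeled on the CUSTOM_COMMANDS approach in Beam's example setup.py files; the project name and the conda package below are placeholders:

import subprocess
import setuptools
from distutils.command.build import build as _build

# Shell commands to run on the worker at staging time (placeholders).
CUSTOM_COMMANDS = [
    ['conda', 'install', '-c', 'conda-forge', '-y', '<package-name>'],
]

class build(_build):
    # Append the custom-command step to the standard build sequence.
    sub_commands = _build.sub_commands + [('CustomCommands', None)]

class CustomCommands(setuptools.Command):
    # Run each command in CUSTOM_COMMANDS, failing on a nonzero exit code.
    user_options = []

    def initialize_options(self):
        pass

    def finalize_options(self):
        pass

    def run(self):
        for command in CUSTOM_COMMANDS:
            subprocess.check_call(command)

setuptools.setup(
    name='my-pipeline',  # placeholder
    version='0.0.1',
    packages=setuptools.find_packages(),
    cmdclass={'build': build, 'CustomCommands': CustomCommands},
)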

Why should Apache Beam do this rather than a third party?

This is a unique application of custom containers in Beam. Instead of an image with specific dependencies for one application, this package manager can obtain nearly all dependencies in the Python ecosystem. I argue that it makes sense for a member of the Apache project, or a similarly open and community-federated project, to manage and host this image, in order to guard against potential supply-chain attacks. Further, including conda as a Python SDK runtime environment would accelerate dependency management on Apache Beam, especially for the PyData stack: it would help avoid the proliferation of many similar Docker images (each hosting a specific dependency, or else duplicating the work of hosting conda).

Issue Priority

Priority: 3

Issue Component

Component: sdk-py-core

alxmrs avatar Jul 20 '22 00:07 alxmrs

CC: @pabloem, @rabernat, @cisaacstern

alxmrs avatar Jul 20 '22 00:07 alxmrs

Thanks for raising this, Alex.

@yuvipanda found a solution for us in Pangeo Forge which achieves this goal: https://github.com/pangeo-data/pangeo-docker-images/pull/355

Would be great to generalize this for the Beam scientific Python community.

cisaacstern avatar Jul 20 '22 00:07 cisaacstern

IIUC, Yuvi discovered that copying the pre-compiled Beam boot script could be brittle for this use case, because it invokes system Python, whereas for this use case we want Python processes to run from conda Python. That's why, in the Dockerfile in the above-linked PR, he compiles the Beam boot script from the Go source. There are almost certainly details here I'm overlooking, but I wanted to highlight this point, as it was non-obvious to me and seems an important consideration for this feature.

cisaacstern avatar Jul 20 '22 00:07 cisaacstern

@yuvipanda has provided some further suggestions in https://github.com/pangeo-data/pangeo-docker-images/pull/355#issuecomment-1190546160, in response to @TheNeuralBit's inquiry on that PR. AFAICT, his suggestion of deprecating the requirement for a Beam ENTRYPOINT boot script, in favor of having the apache-beam Python package support Dataflow deployment automatically whenever it is installed, may simplify this considerably. In that scenario, any user-provided image with apache-beam installed could work on Dataflow, and it would be up to the user to decide whether system or conda Python comes first in $PATH. I'm somewhat out of my depth on implementation details here, but Yuvi suggested this would be possible (in the linked PR), and it seems like the most general solution. @alxmrs, thoughts?

cisaacstern avatar Jul 20 '22 19:07 cisaacstern

FYI @tvalentyn

TheNeuralBit avatar Jul 20 '22 20:07 TheNeuralBit

it invokes system Python, whereas for this use case we want Python processes to run from conda Python

Interesting – I haven't hit this problem. I use the Docker image in the issue description (or see here), along with a build step in a setup.py file, to install conda dependencies.

I just took a look at the pangeo Dockerfile that compiles the boot binary from the Go sources. I think this step is unnecessary; rather, I believe there are other strategies to set up conda Python on the image. My approach works by installing deps into the base conda environment instead of a named conda env – maybe that's the source of the problem?

On @yuvipanda's suggestions: I think the status quo offers enough hooks to accomplish much of what Yuvi is asking for. For example, Beam's boot sequence will eventually call setup.py install (and setup.py build). From there, you can install required dependencies, or use distutils to perform sophisticated build actions, including running raw Linux commands. Having said all this, I actually don't really know what problem Yuvi / the forge container is trying to solve in the first place. Thus, I am probably totally off base. @cisaacstern or @yuvipanda, can you help me understand the infrastructure?

To add a high-level note on my proposal: I think of this Beam/conda Docker image (with the boot entrypoint) as the runtime environment, from which we can add further dependencies at startup time. This is unlike the typical Docker approach, where we include all dependencies at image build time.

alxmrs avatar Jul 20 '22 20:07 alxmrs

Thanks a lot for working through this, @alxmrs!

I actually don't really know what problem Yuvi / the forge container is trying to solve in the first place.

This is a great question! We (the Pangeo project) maintain a set of Docker images pre-built with specific pinned versions of common dependencies in the earth-sciences ecosystem: https://github.com/pangeo-data/pangeo-docker-images/. We provide dated tags that people can reference and use wherever they need to run code - in JupyterHubs (for interactive Jupyter use), in Dask (for scale-out workflows), etc. The goal of the forge/ image is to provide a version that is usable in Apache Beam contexts. These are fairly heavy images - the conda-based environment build step takes at least 10 minutes, and often longer, to run - so we can't really do these at runtime. We also want to make sure the packages are tested to work together, as they often have complex C (or even Fortran!) dependencies. There's also a reproducibility angle here: specifying the Docker image tag a workflow uses provides a better chance of long-term reproducibility than just a list of packages to install.

The goal is for end users to be able to pick a tag and know that it works with the rest of the geosciences stack curated by pangeo. I hope that helps clarify the goal of the forge/ image.

I'm not entirely sure what the original problem with copying the Go binary was, as long as we weren't copying the Python packages. Possibly something to do with the inherited entrypoint? I'm doing some funky stuff in https://github.com/pangeo-data/pangeo-docker-images/pull/355/files#diff-a77643b43a7be453fa8556937bf32b27907e152a10d4c693f3e7670c66a44378 to have the entrypoint work both for Beam and for Jupyter.

yuvipanda avatar Jul 20 '22 20:07 yuvipanda

@akedin it looks like the Go binary isn't statically built (https://github.com/pangeo-data/pangeo-docker-images/runs/7438257343?check_suite_focus=true - the static-build test I have fails). I don't think it'll be a problem for us specifically (because we're on Ubuntu), but would you consider turning cgo off so these can be fully static builds?
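For context, a Go build produces a fully static binary when cgo is disabled at build time; a sketch, where the source directory inside the Beam repo is an assumption:

# CGO_ENABLED=0 forces a pure-Go build with no dynamic libc linkage.
cd beam/sdks/python/container
CGO_ENABLED=0 go build -o boot .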

yuvipanda avatar Jul 20 '22 21:07 yuvipanda

Thanks for the explanation, Yuvi! That's really helpful. One concept that I'm missing, however, is: what does the pangeo forge entrypoint need to do, besides boot Beam? Are there startup tasks needed for PGF? If so, what are they?

alxmrs avatar Jul 20 '22 21:07 alxmrs

@alxmrs it needs to boot Beam when called by Beam, but be able to run other programs when not called by Beam. If we unconditionally set the entrypoint to Beam's boot, the image isn't usable in interactive contexts - particularly on mybinder.org or in various Jupyter contexts. Interactive use is very helpful for users who want to try out their code and debug failures before running it through Beam. I hope that makes sense?
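To make that concrete, a dual-purpose entrypoint could look something like the following sketch; this is an illustration of the idea rather than the actual pangeo-docker-images script, and the flag check is an assumption based on how Beam runners invoke the boot binary:

#!/bin/bash
# Beam runners invoke the container with boot flags such as --id and
# --provision_endpoint; hand those off to the Beam boot binary.
if [[ "$1" == --id=* ]]; then
    exec /opt/apache/beam/boot "$@"
else
    # Otherwise run whatever command was given (e.g. jupyter), keeping
    # the image usable in interactive contexts.
    exec "$@"
fi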

yuvipanda avatar Jul 20 '22 22:07 yuvipanda

Maybe we should have a video chat soon to expedite this discussion. The Beam Docker image, in my understanding, exists to package a runtime environment for Beam workers (the specifics depend on the runner used; most commonly, it is the base image for workers running a Dataflow job). If you need to use Beam in an interactive context like a Jupyter notebook, why use the Docker image at all? Why not just install Beam with pip / conda, and let users use the local runner? This would enable users to iterate on the pipelines themselves before the data-processing step.

If this is an accurate model of the problem you're trying to solve, then I'd like to make explicit the two types of runtime environments:

  • Development time / interactive use. This could use the Python Beam SDK.
  • Remote execution & deployment. This is closer to what I had in mind with this issue.

From here, I can see how there's a desire to create one image to handle both use cases. However, I bet it's better to handle each with its own Docker image, perhaps via a multistage build.
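For illustration, a single Dockerfile with two build targets could serve both cases; the base image, package list, and tags below are all assumptions:

# Interactive target: conda environment for notebooks and local debugging.
FROM continuumio/miniconda3:4.12.0 AS interactive
RUN conda install -c conda-forge -y xarray zarr
# Build with: docker build --target interactive .

# Worker target: layers the Beam SDK and boot entrypoint on the same env.
FROM interactive AS worker
RUN pip install --no-cache-dir apache-beam[gcp]==2.40.0
COPY --from=apache/beam_python3.8_sdk:2.40.0 /opt/apache/beam /opt/apache/beam
ENTRYPOINT ["/opt/apache/beam/boot"]
# Build with: docker build --target worker .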

alxmrs avatar Jul 20 '22 22:07 alxmrs

Happy to get on a video call too :) What TZ are you in?

The point isn't to use Beam in an interactive context, but to use, in an interactive context, the libraries that will be used in the Python code run by Beam. For example, https://github.com/pangeo-forge/staged-recipes/blob/master/recipes/cmip6/recipe.py will eventually be run on Beam, and it uses https://github.com/pangeo-forge/pangeo-forge-recipes - which in turn uses libraries like xarray, zarr, etc. So the goal is to test that. And these aren't being used in local Jupyter environments, but in cloud-hosted environments running on Kubernetes (via z2jh.jupyter.org, or tools like mybinder.org). See the 'pangeo forge sandbox' here - https://pangeo-forge.readthedocs.io/en/latest/introduction_tutorial/index.html. It's running inside an ephemeral Docker container on mybinder.org, which is based on JupyterHub.

Why not just install beam with pip / conda, and let users use the local runner?

There is ongoing work on that too! The goal is to support three different contexts:

  1. Users' local machines (via conda metapackages or similar that provide pinned versions - not implemented yet)
  2. On cloud hosted interactive Jupyter environment, inside Docker / k8s
  3. When called by Beam on Dataflow.

I hope this helps!

@cisaacstern probably has more details than I do!

yuvipanda avatar Jul 21 '22 01:07 yuvipanda

That's a very good summary, @yuvipanda. I don't think I have anything to add, but happy to answer any further questions here.

cisaacstern avatar Jul 21 '22 17:07 cisaacstern

What TZ are you in?

I'm in PST! Happy to jump on a call as soon as tomorrow or next week. What time zone are you in?

Thanks for the explanation of the problem at hand. I have a few ideas that could help; let's talk about it soon.

You can reach out to me at [email protected] for DMs/scheduling, too.

alxmrs avatar Jul 21 '22 21:07 alxmrs

Thanks for opening this. We looked into investing more in conda support, but so far these efforts were not prioritized sufficiently. Some prior discussions linked from https://github.com/apache/beam/issues/21481 didn't suggest strong interest, so I really appreciate the feedback here.

Noting that you suggest installing Apache Beam with pip. I have heard that when installations from conda repositories and pip repositories are combined in the same environment, it can cause interoperability issues between libraries that include C extensions / compiled code. I wonder how frequently this comes up in practice, in your experience?

tvalentyn avatar Aug 04 '22 01:08 tvalentyn

cc: @AnandInguva

tvalentyn avatar Aug 04 '22 01:08 tvalentyn

Noting that you suggest installing Apache Beam with pip.

We switched back to installing it with conda now that the conda-forge package for apache-beam is fixed (https://github.com/conda-forge/apache-beam-feedstock/pull/52). Mixing packages mostly works until it doesn't, of course... :)

I think at least with the Dataflow runner, Beam just installs itself with pip at container startup time, regardless of whether it's already installed there? This actually also downgrades numpy for some reason... Would be nice if that was conditional!

yuvipanda avatar Aug 04 '22 01:08 yuvipanda

Would be nice if that was conditional!

Right. We could make a choice or perhaps see if conda is available, then install from conda (potentially not backwards-compatible behavior).

Mixing packages mostly works until it doesn't, of course... :)

What are typical symptoms of failure?

tvalentyn avatar Aug 04 '22 02:08 tvalentyn

https://github.com/conda-forge/apache-beam-feedstock/pull/52

Thanks for doing this. Is the release process to conda-forge mostly automated?

tvalentyn avatar Aug 04 '22 02:08 tvalentyn

Right. We could make a choice or perhaps see if conda is available, then install from conda (potentially not backwards-compatible behavior).

Or rather, check whether the appropriate apache_beam package is already installed (even via something as simple as pip list | grep apache_beam - this reports installs via conda or pip) and skip installation if it's present and the right version. This also speeds up installations!
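A sketch of how that check might look in a startup script; BEAM_VERSION is a hypothetical variable, and note that pip normalizes the distribution name to apache-beam:

# Skip (re)installation when the right version is already present,
# whether it arrived via pip or conda.
if pip list 2>/dev/null | grep -qE "^apache-beam[[:space:]]+${BEAM_VERSION}"; then
    echo "apache-beam ${BEAM_VERSION} already installed; skipping."
else
    pip install --no-cache-dir "apache-beam[gcp]==${BEAM_VERSION}"
fi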

yuvipanda avatar Aug 04 '22 02:08 yuvipanda

Thanks for doing this. Is the release process to conda-forge mostly automated?

I'm not an expert in the conda-forge setup, but I think it's sort-of-mostly automated?

yuvipanda avatar Aug 04 '22 04:08 yuvipanda

Note: Google will prioritize and address this in Q4, unless someone from the community would like to tackle it earlier.

ryanthompson591 avatar Aug 09 '22 18:08 ryanthompson591

Circling back on this issue: I think someone here (maybe @yuvipanda, @blackvvine, or myself) may want to expedite work on this solution. Here are a number of related issues that could be solved by the existence of this image:

  • pangeo-forge-runner: https://github.com/pangeo-forge/pangeo-forge-runner/pull/39

    Right now they're offering a requirements.txt to manage dependencies. However, this will not address installing non-pip-managed deps, namely those that require binary installation on the host machine.

  • weather-tools: https://github.com/google/weather-tools/issues/179 (relevant discussion also in https://github.com/google/weather-tools/issues/217)

    Our project currently makes use of dependencies that are only available via conda-forge (the apt-get install path is broken and not maintained by the original developer). @blackvvine and I are working together in https://github.com/conda-forge/staged-recipes/pull/20892 to release our project, primarily so that we can have consistent local and remote environments, made possible by conda.

@ryanthompson591: Q4 came sooner than I thought! Is anyone tasked with working on this feature right now? Either way, how about the 3-4 of us have a meeting to discuss logistics related to the creation of this image?

Once the image is operational, we'll definitely make use of it. Then, to gain wider adoption by all Beam Python users, I'm happy to make progress on #22675.

alxmrs avatar Nov 07 '22 20:11 alxmrs

@alxmrs Thanks a lot for circling back on this; I am not aware of anyone working on this feature right now.

Thanks so much for volunteering to offer help here.

I think it should be straightforward and reasonable to add conda support to Beam's Docker entrypoint. Actually including a conda-enabled image among our released artifacts would require more consideration, though, as it would mean taking on an additional support burden to maintain these images going forward. So we should evaluate demand, and weigh the pros and cons of a conda-enabled image versus offering an easy container-customization path for users who would like to build a custom conda-enabled image to use in their pipelines.

Something we could start with is a conda-aware entrypoint and documentation on how to build a custom conda-based image from a Dockerfile, together with integration tests that verify this solution continues to work.
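As a strawman for what a conda-aware entrypoint step could do, where REQUIREMENTS_FILE and the exact conda invocation are assumptions:

# Prefer conda for staged extra dependencies when conda is on the image;
# fall back to pip otherwise.
if command -v conda >/dev/null 2>&1 && [[ -f "$REQUIREMENTS_FILE" ]]; then
    conda install -y -c conda-forge --file "$REQUIREMENTS_FILE"
elif [[ -f "$REQUIREMENTS_FILE" ]]; then
    pip install --no-cache-dir -r "$REQUIREMENTS_FILE"
fi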

After ironing out rough edges, we could consider the more permanent solution and potentially release the conda image. I think that's how I would approach it currently, but this is a very good conversation to have on the dev@ mailing list.

Would you be interested in starting a one-pager with your proposal and sending it out?

tvalentyn avatar Nov 15 '22 18:11 tvalentyn

Sorry for the late reply. That sounds like a plan.

In my primary project, we're developing a project-specific Docker image for better conda/Dataflow deployments. We'll see whether, in the future, there are lessons we can take back to the dev list about how to better integrate conda.

I'd be happy to collaborate on a 1-pager with @yuvipanda over the next few weeks (time permitting) that takes lessons from weather-tools and the pangeo-forge runner.

alxmrs avatar Jan 27 '23 00:01 alxmrs

Glad to see this moving forward! Happy to contribute however I can, though I agree Yuvi is the more qualified person to speak to specifics for Pangeo Forge. In general, we continue to feel that conda support in an official Beam image would be very valuable to us.

cisaacstern avatar Jan 27 '23 01:01 cisaacstern