
[BEAM-12792] Install pipeline dependencies to temporary venv

Open · phoerious opened this pull request 3 years ago · 60 comments

The change allows users to submit multiple pipelines to the same SDK container without dependency conflicts. Dependencies in --setup_only mode are still installed globally.

At the moment, this is more of a draft and needs further inspection by someone more knowledgeable about the SDK container plumbing, since there are still a few life cycle issues.

With this patch, the deployment cycle looks like this:

  1. Worker pool is spawned (PID 1 in SDK container => worker pool ID = 1).
  2. Flink starts a job and a worker gets spun up with ID 1-1.
  3. The worker creates a temporary venv in /tmp/beam-venv/beam-venv-worker-1-1 and deploys the artifact with its dependencies (see the sketch after this list).
  4. Processing happens ...
  5. Everything is cleaned up and a new job can be submitted.
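In sketch form (illustrative Go, not the literal boot.go code; workerId and requirementsFile are placeholder names), step 3 amounts to:

import (
    "os"
    "os/exec"
    "path/filepath"
)

// setupWorkerVenv sketches step 3: create a venv namespaced by the worker ID
// under /tmp/beam-venv and install the staged requirements into it.
// Everything here is illustrative; the real logic lives in sdks/python/container/boot.go.
func setupWorkerVenv(workerId, requirementsFile string) (string, error) {
    venvDir := filepath.Join("/tmp/beam-venv", "beam-venv-worker-"+workerId)

    // Equivalent of: python3 -m venv --system-site-packages <venvDir>
    // (--system-site-packages is shown as one plausible choice so the packages
    // preinstalled in the container image stay visible).
    if err := exec.Command("python3", "-m", "venv", "--system-site-packages", venvDir).Run(); err != nil {
        return "", err
    }

    // Use the venv's own pip so the dependencies land in the venv, not globally.
    pip := filepath.Join(venvDir, "bin", "pip")
    cmd := exec.Command(pip, "install", "-r", requirementsFile)
    cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
    return venvDir, cmd.Run()
}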

~Main problems: I don't get a job ID from the provision info (only a name), so I cannot namespace the venv properly and multiProcessExactlyOnce() is bound to the entire worker pool life cycle. Hence, for 5. to work, I had to relax multiProcessExactlyOnce() to clean up its completion file after all child processes have finished.~

~Obvious problem with that: If a second worker process is submitted to the pool later on, whoever finishes first will clean up the resources without knowledge of any other workers that may still be running. If that worker is for another concurrent job, the venv dependencies will conflict (as before), since the venv is global to the entire pool.~

~Now I don't know if and how any of this can happen. The boot binary seems to take care of spawning sibling workers from the same parent worker process, but from the looks of it, worker_pool_main.py might spawn yet another worker at any time.~

~Part of a solution would be to bind the artifact deployment to a job ID (which I don't have), but then I'd still need to know when it is safe to do the cleanup. Another solution would be to re-deploy the artifact and its dependencies for each worker (probably safer, but with more redundancy).~

The venvs are bound to individual workers (or sets of workers if they have siblings) and are cleaned up as soon as they finish. This makes multiProcessExactlyOnce() obsolete, so it was removed. Additional care was taken to ensure a clean signal flow between individual components, so anything that would cause a premature os.Exit() or a SIGKILL to be sent was adjusted to allow for clean worker shutdown.
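For illustration, the shape of the lifecycle handling is roughly this (a minimal sketch, not the actual boot.go diff; runHarness stands in for whatever starts the SDK harness process):

import (
    "log"
    "os"
    "os/signal"
    "syscall"
)

// runWorker sketches the lifecycle described above: trap SIGTERM/SIGINT instead
// of being killed outright, so the deferred removal of the per-worker venv
// always runs and the worker can shut down cleanly.
func runWorker(venvDir string, runHarness func() error) error {
    defer os.RemoveAll(venvDir) // clean up the temporary venv on every exit path

    sigs := make(chan os.Signal, 1)
    signal.Notify(sigs, syscall.SIGTERM, syscall.SIGINT)

    done := make(chan error, 1)
    go func() { done <- runHarness() }()

    select {
    case err := <-done:
        return err
    case sig := <-sigs:
        log.Printf("Received %v, shutting down worker cleanly", sig)
        return nil
    }
}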

BTW, is it intended that the Go code is indented with tabs?!

R: @tvalentyn


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • [x] Choose reviewer(s) and mention them in a comment (R: @username).
  • [x] Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
  • [x] Update CHANGES.md with noteworthy changes.
  • [ ] If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make the review process smoother.

ValidatesRunner compliance status (on master branch): build status badges for the Go, Java, Python, and XLang SDKs across the ULR, Dataflow, Flink, Samza, Spark, and Twister2 runners.

Examples testing status on various runners: build status badges (Java only; other SDKs not covered).

Post-Commit SDK/Transform Integration Tests Status (on master branch): build status badges for Go, Java, and Python.

Pre-Commit Tests Status (on master branch): build status badges for Java, Python, Go, Website, Whitespace, and Typescript (non-portable and portable variants).

See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.

GitHub Actions Tests Status (on master branch): badges for "Build python source distribution and wheels", "Python tests", and "Java tests".

See CI.md for more information about GitHub Actions CI.

phoerious · Jan 31 '22

cc: @ryanthompson591 for initial pass.

tvalentyn · Feb 01 '22

Any progress here?

phoerious · Feb 04 '22

R: @tvalentyn

ryanthompson591 · Feb 07 '22

@ryanthompson591 @tvalentyn I updated the PR. The venvs are now using random names and are bound to the workers, which is the only way to make this safe.

I also fixed how workers are cleaned up. Previously, they were simply SIGKILL'ed by the worker pool Python executable, which prevented any kind of cleanup and also caused zombie processes inside the containers. I think there are still some cases where processes are not cleaned up properly and keep running forever, but most of that should be fixed now. It happens particularly when I'm using a global CombineFn, which causes Flink to believe that the last remaining worker is still running even though it has long finished; when that happens, not even cancelling the job will send signals to the remaining workers. But that's another bug (I reported it on the mailing list before, but never got a response).

All of this needs some more testing, but it seems to be running fine on my Flink cluster at least.

phoerious · Feb 21 '22

Codecov Report

Merging #16658 (5a91e67) into master (4522f4c) will decrease coverage by 0.00%. The diff coverage is 7.14%.

@@            Coverage Diff             @@
##           master   #16658      +/-   ##
==========================================
- Coverage   73.32%   73.32%   -0.01%     
==========================================
  Files         714      714              
  Lines       96403    96409       +6     
==========================================
+ Hits        70685    70688       +3     
- Misses      24398    24401       +3     
  Partials     1320     1320              
Flag Coverage Δ
python 83.20% <7.14%> (-0.01%)

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
...hon/apache_beam/runners/worker/worker_pool_main.py 56.32% <7.14%> (-2.94%)
.../python/apache_beam/transforms/periodicsequence.py 98.50% <0.00%> (+1.49%)
sdks/python/apache_beam/utils/interactive_utils.py 97.56% <0.00%> (+2.43%)


codecov[bot] · Feb 21 '22

@tvalentyn @ryanthompson591 All right, I think I have it at a point where it's ready for a final review. It runs robustly on a 130-node Flink cluster, all processes are reliably and gracefully cleaned up, and I haven't seen any zombie processes.

(also pinging the guys from the OWNERS file, maybe someone has additional remarks: @herohde @aaltay @charlesccychen)

phoerious · Feb 22 '22

Current set of reviewers looks great. I do not have much to add. Thanks for adding me.

worker_pool_main changes - R: @y1chi could review.

aaltay · Feb 22 '22

I added one more change: If a worker exits with a non-zero exit code, the boot binary also exits with a non-zero code. That makes it easier to debug things like Python interpreter crashes, e.g., due to failures in a native extension. That case was very hard to debug previously.
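In sketch form, the change amounts to something like this (cmd being the exec.Cmd that runs the Python worker; the function name is illustrative, not the actual boot.go code):

import (
    "os"
    "os/exec"
)

// waitAndPropagate mirrors the worker's exit status from the boot binary
// instead of always exiting 0, so interpreter crashes become visible to the runner.
func waitAndPropagate(cmd *exec.Cmd) {
    if err := cmd.Wait(); err != nil {
        if exitErr, ok := err.(*exec.ExitError); ok {
            os.Exit(exitErr.ExitCode())
        }
        os.Exit(1)
    }
}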

phoerious · Feb 23 '22

I reverted the return code thing, because it turns out Flink likes to send SIGTERM also on successful completion, so it's of no use.

phoerious · Feb 24 '22

Any reviews? I've been using my patch on our cluster to process 1 billion web pages already, with no issues. I cannot vouch for anything other than Flink, however.

phoerious · Mar 01 '22

> Current set of reviewers looks great. I do not have much to add. Thanks for adding me.
>
> worker_pool_main changes - R: @y1chi could review.

LGTM for worker_pool_main changes

y1chi · Mar 01 '22

Run Python Dataflow ValidatesContainer

tvalentyn · Mar 05 '22

@phoerious please don't squash reviewed and unreviewed code until the review has finalized, see: https://beam.apache.org/contribute/#make-the-reviewers-job-easier

tvalentyn · Mar 07 '22

Looks like Dataflow's IT tests on a prior iteration failed with:

2022/03/05 02:49:15 Found artifact: dataflow_python_sdk.tar
2022/03/05 02:49:15 Installing setup packages ...
2022/03/05 02:49:15 Cleaning up temporary venv ...
2022/03/05 02:49:15 Failed to install required packages: failed to install SDK: fork/exec /var/opt/google/beam-venv/beam-venv-worker-sdk0/bin/pip: permission denied

Jenkins logs: https://ci-beam.apache.org/job/beam_PostCommit_Py_ValCont_PR/227/ (although these don't contain the Dataflow job logs)

tvalentyn · Mar 07 '22

Strange. Does Dataflow provision local file systems that prevent setting the +x flag on files?

phoerious · Mar 08 '22

Actually, can you point me to the correct logs? I cannot find the error you are referring to, only seemingly unrelated errors.

phoerious · Mar 08 '22

Dataflow just spins up a Docker container using kubelet, with no special filesystem configuration. I am afraid you might not be able to see the logs, since they require viewer access to the apache-beam-testing GCP project, but if you have a Google account, I can try to add you; feel free to send me an email.

tvalentyn · Mar 08 '22

Let's make sure the error is still reproducible with the latest changes.

tvalentyn · Mar 08 '22

Run Python Dataflow ValidatesContainer

tvalentyn · Mar 08 '22

Looks like it indeed still fails. Actually, Dataflow does do some configuration. This may have to do with how the 'semi-persistent' directory is mounted into the container; I think in Dataflow's case it is configured as an additional Docker volume.

tvalentyn · Mar 09 '22

Unrelated: is the PR description up to date?

tvalentyn · Mar 09 '22

I updated the description.

phoerious · Mar 09 '22

@tvalentyn What's the plan now? I can't really debug Google Dataflow on my own.

phoerious · Mar 14 '22

I'll take a look or find somebody who can help.

tvalentyn · Mar 18 '22

@tvalentyn Sorry for nagging, but any news?

phoerious · Apr 05 '22

@tvalentyn is currently OOO.

@kerrydc do you know if someone else could help?

aaltay · Apr 06 '22

I spent some time debugging this today and got this far:

The Dataflow container pod for the SDK container has the following mount (from the container manifest):

"volumeMounts": [ {
        "mountPath": "/var/opt/google",
        "name": "persist"
      }, 

...
"volumes": [ {
      "hostPath": {
        "path": "/var/opt/google/dataflow"
      },
      "name": "persist"
    }

The above maps a persistent directory /var/opt/google/dataflow on the host VM into the directory /var/opt/google in the running containers. This directory is passed to the SDK harness container via the --semi_persist_dir=/var/opt/google parameter.

We can see this directory referenced in the errors:

"2022/04/08 20:40:15 Failed to install required packages: failed to install SDK: fork/exec /var/opt/google/beam-venv/beam-venv-worker-sdk-0-0/bin/pip: permission denied

I can reproduce this behavior if I SSH into a VM created by Dataflow and manually start a container via:

docker run -it --entrypoint=/bin/bash -v /var/opt/google/dataflow:/var/opt/google apache/beam_python3.7_sdk

root@beamapp-valentyn-04082037-04081337-w2xj-harness-zq0w:/# python -m venv /var/opt/google/env
root@beamapp-valentyn-04082037-04081337-w2xj-harness-zq0w:/# /var/opt/google/env/bin/pip
bash: /var/opt/google/env/bin/pip: Permission denied

On the other hand, creating a venv elsewhere works:

root@beamapp-valentyn-04082037-04081337-w2xj-harness-zq0w:/# python -m venv /var/opt/env
root@beamapp-valentyn-04082037-04081337-w2xj-harness-zq0w:/# /var/opt/env/bin/pip

Usage:   
  pip <command> [options]

Permissions on the files are identical:

root@beamapp-valentyn-04082037-04081337-w2xj-harness-zq0w:/# ls -al /var/opt/env/bin/pip
-rwxr-xr-x 1 root root 228 Apr  8 21:29 /var/opt/env/bin/pip
root@beamapp-valentyn-04082037-04081337-w2xj-harness-zq0w:/# ls -al /var/opt/google/env/bin/pip
-rwxr-xr-x 1 root root 235 Apr  8 21:29 /var/opt/google/env/bin/pip

But for some reason, when we create a venv in the mounted directory, the pip script cannot launch. I am not sure what is going on.

tvalentyn · Apr 08 '22

Here's a repro without Dataflow in the picture:

gcloud compute instances create  valentyn-cos2 --zone=us-central1-f --project=<my gcp project> --image-family cos-stable --image-project=cos-cloud   --restart-on-failure  
gcloud compute  ssh valentyn-cos2 --zone=us-central1-f --project=<my gcp project>
docker run -it --entrypoint=/bin/bash -v /tmp:/var/opt/google python:3.7-bullseye

python -m venv /var/opt/google/env
root@018e7f3cac24:/# /var/opt/google/env/bin/pip
bash: /var/opt/google/env/bin/pip: Permission denied

tvalentyn · Apr 08 '22

As for what to do next... @phoerious, if you are interested and have access to a GCP project, you could continue investigating what happens in https://github.com/apache/beam/pull/16658#issuecomment-1093407575.

Another avenue is to exclude the DataflowRunner from the new behavior. I think we should be able to cleanly detect the runner by analyzing the pipeline options in https://github.com/apache/beam/blob/2d27f44f58112ca4a5689614a94592de4e28d476/sdks/python/container/boot.go#L138
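Roughly something like this (a sketch only: it assumes the options are available as a JSON blob with a top-level runner key; the actual key and the way boot.go reads the options may differ):

import (
    "encoding/json"
    "strings"
)

// isDataflowRunner sketches the suggested check; the "runner" option name and
// the JSON shape are assumptions, not the verified provision API format.
func isDataflowRunner(optionsJSON string) bool {
    var opts struct {
        Runner string `json:"runner"`
    }
    if err := json.Unmarshal([]byte(optionsJSON), &opts); err != nil {
        return false
    }
    return strings.EqualFold(opts.Runner, "DataflowRunner")
}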

tvalentyn · Apr 08 '22

Another possibility is to try using python -m pip install in place of pip install throughout. It may help as well.
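i.e., roughly this shape of call instead of executing the pip wrapper script directly (a sketch; pythonPath and requirementsFile are placeholder names):

import (
    "os"
    "os/exec"
)

// pipInstallViaModule runs pip as a module of the interpreter rather than
// executing the venv's pip wrapper script, which may sidestep the
// "permission denied" on the mounted directory.
// pythonPath would be the venv's interpreter, e.g. <venvDir>/bin/python.
func pipInstallViaModule(pythonPath, requirementsFile string) error {
    cmd := exec.Command(pythonPath, "-m", "pip", "install", "-r", requirementsFile)
    cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
    return cmd.Run()
}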

tvalentyn · Apr 08 '22