[BEAM-12792] Install pipeline dependencies to temporary venv
The change allows users to submit multiple pipelines to the same SDK container without dependency conflicts. Dependencies in `--setup_only` mode are still installed globally.
At the moment, this is more of a draft and needs further inspection by someone more knowledgeable about the SDK container plumbing, since there are still a few life cycle issues.
With this patch, the deployment cycle looks like this:

1. Worker pool is spawned (PID 1 in SDK container => worker pool ID `1`).
2. Flink starts a job and a worker gets spun up with ID `1-1`.
3. The worker creates a temporary venv in `/tmp/beam-venv/beam-venv-worker-1-1` and deploys the artifact with its dependencies.
4. Processing happens ...
5. Everything is cleaned up and a new job can be submitted.
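The per-worker venv creation and cleanup described above can be sketched in Python as follows (the actual implementation lives in the Go boot binary; the helper names are mine, and `--without-pip` is only a shortcut for the sketch — the real flow bootstraps pip and installs the pipeline dependencies into the venv afterwards):

```python
import shutil
import subprocess
import sys
import tempfile


def create_worker_venv(worker_id: str) -> str:
    """Create an isolated venv for one worker (hypothetical helper).

    The directory name embeds the worker ID, so workers belonging to
    concurrent jobs never share site-packages.
    """
    venv_dir = tempfile.mkdtemp(prefix="beam-venv-worker-%s-" % worker_id)
    subprocess.run(
        [sys.executable, "-m", "venv", "--without-pip", venv_dir], check=True)
    return venv_dir


def destroy_worker_venv(venv_dir: str) -> None:
    """Remove the venv once its worker has shut down cleanly."""
    shutil.rmtree(venv_dir, ignore_errors=True)
```

Because each worker owns its venv, cleanup can happen as soon as that worker exits, without coordinating with other workers in the pool.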
~~Main problems: I don't get a job ID from the provision info (only a name), so I cannot namespace the venv properly, and `multiProcessExactlyOnce()` is bound to the entire worker pool life cycle. Hence, for 5. to work, I had to relax `multiProcessExactlyOnce()` to clean up its completion file after all child processes have finished.~~

~~Obvious problem with that: If a second worker process is submitted to the pool later on, whoever finishes first will clean up the resources without knowledge of any other workers that may still be running. If that worker is for another concurrent job, the venv dependencies will conflict (as before), since the venv is global to the entire pool.~~

~~Now I don't know if and how any of this can happen. The `boot` binary seems to take care of spawning sibling workers from the same parent worker process, but from the looks of it, `worker_pool_main.py` might spawn yet another worker at any time.~~

~~Part of a solution would be to bind the artifact deployment to a job ID (which I don't have), but then I'd still need to know when it is safe to do the cleanup. Another solution would be to re-deploy the artifact and its dependencies for each worker (probably safer, but with more redundancy).~~
The venvs are bound to individual workers (or sets of workers if they have siblings) and are cleaned up as soon as they finish. This makes `multiProcessExactlyOnce()` obsolete, so it was removed. Additional care was taken to ensure a clean signal flow between individual components, so anything that would cause a premature `os.Exit()` or a `SIGKILL` to be sent was adjusted to allow for clean worker shutdown.
BTW, is it intended that the Go code is indented with tabs?
R: @tvalentyn
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
- [x] Choose reviewer(s) and mention them in a comment (`R: @username`).
- [x] Format the pull request title like `[BEAM-XXX] Fixes bug in ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
- [x] Update `CHANGES.md` with noteworthy changes.
- [ ] If this contribution is large, please file an Apache Individual Contributor License Agreement.
See the Contributor Guide for more tips on how to make review process smoother.
cc: @ryanthompson591 for initial pass.
Any progress here?
R: @tvalentyn
@ryanthompson591 @tvalentyn I updated the PR. The venvs now use random names and are bound to the workers, which is the only way to make this safe.
I also fixed how workers are cleaned up. Previously, they were simply SIGKILL'ed by the worker pool Python executable, which prevented any kind of cleanup and also left zombie processes inside the containers. Most of that should be fixed now, though there may still be cases where processes are not cleaned up properly and keep running forever. That happens particularly when I use a global CombineFn, which causes Flink to believe that the last remaining worker is still running even though it has long finished; when that happens, not even cancelling the job will send signals to the remaining workers. But that's another bug (I reported it on the mailing list before, but never got a response).
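The zombie problem stems from nobody calling wait() on killed children. A minimal sketch of a SIGCHLD-based reaper, assuming the pool process is the parent (the helper name is mine, not Beam's):

```python
import os
import signal


def install_child_reaper() -> None:
    """Reap exited children so they don't accumulate as zombies.

    Sketch of the idea only: a container's init-like process must
    wait() on its exited children, otherwise killed workers linger
    as <defunct> entries in the process table.
    """

    def _reap(signum, frame):
        # Collect every child that has exited; never block.
        while True:
            try:
                pid, _status = os.waitpid(-1, os.WNOHANG)
            except ChildProcessError:
                return  # no children left at all
            if pid == 0:
                return  # remaining children are still running

    signal.signal(signal.SIGCHLD, _reap)
```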
All of this needs some more testing, but it seems to be running fine on my Flink cluster at least.
Codecov Report
Merging #16658 (5a91e67) into master (4522f4c) will decrease coverage by `0.00%`. The diff coverage is `7.14%`.
```diff
@@           Coverage Diff            @@
##           master   #16658    +/-  ##
==========================================
- Coverage   73.32%   73.32%   -0.01%
==========================================
  Files         714      714
  Lines       96403    96409       +6
==========================================
+ Hits        70685    70688       +3
- Misses      24398    24401       +3
  Partials     1320     1320
```
| Flag | Coverage Δ | |
|---|---|---|
| python | `83.20% <7.14%> (-0.01%)` | :arrow_down: |
Flags with carried forward coverage won't be shown. Click here to find out more.
| Impacted Files | Coverage Δ | |
|---|---|---|
| ...hon/apache_beam/runners/worker/worker_pool_main.py | `56.32% <7.14%> (-2.94%)` | :arrow_down: |
| .../python/apache_beam/transforms/periodicsequence.py | `98.50% <0.00%> (+1.49%)` | :arrow_up: |
| sdks/python/apache_beam/utils/interactive_utils.py | `97.56% <0.00%> (+2.43%)` | :arrow_up: |
@tvalentyn @ryanthompson591 All right. I think I have it at a point where it's ready for a final review. It runs robustly on a 130-node Flink cluster, all processes are reliably and gracefully cleaned up, and I haven't seen any zombie processes.
(also pinging the folks from the OWNERS file, maybe someone has additional remarks: @herohde @aaltay @charlesccychen)
Current set of reviewers looks great. I do not have much to add. Thanks for adding me.
worker_pool_main changes - R: @y1chi could review.
I added one more change: If a worker exits with a non-zero exit code, the boot binary also exits with a non-zero code. That makes it easier to debug things like Python interpreter crashes, e.g., due to failures in a native extension. That case was very hard to debug previously.
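The exit-code propagation amounts to the following pattern, shown here as a Python stand-in for the Go boot binary (the helper name is illustrative):

```python
import subprocess
import sys


def run_worker(cmd):
    """Run the worker process and mirror its exit status (sketch).

    If the worker dies with a non-zero code (e.g. the interpreter
    crashed inside a native extension), exiting with the same code
    makes the failure visible to whoever launched the boot binary,
    instead of being silently swallowed.
    """
    result = subprocess.run(cmd)
    if result.returncode != 0:
        sys.exit(result.returncode)
```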
I reverted the return-code change, because it turns out Flink likes to send SIGTERM even on successful completion, so it's of no use.
Any reviews? I've been using my patch on our cluster to process 1 billion web pages already, with no issues. I cannot vouch for anything other than Flink, however.
LGTM for worker_pool_main changes
Run Python Dataflow ValidatesContainer
@phoerious please don't squash reviewed and unreviewed code until the review has finalized, see: https://beam.apache.org/contribute/#make-the-reviewers-job-easier
Looks like Dataflow's IT tests on the prior iteration failed with:

```
2022-03-04 18:49:15.451 PST "2022/03/05 02:49:15 Found artifact: dataflow_python_sdk.tar"
2022-03-04 18:49:15.451 PST "2022/03/05 02:49:15 Installing setup packages ..."
2022-03-04 18:49:15.460 PST "2022/03/05 02:49:15 Cleaning up temporary venv ..."
2022-03-04 18:49:15.490 PST "2022/03/05 02:49:15 Failed to install required packages: failed to install SDK: fork/exec /var/opt/google/beam-venv/beam-venv-worker-sdk0/bin/pip: permission denied"
```
Jenkins logs: https://ci-beam.apache.org/job/beam_PostCommit_Py_ValCont_PR/227/ (although these don't contain the Dataflow job logs)
Strange. Does Dataflow provision local file systems that prevent setting the +x flag on files?
Actually, can you point me to the correct logs? I cannot find the error you are referring to. Only seemingly unrelated errors.
Dataflow just spins up a docker container using kubelet, no special filesystem configuration. I am afraid you might not be able to see the logs since they require viewer access to apache-beam-testing GCP project, but if you have a Google account, I can try to add your email, feel free to send me an email.
Let's make sure the error is still reproducible with latest changes
Run Python Dataflow ValidatesContainer
Looks like it indeed still fails. Actually, Dataflow does do some configuration. This may have to do with how the 'semi-persistent' directory is mounted into the container. I think in Dataflow's case it is configured as additional docker volume.
Unrelated: is the PR description up to date?
I updated the description.
@tvalentyn What's the plan now? I can't really debug Google Dataflow on my own.
I'll take a look or find somebody who can help.
@tvalentyn Sorry for nagging, but any news?
@tvalentyn is currently OOO.
@kerrydc do you know if someone else could help?
I spent some time debugging this today and got this far:

The Dataflow container pod for the SDK container has the following mount (from the container manifest):

```json
"volumeMounts": [
  {
    "mountPath": "/var/opt/google",
    "name": "persist"
  },
  ...
"volumes": [
  {
    "hostPath": {
      "path": "/var/opt/google/dataflow"
    },
    "name": "persist"
  }
```

The above maps a persistent directory `/var/opt/google/dataflow` on the host VM into the directory `/var/opt/google` in the running containers. This directory is passed to the SDK harness container in the `--semi_persist_dir=/var/opt/google` param.
We can see this directory referenced in the errors:

```
2022/04/08 20:40:15 Failed to install required packages: failed to install SDK: fork/exec /var/opt/google/beam-venv/beam-venv-worker-sdk-0-0/bin/pip: permission denied
```
I can reproduce this behavior if I manually SSH into a VM created by Dataflow and manually start a container via:

```
docker run -it --entrypoint=/bin/bash -v /var/opt/google/dataflow:/var/opt/google apache/beam_python3.7_sdk
root@beamapp-valentyn-04082037-04081337-w2xj-harness-zq0w:/# python -m venv /var/opt/google/env
root@beamapp-valentyn-04082037-04081337-w2xj-harness-zq0w:/# /var/opt/google/env/bin/pip
bash: /var/opt/google/env/bin/pip: Permission denied
```
On the other hand, creating a venv elsewhere works:

```
root@beamapp-valentyn-04082037-04081337-w2xj-harness-zq0w:/# python -m venv /var/opt/env
root@beamapp-valentyn-04082037-04081337-w2xj-harness-zq0w:/# /var/opt/env/bin/pip

Usage:
  pip <command> [options]
```

Permissions on the files are identical:

```
root@beamapp-valentyn-04082037-04081337-w2xj-harness-zq0w:/# ls -al /var/opt/env/bin/pip
-rwxr-xr-x 1 root root 228 Apr  8 21:29 /var/opt/env/bin/pip
root@beamapp-valentyn-04082037-04081337-w2xj-harness-zq0w:/# ls -al /var/opt/google/env/bin/pip
-rwxr-xr-x 1 root root 235 Apr  8 21:29 /var/opt/google/env/bin/pip
```
but for some reason, when we create a venv in the mounted directory, the `pip` script cannot launch. I am not sure what is going on.
Here's a repro without Dataflow in the picture:

```
gcloud compute instances create valentyn-cos2 --zone=us-central1-f --project=<my gcp project> --image-family cos-stable --image-project=cos-cloud --restart-on-failure
gcloud compute ssh valentyn-cos2 --zone=us-central1-f --project=<my gcp project>
docker run -it --entrypoint=/bin/bash -v /tmp:/var/opt/google python:3.7-bullseye
python -m venv /var/opt/google/env
root@018e7f3cac24:/# /var/opt/google/env/bin/pip
bash: /var/opt/google/env/bin/pip: Permission denied
```
As for what to do next: @phoerious, if you are interested and have access to a GCP project, you could continue investigating what happens in https://github.com/apache/beam/pull/16658#issuecomment-1093407575.
Another avenue is to exclude the DataflowRunner from the new behavior. I think we should be able to cleanly detect the runner by analyzing the pipeline options in https://github.com/apache/beam/blob/2d27f44f58112ca4a5689614a94592de4e28d476/sdks/python/container/boot.go#L138
Another possibility is to try using `python -m pip install` in place of `pip install` throughout. It may help as well.
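That suggestion boils down to invoking pip through the interpreter binary rather than executing the console script directly, e.g. (hypothetical helper, not Beam's API):

```python
import subprocess
import sys


def pip_install(python_bin, args):
    """Invoke pip as a module instead of as a console script (sketch).

    `python -m pip` runs pip through the interpreter executable, so
    it works even when the pip script inside the venv cannot be
    exec'd directly (as on the mounted directory above).
    """
    subprocess.run([python_bin, "-m", "pip", "install", *args], check=True)
```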