ci: support variable number of CPUs for building
Currently, Envoy builds with all available CPUs minus one, which can lead to OOM crashes if the system does not have enough memory.
This PR lets the user pass a -j option to do_ci.sh, which is parsed and used to set the number of CPUs Bazel uses to build and test Envoy.
Usage: ENVOY_DOCKER_BUILD_DIR=build ./ci/run_envoy_docker.sh './ci/do_ci.sh dev -j 24'
Hi @leonematt, welcome and thank you for your contribution.
We will try to review your Pull Request as quickly as possible.
In the meantime, please take a look at the contribution guidelines if you have not done so already.
Thanks for the response. I have a few quick questions:
What is the difference between a job running on a remote vs local machine?
I see that NUM_CPUS is already used as a variable in the build script. Since it is already calculated, would connecting it to job control improve the contributor experience? I am still new to Bazel, so it's good that I am figuring this stuff out, but I imagine it would help to give prospective contributors as little friction as possible, and the -j flag is common across popular projects that use make, CMake, etc.
Do you think it would be better to document that you can change the bazelrc files if the developer runs out of memory, or maybe even implement an optimizer that figures out which resource is the limiting factor for building, CPU or memory, and then picks the number of CPUs based on that comparison?
In a local setup, jobs = number of CPUs.
In a remote setup, jobs = number of workers (irrespective of their core count).
I am still new to Bazel, so it's good that I am figuring this stuff out,
I'm not exactly new to it, but still figuring this stuff out 8/
Do you think it would be better to document that you can change the bazelrc files ...
Yeah, I think this is probably the best we can do. I've just been testing locally, and even with --jobs=200 it's plodding along nicely using all my cores (16), so I'm really not sure how we could set this any better.
I'm definitely aware of the issue - I think someone else may have asked about it recently, and I'm pretty sure I've hit it from time to time.
Unfortunately our dev docs are a bit scattered - there are a couple of unrelated mentions of user.bazelrc - perhaps we could add something explicit/prominent about OOMing and reducing cores/jobs:
bazel/README.md: echo "build --config=clang" >> user.bazelrc
bazel/README.md:If you want to make libc++ as default, add a line `build --config=libc++` to the `user.bazelrc` file in Envoy source root.
bazel/README.md:You may persist those options in `user.bazelrc` in Envoy repo or your `.bazelrc`.
ci/README.md:1. add bazel options like `--remote_cache` and/or `--remote_cache_header` to a `.bazelrc` file, such as `user.bazelrc`. For example
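For example, along the lines of the snippets above, something like this in user.bazelrc would cap concurrency to avoid OOM (the numbers are purely illustrative and depend on the machine):

    # Values below are illustrative only; tune them for your machine.
    echo "build --jobs=8" >> user.bazelrc
    # Optionally also cap the local resources Bazel schedules work against.
    echo "build --local_cpu_resources=8" >> user.bazelrc
    echo "build --local_ram_resources=HOST_RAM*.5" >> user.bazelrc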
I see. Let me get back to you on this soon. I didn't even know distributed builds were a thing, so this introduces some complexity I need to figure out.
For your local build machine, how many gigabytes of RAM do you have? It might be that your builds generally succeed but sometimes fail: your system can only run 16 processes at a time, and you have enough RAM that you only run out of memory infrequently.
Could you also do me a favor and run a local build with 20 jobs and tell me the run time, using time or perhaps one of Bazel's timing features? That would let us figure out whether you benefit from running 200 jobs locally versus 20.
I ran htop while my builds were running to figure out why they were failing: all 32 of my CPUs were saturated, but my machine only has 32 GB of RAM, and I could see that filling up and then the build failing.
If my thinking is correct, we should implement a solution so that developers cannot fail to build a freshly cloned repo.
... how many gigabytes of RAM do you have? It might be that your builds generally succeed ...
32 GB and 16 cores - I think Bazel can take 3-4 GB per core (IIRC). There are some large jobs; most are pretty small memory-wise, but if you get the memory hogs running at the same time then it can OOM.
To be fair, I mostly build using the project RBE, as I'm working on the project.
Could you also do me a favor and run a local build with 20 jobs and tell me the run time, using time or perhaps ...
I'm afraid I don't have the cycles for this, and there are way too many variables in play for it to be of much use.
Envoy is painful to build at first; once you have a cache it gets a lot easier.
I understand. Do you think it would be better to change this to a dynamic limiter that picks the maximum resources while still guaranteeing that builds succeed? I think that would bring resiliency to the project, and users would no longer be confused about local vs. remote builds.
I think this is waiting on you @phlax
I'm not sure it's the right solution for the right problem.
Right now it uses all CPUs minus one. I'm not sure this would actually change that - it would just prevent it from being applied unless you used the ci/ path, which is kind of the opposite of what we might want to change (not convinced we do want to change anything here, though).
I honestly think the best path forward is to add better info about customizing Bazel options with user.bazelrc, particularly with respect to OOMing.
I just don't think there is a perfect solution here. Given the time it takes to build Envoy, and also for devs (re)building specific components, you mostly want to use all available resources.
That can lead to OOM if your CPU count is high relative to memory, or if for whatever reason you trigger the most memory-hungry builds simultaneously - but I think we have to balance that against the more common case of just wanting to use whatever you have available.
I agree that we should use whatever resources are available. My proposed solution isn't to run while leaving resources on the table; it is to run with the maximum amount of resources while still allowing the build to succeed. Does running with all of the system's resources matter if the build fails? That is why I proposed finding the limiting hardware resource, be it CPU, memory, storage, etc., and configuring the system to maximize resource usage while still allowing the build to succeed. It's "Hey, we have a ton of CPUs but not enough memory to build with this many jobs running, so let's set the number of CPUs to what the system can handle while still being able to build."
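A rough sketch of what that heuristic could look like, assuming the roughly 3-4 GB-per-job figure mentioned above (the 4 GB constant, the variable names, and the build target are purely illustrative, not part of this PR):

    # Hypothetical sketch only: pick a job count from the limiting resource,
    # assuming roughly 4 GB of RAM per concurrent job.
    num_cpus=$(nproc)
    mem_gb=$(awk '/MemTotal/ {print int($2 / 1024 / 1024)}' /proc/meminfo)
    jobs_by_mem=$(( mem_gb / 4 ))
    # Use whichever resource is the bottleneck, but always run at least one job.
    jobs=$(( num_cpus < jobs_by_mem ? num_cpus : jobs_by_mem ))
    [ "$jobs" -lt 1 ] && jobs=1
    # The target here is just an example.
    bazel build --jobs="$jobs" //source/exe:envoy-static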
...so let's set the number of CPUs to what the system can handle while still being able to build.
The problem is that nearly all the jobs take very little memory - it's really only a few build jobs that use a lot - so for the most part you just want to run on all cores. It's really only when it tries to build more than one of these memory-hungry jobs at once that it's an issue, I think.
TL;DR - I think our defaults are pretty sane, and any resolution should happen in .bazelrc (if at all), as then it doesn't rely on using the CI path to be triggered. In this case the best solution really is to customize, either by adding flags on the command line, setting the relevant env var, or, if you always want to build a particular way, putting it in user.bazelrc.
Saying all that, I really do think this could be better documented.
Alrighty then, I'll submit a documentation update about this issue.
Thanks @leonematt! That would be really appreciated.
I should also mention - we recently landed a PR to optimize foreign_cc builds (i.e. make/CMake and other non-Bazel builds).
The optimization is to use all cores - which in some cases these might do already - and these are pretty high on the candidate list for jobs that are likely to eat memory.
So that our CI optimizations don't make things worse in this respect, I added a new flag, --//bazel/foreign_cc:parallel_builds, which defaults to false but will assertively use all cores for these jobs if set to true.
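For example, to opt in for a local run (the flag name is quoted from above; the build target is just illustrative):

    bazel build --//bazel/foreign_cc:parallel_builds=true //source/exe:envoy-static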
Interesting.
The optimization is to use all cores - which in some cases these might do already - and these are pretty high on the candidate list for jobs that are likely to eat memory.
Yeah, the local build system does this by default.
So that our CI optimizations don't make things worse in this respect, I added a new flag, --//bazel/foreign_cc:parallel_builds, which defaults to false but will assertively use all cores for these jobs if set to true.
What exactly does this flag do when false and when true? Does this flag allow build configurations to be tailored to available hardware resources on the nodes?
I imagine it might be more descriptive to use a variable like number_build_nodes to indicate that you are using a distributed build system, but the parallel_builds variable also makes sense.
Yeah, the local build system does this by default.
I'm not 100% clear about this (either way). From my limited testing, at least with CMake jobs, I had to add the -j flag to get it to use all cores - it may depend on what the CMake recipe itself does; not sure.
Likewise, I'm not sure how it handles other types, e.g. make - I think the default there is single-core.
... Does this flag allow build configurations to be tailored to available hardware resources on the nodes? ... use a variable like number_build_nodes ... ?
At the moment it uses a bool flag, as that is just easier to implement, and my reason for adding it was slightly different.
Further, it has a single flag for all foreign jobs, again for reasons of simplicity.
bazel-skylib does have an int_flag, so we could probably use that if we wanted to specify the number of cores, potentially with separate flags for any related tasks.
Thinking through the current implementation, I'm remembering why I did it this way, i.e. 1 or all ...
In the RBE/CI context we want all; similarly for local dev that is making cache-blowing changes to one or more foreign builds.
In the don't-OOM-my-machine context - e.g. an end user doing a full build - Bazel is already taking all cores (other than the one allocated to the foreign build), so you generally just want this set to 1.
/wait-any
Waiting for any actions from @leonematt
This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!
Hi @phlax, my apologies for the delayed response on this PR. I've been swamped with work and appreciate your patience. I had time this morning to finish the documentation update.
Thanks for working on this @leonematt.
I've added a few cleanups/clarifications/suggestions.
Thanks for the review. I just added the changes. :)
Hi @agrawroh
Thank you so much! Please don't merge yet, because I want to correct the PR description and git commit history. Once I do that I will resolve the comments you made, and we should be good to go!
Hi @agrawroh and @phlax
It is good to merge now that I have updated the PR message and the commit message. Thank you so much!
Kindly pinging @phlax - sounds like this is ready to go?
AFAICT feedback still needs to be addressed.
/wait-any
Hi @phlax, I totally missed @agrawroh's comments. I have just implemented them. Please let me know if anything else needs to change in the documentation.
/retest