
generate the test pipeline if the build job requires it with a specific runner

Open yhmtsai opened this issue 1 year ago • 21 comments

This PR adds flexibility for the case where horeka is down.

I have tried several possible approaches, but they do not always work as I initially expected:

  1. (Works) Check the server status: run the two build/test jobs via Slurm, or build+test on our own server as a fallback if horeka is not reachable. A preceding job checks the reachability of horeka before launching the horeka-related jobs. When using `when: on_failure` together with `needs`, a job is skipped if any of the jobs it needs was skipped, so the fallback can only rely on the check job. https://gitlab.com/gitlab-org/gitlab/-/issues/388866 https://gitlab.com/gitlab-org/gitlab/-/issues/36610
  2. (Should work, but wastes resources) Unified build between Docker and Slurm, selecting the build script based on CI_RUNNER_TAGS. Due to the current Slurm runner design, GitLab passes the whole script to the runner, so build and test would both run on the GPU partition, which is not good for the resource budget. It is not clear how easy it would be to instead have the script write a gpu_script to the cache, so that the Slurm runner runs the generated script as an additional GPU job.
  3. (Does not work) Dynamic runner tags, https://docs.gitlab.com/ee/ci/yaml/#tags: `tags` supports CI/CD variables, but the value is fixed when the job is created, not when it runs. After creating the manual job, resetting RUNNER_TAG works, but it does not affect runner selection.
     - private_ci and $RUNNER_TAG are used in `tags` in https://gitlab.com/ginkgo-project/ginkgo-public-ci/-/jobs/4141132266: in the panel on the right side, the tags are still private_ci and status-jobs, even though RUNNER_TAG is shown as nla-gpu.
     - In https://gitlab.com/ginkgo-project/ginkgo-public-ci/-/jobs/4141192635: if we do not define a default RUNNER_TAG, the plain text ${RUNNER_TAG} is used directly and still cannot be overwritten for runner selection.
     - A pipeline variable would work, because it is overwritten before job creation. However, we do not know the server status before job creation (unless we add an additional pipeline that triggers the current pipeline).
     - The same holds for `rules`: once the pipeline is generated, the decision whether to create a job has already been made.
     - `rules` cannot depend on artifacts, so we cannot use an artifact like a variable to select the job: https://gitlab.com/gitlab-org/gitlab/-/issues/215100
  4. (Does not work for the GitHub status) Run manual jobs one by one according to CI_RUNNER_TAGS: keep the corresponding test job in manual mode and, if the build job requires it based on CI_RUNNER_TAGS, start the test job via the API. However, a manual job either has allow_failure or not: if it allows failure, we will not see its result in the GitHub status; if it does not, the pipeline shows as blocked (failed/pending) whenever the job is not executed.
  5. (Works) Create a child test pipeline. If a build job requests a specific runner for its tests, it writes a small job definition into a yml file, so each job can use a different runner. A later job collects these files and generates the child pipeline (a minimal sketch is shown right after this list). Note: if no job is generated, a dummy job is added to avoid the empty-pipeline issue. Note: to upload artifacts, the host server also needs gitlab-runner installed.
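For illustration, here is a minimal sketch of the option 5 flow, assuming hypothetical job names, scripts, runner tags, and file names (build/cuda, ./build.sh, ./test.sh, horeka-test, amdci, test_cuda_fragment.yml); the actual jobs in this PR may look different:

```yaml
# Hypothetical sketch of option 5 (child test pipeline); all names are placeholders.
stages:
  - build
  - generate
  - test

build/cuda:
  stage: build
  script:
    - ./build.sh
    # write a small test-job fragment pinned to a runner chosen from the
    # tags of the runner that picked up this build (runtime decision)
    - |
      case "${CI_RUNNER_TAGS}" in
        *horeka*) TEST_TAG=horeka-test ;;
        *)        TEST_TAG=amdci ;;
      esac
      cat > test_cuda_fragment.yml << EOF
      test/cuda:
        tags: ["${TEST_TAG}"]
        script: ["./test.sh"]
      EOF
  artifacts:
    paths:
      - test_cuda_fragment.yml

generate_test_pipeline:
  stage: generate
  script:
    # collect all fragments; fall back to a dummy job, because GitLab
    # rejects an empty child pipeline
    - cat test_*_fragment.yml > test_pipeline.yml || true
    - |
      if [ ! -s test_pipeline.yml ]; then
        printf 'dummy:\n  script:\n    - "true"\n' > test_pipeline.yml
      fi
  artifacts:
    paths:
      - test_pipeline.yml

run_test_pipeline:
  stage: test
  trigger:
    include:
      - artifact: test_pipeline.yml
        job: generate_test_pipeline
    strategy: depend
```

The trigger job uses `strategy: depend`, so the parent pipeline (and hence the GitHub status) reflects whether the generated child pipeline succeeded.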

TODO:

  • [x] clean it up and apply it to all related jobs

yhmtsai avatar Apr 18 '23 13:04 yhmtsai

Actually, a unified build is not too hard because gitlab provides CI_RUNNER_TAGS. Thus, we can select the Slurm script when CI_RUNNER_TAGS contains horeka or slurm, and the normal script otherwise. However, it would require merging build and test on horeka. @tcojean

KEEP_CONTAINER: "ON"
USE_SLURM: 0

are only used for the build

SLURM_PARTITION: "accelerated"
SLURM_GRES: "gpu:1"
SLURM_TIME: "02:00:00"

are only used for the tests. They will not affect each other, right?
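As a rough sketch of that tag-based selection (the script names build_slurm.sh / build_docker.sh are hypothetical placeholders, not the actual CI files):

```yaml
# Hypothetical sketch: choose the build script from the tags of the runner
# that picked up the job, via the predefined CI_RUNNER_TAGS variable.
build/unified:
  script:
    - |
      case "${CI_RUNNER_TAGS}" in
        *horeka*|*slurm*) ./build_slurm.sh ;;   # Slurm runner: build (and test) on horeka
        *)                ./build_docker.sh ;;  # regular Docker runner
      esac
```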

yhmtsai avatar Apr 18 '23 14:04 yhmtsai

Kudos, SonarCloud Quality Gate passed!

Bug: A (0 Bugs)
Vulnerability: A (0 Vulnerabilities)
Security Hotspot: A (0 Security Hotspots)
Code Smell: A (0 Code Smells)

No Coverage information
No Duplication information

sonarqubecloud[bot] avatar Apr 21 '23 17:04 sonarqubecloud[bot]

Codecov Report

Patch coverage: 100.00% and project coverage change: -0.03% :warning:

Comparison is base (914def9) 91.35% compared to head (64bde25) 91.33%.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #1324      +/-   ##
===========================================
- Coverage    91.35%   91.33%   -0.03%     
===========================================
  Files          584      584              
  Lines        49446    49412      -34     
===========================================
- Hits         45170    45129      -41     
- Misses        4276     4283       +7     
| Impacted Files | Coverage Δ |
|---|---|
| core/test/solver/multigrid.cpp | 88.70% <ø> (-0.90%) :arrow_down: |
| include/ginkgo/core/solver/multigrid.hpp | 100.00% <ø> (ø) |
| core/solver/multigrid.cpp | 89.01% <100.00%> (-0.16%) :arrow_down: |
| reference/test/solver/multigrid_kernels.cpp | 92.61% <100.00%> (-0.17%) :arrow_down: |

... and 1 file with indirect coverage changes


codecov[bot] avatar Apr 21 '23 17:04 codecov[bot]

Would it not be better to have the variables come from the job itself? I don't see why the additional complexity of first filling in a .yml file and then substituting the variables with a script is necessary.

@pratikvn Could you elaborate on it?

yhmtsai avatar Apr 27 '23 08:04 yhmtsai

If I understand what you are doing, in the test/generate_test_pipeline you generate the yml files based on whether horeka is available or not and fill in the variables. I am wondering if it is possible to have the yml files generated beforehand (one for horeka and one for the amdci) and just directly run the test jobs without the generate_test_pipeline. This would also avoid the varsub stuff in the unified_template?

pratikvn avatar Apr 27 '23 09:04 pratikvn

Generating the jobs is based on which runner picks the build job up (at runtime), so some jobs can run on horeka, some on amdci, and some on other machines; generate_test_pipeline only combines the files from the individual jobs. Each job following the build_job could be pre-generated, but generate_test_pipeline still needs to decide which jobs to run. If I understand correctly, the approach you describe is like the first option: check the availability of horeka and then run the corresponding pipeline (all jobs of a pipeline on one machine). Once we have more machines using Slurm, we would also need a work dispatcher for the different machines in that checking step. We cannot instantiate the test jobs directly, because a job currently cannot use variables or files from a previous job of the same pipeline before instantiation. The horeka and amdci jobs are already in the queue when the pipeline is generated, so the horeka test job would fail or hang if horeka is unavailable.

yhmtsai avatar Apr 27 '23 10:04 yhmtsai

My point is mainly that we know the variables we need beforehand. Instead of having the job that generates the yml file fill in the variables, we could just select the file for the appropriate machine that we have already prepared in the repo.

But I guess this is more of a personal preference than a real concern, so I am happy with your current approach.

pratikvn avatar Apr 27 '23 13:04 pratikvn

Okay, so you would like to have pre-generated job yml files. We still need some runtime variable replacement for USE_NAME, which is related to the build_job pipeline id; everything else can be pre-generated for the different machines. We could pre-generate these as A-horeka, B-horeka, C-horeka, and A-other, B-other, C-other once another Slurm machine is added, and, based on the runner tag, choose A-horeka or an empty file. A-horeka, B-horeka, C-horeka need to be separate files, or I need to write a script to collect the relevant job parts. I would still need to change USE_NAME in build_job, but the rest can be generated statically in this way.
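A small sketch of that variant, assuming a hypothetical pregenerated/test-horeka.yml template with a @USE_NAME@ placeholder (the real file layout and variable handling may differ):

```yaml
generate_test_pipeline:
  stage: generate
  script:
    # pick the statically pre-generated file for the chosen runner ...
    - cp pregenerated/test-horeka.yml test_pipeline.yml
    # ... and substitute only the pipeline-specific USE_NAME at runtime
    - sed -i "s/@USE_NAME@/build-${CI_PIPELINE_ID}/g" test_pipeline.yml
  artifacts:
    paths:
      - test_pipeline.yml
```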

yhmtsai avatar Apr 27 '23 13:04 yhmtsai

I see. Then let's just leave it as it is. Thanks for the clarifications.

pratikvn avatar Apr 27 '23 13:04 pratikvn

> Actually, a unified build is not too hard because gitlab provides CI_RUNNER_TAGS. Thus, we can select the Slurm script when CI_RUNNER_TAGS contains horeka or slurm, and the normal script otherwise. However, it would require merging build and test on horeka. @tcojean
>
> KEEP_CONTAINER: "ON"
> USE_SLURM: 0
>
> are only used for the build
>
> SLURM_PARTITION: "accelerated"
> SLURM_GRES: "gpu:1"
> SLURM_TIME: "02:00:00"
>
> are only used for the tests. They will not affect each other, right?

To answer this: if you merge build and test, that means you want to run both on a compute node? In that case, you will only need the second part (the one for the tests). USE_SLURM: 0 is there to run the build on the login node, and KEEP_CONTAINER: "ON" keeps the container in storage so that the built image can be reused for the tests.
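So a merged build+test job on a compute node would only keep the test-side variables, roughly like this sketch (job name and scripts are placeholders):

```yaml
build_and_test/cuda:
  variables:
    SLURM_PARTITION: "accelerated"
    SLURM_GRES: "gpu:1"
    SLURM_TIME: "02:00:00"
    # USE_SLURM: 0 (build on the login node) and KEEP_CONTAINER: "ON" (reuse the
    # image in a separate test job) are not needed when everything runs in one job
  script:
    - ./build.sh
    - ./test.sh
```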

tcojean avatar Jun 16 '23 12:06 tcojean

I fully agree with merging build and test jobs, and if we use a full compute node, the builds also become blazingly fast :sparkles: That also saves us the complexity of interdependent jobs

upsj avatar Jun 16 '23 13:06 upsj

@tcojean Yes, merging them is easy. However, the billing and the limit on computing resources were the issue raised against merging when I asked about this usage. That's why I went with the generated pipeline in the end.

yhmtsai avatar Jun 19 '23 11:06 yhmtsai

Do we want to keep working on this, or can this PR be closed?

MarcelKoch avatar Jul 04 '23 14:07 MarcelKoch

Asking again, will this be updated, or should it be closed? If there is no comment in 1 week, I will close this PR.

MarcelKoch avatar Sep 12 '23 07:09 MarcelKoch

IMO we should be aiming to make our pipelines more uniform, so that they work across different gitlab runner executors and hardware platforms. My current plans for the CI setup go in the same direction, where this change would become superfluous.

upsj avatar Sep 12 '23 07:09 upsj

If we did not have a limit on the CI resources from the Slurm system, a unified build would be easy and good, as in the second point I mentioned. Unfortunately, we have limited resources in the project for our CI pipeline, and I have not found a better way to handle the two-stage and one-stage builds uniformly.

yhmtsai avatar Sep 12 '23 07:09 yhmtsai

We have so far used only 24% of our CPU time and 80% of our GPU time - CPU time is not our limiting factor here, so it doesn't hurt to simplify our CI pipelines.

upsj avatar Sep 12 '23 08:09 upsj

These numbers are based on the build and test times from the current develop pipeline https://gitlab.com/ginkgo-project/ginkgo-public-ci/-/pipelines/986527225:

  • with MPI: build 27 min, test 58 min -> requires 46% more if we combine them
  • clang without MPI: build 33 min, test 8 min (why is it so fast?) -> requires ~400% more
  • intel without MPI: build 43 min, test 70 min -> requires 61% more

Building on a compute node may need less compile time, but it may also need less time for testing with ctest resources. From the numbers you gave, it requires about 30% more (CPU time for the build and GPU time for testing, I assume? although the sum is not 100%).
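In other words, the percentages above are the build time expressed as a fraction of the test time, i.e. the fraction of additional time the GPU partition would be occupied if build and test ran in one job (matching the ~46%, ~400%, and ~61% figures):

$$\frac{t_\mathrm{build}}{t_\mathrm{test}}:\qquad \frac{27}{58}\approx 0.47,\qquad \frac{33}{8}\approx 4.1,\qquad \frac{43}{70}\approx 0.61$$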

yhmtsai avatar Sep 12 '23 10:09 yhmtsai

I don't think these numbers provide any useful information - they are for shared builds on a login node with something like 8 threads, while a full node has > 100 cores. Together with ctest resources, I expect this to cut down even our slowest build/test executions to < 30 min, release builds probably even < 10 min. On top of Horeka, we also have FTP resources, which don't cut into our hours at all.

upsj avatar Sep 12 '23 10:09 upsj

I know these numbers are not representative of combining build/test on compute nodes, but they are the only real numbers we have to estimate the situation. As I said, building on a compute node may reduce the build time, but ctest resources may also reduce the testing time. I do not have access to the project on horeka, so I do not know the real numbers there.

yhmtsai avatar Sep 12 '23 10:09 yhmtsai

Putting some numbers behind it, building Release with CUDA 11.8 + GCC 11 on Horeka takes ~400s and running all tests with ctest resources also takes ~400-430 s. (debug build performance should be similar, with tests probably at least 2-4x slower)

upsj avatar Sep 12 '23 12:09 upsj