[CI] Investigating CI Runners
The self-hosted runners were down for the last couple of days and have only now come back up. I wanted to investigate any anomalies in the CI logs to see whether any of the test cases we run may have caused the outage.
The motivation for this investigation is the following message, produced by the CI while the self-hosted runners were down:
> The self-hosted runner: gh-actions-runner-vtr_XX lost communication with the server. Verify the machine is running and has a healthy network connection. Anything in your workflow that terminates the runner process, starves it for CPU/Memory, or blocks its network access can cause this error.
I went through the logs of the last successful nightly test on the master branch ( https://github.com/verilog-to-routing/vtr-verilog-to-routing/actions/runs/9932067866 ). Here are the results for the jobs that ran on the self-hosted runners (this data was collected from the figures at the bottom of the logs). I also recorded each job's run time, since I thought it may be valuable.
Job Name | Average CPU Usage (%) | Max RAM Usage (GB) | Max /dev/sda2 Usage (GB) | Max eth0 Data Received (Mb/s) | Max eth0 Data Sent (Mb/s) | Test Run Time |
---|---|---|---|---|---|---|
"Capacity" | 100 | 125.82 | 492.0 | - | - | - |
Run-tests (vtr_reg_nightly_test1, 16) | 32.58 | 6.74 | 32.4 | 577.73 | 69.72 | 2h 24m 25s |
Run-tests (vtr_reg_nightly_test1_odin, 16, -DWITH_ODIN=ON) | 43.56 | 7.57 | 40.89 | 546.61 | 64.28 | 3h 5m 12s |
Run-tests (vtr_reg_nightly_test2, 16) | 48.09 | 100.03 | 97.97 | 630.83 | 33.65 | 4h 20m 20s |
Run-tests (vtr_reg_nightly_test2_odin, 16, -DWITH_ODIN=ON) | 54.27 | 98.35 | 97.88 | 789.33 | 64.23 | 3h 39m 22s |
Run-tests (vtr_reg_nightly_test3, 16) | 64.54 | 16.66 | 33.2 | 551.44 | 69.02 | 2h 0m 3s |
Run-tests (vtr_reg_nightly_test3_odin, 16, -DWITH_ODIN=ON) | 45.93 | 11.81 | 39.16 | 789.53 | 44.53 | 3h 9m 4s |
Run-tests (vtr_reg_nightly_test4, 16) | 44.29 | 53.45 | 49.67 | 789.48 | 41.61 | 3h 15m 17s |
Run-tests (vtr_reg_nightly_test4_odin, 16, -DWITH_ODIN=ON) | 46.42 | 14.11 | 37.86 | 554.0 | 33.6 | 1h 15m 5s |
Run-tests (vtr_reg_nightly_test5, 16) | 47.6 | 85.94 | 38.1 | 789.78 | 58.08 | 3h 20m 28s |
Run-tests (vtr_reg_nightly_test6, 16) | 19.02 | 74.72 | 32.35 | 692.52 | 6.35 | 4h 15m 20s |
Run-tests (vtr_reg_nightly_test7, 16) | 66.39 | 38.6 | 35.68 | 556.99 | 36.52 | 50m 9s |
Run-tests (vtr_reg_strong, 16, -DVTR_ASSERT_LEVEL=3, libeigen3-dev) | 42.67 | 8.13 | 5.76 | 507.23 | 64.27 | 15m 10s |
Run-tests (vtr_reg_strong_odin, 16, -DVTR_ASSERT_LEVEL=3 -DWITH_ODIN=ON, libeigen3-dev) | 31.59 | 7.71 | 32.27 | 582.84 | 50.31 | 19m 52s |
Run-tests (vtr_reg_strong_odin, 16, -skip_qor, -DVTR_ASSERT_LEVEL=3 -DVTR_ENABLE_SANITIZE=ON -DWI... | 63.43 | 20.03 | 32.31 | 756.76 | 56.57 | 1h 4m 28s |
Run-tests (vtr_reg_system_verilog, 16, -DYOSYS_F4PGA_PLUGINS=ON) | 29.12 | 8.96 | 32.32 | 789.53 | 12.13 | 22m 18s |
Run-tests (odin_reg_strong, 16, -DWITH_ODIN=ON) | 7.59 | 17.07 | 15.74 | 286.56 | 12.24 | 1h 1m 3s |
Run-tests (parmys_reg_strong, 16, -DYOSYS_F4PGA_PLUGINS=ON) | 3.66 | 26.3 | 31.58 | 789.63 | 10.23 | 2h 47m 31s |
The biggest thing that catches my eye is that the RAM usage for some of the tests is very close to (what I think is) the capacity of the machine (125 GB). This is driven by each job running its tests with 16 cores in parallel. I doubt this is what caused the outage, since we still have some headroom; a rough estimate is sketched below.
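To make the headroom concrete, here is a minimal back-of-the-envelope sketch (not part of the CI) using the numbers from the table above. It assumes peak RAM scales roughly linearly with the number of parallel tasks, which is my own simplifying assumption, not something measured.

```python
# Rough headroom estimate for the heaviest suites, using figures from the table.
# Assumption (mine): peak RAM scales roughly linearly with the -j parallelism level.

MACHINE_RAM_GB = 125.82  # "Capacity" row above

# Observed peak RAM (GB) at 16 parallel tasks, from the table.
peak_ram_at_j16 = {
    "vtr_reg_nightly_test2": 100.03,
    "vtr_reg_nightly_test2_odin": 98.35,
    "vtr_reg_nightly_test5": 85.94,
}

def estimated_peak(suite: str, jobs: int) -> float:
    """Estimate peak RAM (GB) if the suite ran with `jobs` parallel tasks."""
    per_task = peak_ram_at_j16[suite] / 16  # linear-scaling assumption
    return per_task * jobs

for suite, peak in peak_ram_at_j16.items():
    headroom = MACHINE_RAM_GB - peak
    print(f"{suite}: peak {peak:.1f} GB at 16 tasks, "
          f"headroom {headroom:.1f} GB, "
          f"~{estimated_peak(suite, 12):.1f} GB estimated at 12 tasks")
```

Under that assumption, dropping a heavy suite from 16 to 12 parallel tasks would buy roughly 25 GB of extra headroom, at the cost of a longer run time.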
I also noticed that a few tests take much longer than others. Just something to note.
My biggest concern is that, since some of these jobs are so close to the limit, changes people make in their PRs while developing may cause issues for the CI. For example, if someone accidentally introduces a memory leak and pushes the code without testing locally, it may bring down the CI. This does not appear to be what happened here, since the last run of the CI succeeded without such issues.
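Purely as an illustration of the kind of guard that could help with this concern (not something the workflows do today), here is a hedged Python sketch that launches a test command under a per-process memory cap, so a leaking process is killed by the kernel instead of starving the runner. The `./run_reg_test.py` invocation and the 110 GB cap are placeholder values of mine, not taken from the actual workflows.

```python
# Sketch: run a test command with an address-space limit (Linux/POSIX only).
# RLIMIT_AS is per process and inherited by children, so this caps any single
# leaking process rather than the aggregate of all 16 parallel tasks; a cgroup
# or systemd-run limit would be needed to cap the whole job.
import resource
import subprocess

MEMORY_CAP_BYTES = 110 * 1024**3  # placeholder: leave headroom below the 125 GB capacity

def limit_memory():
    # Applied in the child process before exec.
    resource.setrlimit(resource.RLIMIT_AS, (MEMORY_CAP_BYTES, MEMORY_CAP_BYTES))

proc = subprocess.Popen(
    ["./run_reg_test.py", "vtr_reg_nightly_test2", "-j", "16"],  # placeholder invocation
    preexec_fn=limit_memory,
)
proc.wait()
print(f"test exited with code {proc.returncode}")
```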
I wanted to raise this investigation as an issue to see what people think.