[CI] Investigating CI Runners
The self-hosted runners were down for the last couple of days and have only now come back up. I wanted to investigate any anomalies in the CI logs to see whether any of the test cases we run may have caused the outage.
The motivation for this investigation is the following message, produced by the CI while the self-hosted runners were down:
> The self-hosted runner: gh-actions-runner-vtr_XX lost communication with the server. Verify the machine is running and has a healthy network connection. Anything in your workflow that terminates the runner process, starves it for CPU/Memory, or blocks its network access can cause this error.
I went through the logs of the last successful nightly test on the master branch ( https://github.com/verilog-to-routing/vtr-verilog-to-routing/actions/runs/9932067866 ). Here are the results for the jobs that ran on the self-hosted runners (this data was collected from the figures at the bottom of the logs). I also recorded each job's run time, since I thought it may be valuable.
Job Name | Average CPU Usage (%) | Max RAM Usage (GB) | Max /dev/sda2 Usage (GB) | Max eth0 Data Received (Mb/s) | Max eth0 Data Sent (Mb/s) | Test Run Time |
---|---|---|---|---|---|---|
"Capacity" | 100 | 125.82 | 492.0 | - | - | - |
Run-tests (vtr_reg_nightly_test1, 16) | 32.58 | 6.74 | 32.4 | 577.73 | 69.72 | 2h 24m 25s |
Run-tests (vtr_reg_nightly_test1_odin, 16, -DWITH_ODIN=ON) | 43.56 | 7.57 | 40.89 | 546.61 | 64.28 | 3h 5m 12s |
Run-tests (vtr_reg_nightly_test2, 16) | 48.09 | 100.03 | 97.97 | 630.83 | 33.65 | 4h 20m 20s |
Run-tests (vtr_reg_nightly_test2_odin, 16, -DWITH_ODIN=ON) | 54.27 | 98.35 | 97.88 | 789.33 | 64.23 | 3h 39m 22s |
Run-tests (vtr_reg_nightly_test3, 16) | 64.54 | 16.66 | 33.2 | 551.44 | 69.02 | 2h 0m 3s |
Run-tests (vtr_reg_nightly_test3_odin, 16, -DWITH_ODIN=ON) | 45.93 | 11.81 | 39.16 | 789.53 | 44.53 | 3h 9m 4s |
Run-tests (vtr_reg_nightly_test4, 16) | 44.29 | 53.45 | 49.67 | 789.48 | 41.61 | 3h 15m 17s |
Run-tests (vtr_reg_nightly_test4_odin, 16, -DWITH_ODIN=ON) | 46.42 | 14.11 | 37.86 | 554.0 | 33.6 | 1h 15m 5s |
Run-tests (vtr_reg_nightly_test5, 16) | 47.6 | 85.94 | 38.1 | 789.78 | 58.08 | 3h 20m 28s |
Run-tests (vtr_reg_nightly_test6, 16) | 19.02 | 74.72 | 32.35 | 692.52 | 6.35 | 4h 15m 20s |
Run-tests (vtr_reg_nightly_test7, 16) | 66.39 | 38.6 | 35.68 | 556.99 | 36.52 | 50m 9s |
Run-tests (vtr_reg_strong, 16, -DVTR_ASSERT_LEVEL=3, libeigen3-dev) | 42.67 | 8.13 | 5.76 | 507.23 | 64.27 | 15m 10s |
Run-tests (vtr_reg_strong_odin, 16, -DVTR_ASSERT_LEVEL=3 -DWITH_ODIN=ON, libeigen3-dev) | 31.59 | 7.71 | 32.27 | 582.84 | 50.31 | 19m 52s |
Run-tests (vtr_reg_strong_odin, 16, -skip_qor, -DVTR_ASSERT_LEVEL=3 -DVTR_ENABLE_SANITIZE=ON -DWI... | 63.43 | 20.03 | 32.31 | 756.76 | 56.57 | 1h 4m 28s |
Run-tests (vtr_reg_system_verilog, 16, -DYOSYS_F4PGA_PLUGINS=ON) | 29.12 | 8.96 | 32.32 | 789.53 | 12.13 | 22m 18s |
Run-tests (odin_reg_strong, 16, -DWITH_ODIN=ON) | 7.59 | 17.07 | 15.74 | 286.56 | 12.24 | 1h 1m 3s |
Run-tests (parmys_reg_strong, 16, -DYOSYS_F4PGA_PLUGINS=ON) | 3.66 | 26.3 | 31.58 | 789.63 | 10.23 | 2h 47m 31s |
The biggest thing that catches my eye is that the RAM usage for some of the tests is very close to (what I think is) the capacity of the machine (125 GB). This is driven by each job running its tests with 16 cores in parallel. I doubt this is what caused the outage, since we still have some headroom; a rough estimate is sketched below.
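To make the headroom concrete, here is a minimal back-of-the-envelope sketch (not part of the CI) using the numbers from the table above. It assumes peak RAM scales roughly linearly with the number of parallel tasks, which is my own simplifying assumption, not something measured.

```python
# Rough headroom estimate for the heaviest suites, using figures from the table.
# Assumption (mine): peak RAM scales roughly linearly with the -j parallelism level.

MACHINE_RAM_GB = 125.82  # "Capacity" row above

# Observed peak RAM (GB) at 16 parallel tasks, from the table.
peak_ram_at_j16 = {
    "vtr_reg_nightly_test2": 100.03,
    "vtr_reg_nightly_test2_odin": 98.35,
    "vtr_reg_nightly_test5": 85.94,
}

def estimated_peak(suite: str, jobs: int) -> float:
    """Estimate peak RAM (GB) if the suite ran with `jobs` parallel tasks."""
    per_task = peak_ram_at_j16[suite] / 16  # linear-scaling assumption
    return per_task * jobs

for suite, peak in peak_ram_at_j16.items():
    headroom = MACHINE_RAM_GB - peak
    print(f"{suite}: peak {peak:.1f} GB at 16 tasks, "
          f"headroom {headroom:.1f} GB, "
          f"~{estimated_peak(suite, 12):.1f} GB estimated at 12 tasks")
```

Under that assumption, dropping a heavy suite from 16 to 12 parallel tasks would buy roughly 25 GB of extra headroom, at the cost of a longer run time.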
I also noticed that a few tests take much longer than others. Just something to note.
My biggest concern is that, since some of these jobs are so close to the limit, changes people make in their PRs while developing may cause issues for the CI. For example, if someone accidentally introduces a memory leak and pushes the code without testing locally, it may bring down the CI. This does not appear to be what happened here, since the last run of the CI succeeded without such issues.
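Purely as an illustration of the kind of guard that could help with this concern (not something the workflows do today), here is a hedged Python sketch that launches a test command under a per-process memory cap, so a leaking process is killed by the kernel instead of starving the runner. The `./run_reg_test.py` invocation and the 110 GB cap are placeholder values of mine, not taken from the actual workflows.

```python
# Sketch: run a test command with an address-space limit (Linux/POSIX only).
# RLIMIT_AS is per process and inherited by children, so this caps any single
# leaking process rather than the aggregate of all 16 parallel tasks; a cgroup
# or systemd-run limit would be needed to cap the whole job.
import resource
import subprocess

MEMORY_CAP_BYTES = 110 * 1024**3  # placeholder: leave headroom below the 125 GB capacity

def limit_memory():
    # Applied in the child process before exec.
    resource.setrlimit(resource.RLIMIT_AS, (MEMORY_CAP_BYTES, MEMORY_CAP_BYTES))

proc = subprocess.Popen(
    ["./run_reg_test.py", "vtr_reg_nightly_test2", "-j", "16"],  # placeholder invocation
    preexec_fn=limit_memory,
)
proc.wait()
print(f"test exited with code {proc.returncode}")
```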
I wanted to raise this investigation as an issue to see what people think.