
Improve the CI time

Open tigercosmos opened this issue 1 year ago • 27 comments

Currently, we have too many GitHub action items, which results in a long building and testing time. We need to refactor the current CI files and make the CI faster.

tigercosmos avatar Jun 24 '24 02:06 tigercosmos

How can we improve it? Maybe we can list the action items so anyone can work on them.

seladb avatar Jun 24 '24 05:06 seladb

@seladb let's brainstorm some ideas to improve this. cc @Dimi1010 @egecetin @clementperon

tigercosmos avatar Jun 24 '24 06:06 tigercosmos

I didn't look at the pipeline, but using a cache + ccache could avoid rebuilding some files. Otherwise, we could limit the number of OSes and run a nightly pipeline with all the OS/configuration combinations.
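
For illustration, a minimal sketch of what the ccache idea could look like, assuming a CMake-based build (the cache save/restore step between CI runs is omitted):

# Install ccache on the runner and route every compiler invocation through it
sudo apt-get install -y ccache
cmake -S . -B build -DCMAKE_C_COMPILER_LAUNCHER=ccache -DCMAKE_CXX_COMPILER_LAUNCHER=ccache
cmake --build build -j"$(nproc)"

# Print hit/miss statistics to verify the cache is actually being used
ccache -s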

clementperon avatar Jun 24 '24 06:06 clementperon

My suggestion is that if caching is not enough, we can distribute the workflows across several providers to avoid hitting compute limits. For example,

  • Cirrus: only run the doxygen build
  • Azure: only Windows runners
  • GitHub: Linux runners
  • and so on ...

It would make the CI a bit harder to maintain, but we can handle it if we have to

egecetin avatar Jun 24 '24 16:06 egecetin

@egecetin I think that could be the last step. The issue is that the current CI script is not efficient enough. It's like how we should improve the algorithm before trying to use multithreading(?)

tigercosmos avatar Jun 24 '24 18:06 tigercosmos

@tigercosmos Yes, of course. If possible, I agree we should improve the pipeline first. To be honest, I didn't check all the workflows, but most of the runs take 5-8 minutes (approximately 2-3 minutes of that is the build, so the build is about 30%-50% of a run), right? So I'm not sure we can dramatically improve the CI runtime (even with caching) unless we limit the runner OS/version/config or improve parallelism. I think that to call the library "multiplatform", the code should be tested on as many platforms as possible. So I don't know what else we can do.

egecetin avatar Jun 24 '24 18:06 egecetin

Something that comes to my mind: testing consumes approximately 1 minute. If I'm not mistaken, the current test suite does not support multithreading (@seladb correct me if I'm wrong, please). So maybe using a test library that supports multithreading might lead to better testing times? (e.g. integrating GoogleTest + CTest to run tests in parallel and efficiently saturate the CPU)

Or maybe we can just remove the "Check Installation" step, which could halve the workflow time, but I'm not sure about this.
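
As a rough sketch of the parallel-run idea, assuming the tests were registered with CTest (e.g. via GoogleTest's gtest_discover_tests), which is not the case today:

# Let CTest schedule the registered tests across all available CPU cores
ctest --test-dir build --output-on-failure -j"$(nproc)"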

egecetin avatar Jun 24 '24 18:06 egecetin

@egecetin True, I think parallelizing the test cases may be the key. We should probably focus on that as the next step.

tigercosmos avatar Jun 24 '24 18:06 tigercosmos

@seladb what do you think?

tigercosmos avatar Jun 26 '24 05:06 tigercosmos

The "Check installation" step re-builds the entire library, even though it was built just a few steps earlier. Maybe we can use a cache to prevent that?
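
For example, assuming the build tree from the earlier build step is still available in the same job (or most object files come from a restored ccache), the installation check could be reduced to something like:

# Re-running the build is close to a no-op when nothing changed; then install from the same tree
cmake --build build -j"$(nproc)"
sudo cmake --install build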

Regarding the tests - they indeed take between 1 min (on Linux) and 4 mins (on Windows), so parallelism may help. But replacing the whole testing framework will take a very long time and I'm not sure it's worth the relatively small runtime gain...

seladb avatar Jul 04 '24 11:07 seladb

We have 39 checks now; if we can save one minute on each, we save close to 40 minutes in total. I think both caching and test parallelism are worth a try.

We don't need to switch to a new testing framework; I think it's not hard to introduce parallelization in the current one.

tigercosmos avatar Jul 04 '24 15:07 tigercosmos

Sure, we can try both. Hopefully it will save some of the CI time.

seladb avatar Jul 05 '24 12:07 seladb

I will try to work on the test parallelization

tigercosmos avatar Jul 05 '24 16:07 tigercosmos

@tigercosmos Maybe we can close this issue, what do you think? With ccache and disabling runs for old commits on PRs, CI performance has improved, and moving to GTest is now mostly about improving the test framework itself rather than CI performance.

egecetin avatar Nov 05 '24 17:11 egecetin

@tigercosmos Maybe we can close this issue, what do you think? With ccache and disabling runs for old commits on PRs, CI performance has improved, and moving to GTest is now mostly about improving the test framework itself rather than CI performance.

I agree we can close it 👍 @tigercosmos do you have any concerns?

seladb avatar Nov 06 '24 08:11 seladb

I was expecting test parallelism, but if we are definitely moving to GTest, then this can be closed.

tigercosmos avatar Nov 06 '24 15:11 tigercosmos

I was expecting test parallelism, but if we are definitely moving to GTest, then this can be closed.

I assume you mean parallelizing the test runs? It will only show a benefit in Pcap++Test because Packet++Test is pretty fast. Still, I think the test runtime will not improve the CI runtime by a lot because the bottleneck is mostly the time it takes to build the project, which I'm not sure we can do much about. Here is an example breakdown: [screenshot of per-step CI timings]

Even if we improve the "Test PcapPlusPlus" step by 20-30 seconds, it won't matter much to the overall CI runtime

seladb avatar Nov 08 '24 07:11 seladb

Even if we improve the "Test PcapPlusPlus" step by 20-30 seconds, it won't matter much to the overall CI runtime

We have many platforms; the saving is about 30 seconds for now. If we can save 30 seconds on each check, it adds up to roughly 15 minutes (of course, this number is divided by the number of parallel runners, which is 3 or 4). I think it's a huge improvement.

tigercosmos avatar Nov 08 '24 07:11 tigercosmos

Also, if we improve the efficiency of the tests, they will run faster when we test locally, where the build time can be ignored.

tigercosmos avatar Nov 08 '24 07:11 tigercosmos

Most of the runners run in parallel, so even if we achieve a performance improvement of 20-30 seconds per run, it won't save more than 1-1.5 minutes of overall CI runtime. I'm not sure it's worth the effort...

seladb avatar Nov 08 '24 08:11 seladb

As previously mentioned, local testing is time-consuming, and we should improve its speed. We can either continue with this issue or open a new one specifically focused on reducing testing time.

tigercosmos avatar Nov 09 '24 08:11 tigercosmos

Currently, we only bind to one interface, which restricts the possibility of parallelism, so we could bind to multiple interfaces (including virtual ones) if possible.
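
For illustration only, a rough sketch of what that could look like on a Linux runner (the interface names, addresses, and shard split below are hypothetical, and the current suite is not designed to run multiple instances concurrently):

# Create a pair of connected virtual interfaces and give each one an address
sudo ip link add veth-test0 type veth peer name veth-test1
sudo ip addr add 10.0.0.1/24 dev veth-test0
sudo ip addr add 10.0.0.2/24 dev veth-test1
sudo ip link set veth-test0 up
sudo ip link set veth-test1 up

# Hypothetical split: run two shards of the suite against different interfaces
./Bin/Pcap++Test -i 10.0.0.1 -t tcp_reassembly &
./Bin/Pcap++Test -i 10.0.0.2 -t ip_frag &
wait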

tigercosmos avatar Nov 09 '24 08:11 tigercosmos

I'm not sure it's easy to distribute the tests across all interfaces and get consistent results; it will probably require more work 🤔. Just out of curiosity, what is the approximate current runtime in local environments? For me it's around 2.5 minutes, which is not bad, I think. Most of the time I'm not testing the whole library locally, just the changed module (Packet++ or Pcap++ only, for example).

egecetin avatar Nov 09 '24 10:11 egecetin

Why would running each test case on a different interface yield different results? We're parallelizing across test cases, not within each individual test case. Typically, I have to comment out any unnecessary tests to speed things up, which is inconvenient. Perhaps we could add a test-filtering flag, similar to what other testing frameworks provide.

tigercosmos avatar Nov 10 '24 11:11 tigercosmos

Why would running each test case on a different interface yield different results? We're parallelizing across test cases, not within each individual test case. Typically, I have to comment out any unnecessary tests to speed things up, which is inconvenient. Perhaps we could add a test-filtering flag, similar to what other testing frameworks provide.

We do have test-filtering flags: -t and -x:

-t --include-tags        A list of semicolon separated tags for tests to run
-x --exclude-tags        A list of semicolon separated tags for tests to exclude

These tags can also be a single test, for example:

.\Bin\Pcap++Test.exe -i 1.1.1.1 -t TestIPAddress

Or:

.\Bin\Pcap++Test.exe -i 1.1.1.1 -t tcp_reassembly

Or:

.\Bin\Pcap++Test.exe -i 1.1.1.1 -t tcp_reassembly;ip_frag

seladb avatar Nov 10 '24 22:11 seladb

Thanks! It's good to know this. I guess we just need to support running tests in parallel. As there is still room for improvement in CI, I would suggest closing the issue later.

tigercosmos avatar Nov 11 '24 01:11 tigercosmos

@tigercosmos I thought the tcpreplay invoked by the test Python script generates strictly specific traffic and is practically synchronized, so I thought it might fail if run asynchronously. But it looks like it just generates "some" traffic, my bad.

egecetin avatar Nov 11 '24 07:11 egecetin

I'm not sure we can do much about the CI time, and it's not too bad as it is. I'm closing this issue

seladb avatar Nov 15 '25 04:11 seladb