
Evaluate using Profile-Guided Optimization (PGO) for Asterinas

Open zamazan4ik opened this issue 10 months ago • 5 comments

Hi. Thanks for the project!

Profile-Guided Optimization (PGO) is a compiler optimization that uses runtime statistics to make better optimization decisions. I have applied PGO to a variety of software in different domains (including operating systems such as the Linux kernel); all the results are available at https://github.com/zamazan4ik/awesome-pgo . Given multiple reports that PGO helps operating systems like Linux and Windows, I think evaluating PGO's effects on Asterinas would be valuable, since (according to the website) Asterinas aims to provide peak performance.

I can suggest the following action points:

  • Benchmark Asterinas with PGO. If it shows positive effects, add a note about PGO to the documentation.
  • Integrate building Asterinas with PGO into the build scripts.
  • Optimize prebuilt binaries (if any) with PGO, provided a sufficiently representative workload can be gathered.

By the way, did you think about enabling Link-Time Optimization (LTO) for the project too? Like PGO, this optimization can help achieve better performance, and it is easy to enable: https://doc.rust-lang.org/rustc/codegen-options/index.html#lto.

This issue is not a bug - it's just an idea of how Asterinas' performance can be improved.

zamazan4ik, Apr 16 '24 07:04

Thanks for your nice suggestion!

> I think evaluating PGO's effects on Asterinas would be valuable, since (according to the website) Asterinas aims to provide peak performance.

Asterinas warmly welcomes all contributions! While I am not familiar with PGO, I have taken a preliminary look at your article. I understand that PGO first instruments the program to gather runtime statistics and then recompiles it based on those statistics.

However, I have a couple of concerns:

  1. Will PGO generate different binaries for different benchmarks? For instance, if we run the Redis benchmark, will it generate one binary, and if we run the Nginx benchmark, will it generate another? It is worth noting that different benchmarks may elicit varying paths with different frequencies. However, an operating system is designed to handle all types of workloads, so generating a specific binary for each benchmark might not be ideal.
  2. If we were to include all the benchmarks we can think of for collecting statistics (assuming these benchmarks represent common workloads), would it significantly increase the compilation time?

> did you think about enabling Link-Time Optimization (LTO) for the project too?

In theory, to achieve optimal performance, LTO should be enabled. However, the specific performance benefits it can bring are uncertain, and it would be beneficial to obtain evaluation results. Considering that LTO can potentially impact compilation time, I suggest adding an option to our build script that allows users to enable or disable LTO according to their preferences. This way, users can have control over whether LTO is utilized during compilation.
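One way such an opt-in switch could look is sketched below. This is only an illustration: the `LtoMode` type and its option values are hypothetical, and whether the Asterinas build tooling passes the environment through to Cargo is an assumption. The one real piece is Cargo's `CARGO_PROFILE_RELEASE_LTO` environment variable, which overrides `profile.release.lto`.

```rust
/// Hypothetical LTO setting that a build script could expose to users.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LtoMode {
    Off,
    Thin,
    Fat,
}

impl LtoMode {
    /// Parses a user-supplied option value (for example from `--lto=thin`).
    /// Returns `None` for values this sketch does not know about.
    pub fn parse(value: &str) -> Option<Self> {
        match value {
            "off" => Some(Self::Off),
            "thin" => Some(Self::Thin),
            "fat" => Some(Self::Fat),
            _ => None,
        }
    }

    /// Returns the (name, value) environment variable pair that overrides
    /// `profile.release.lto` when set for a `cargo build --release` run.
    pub fn cargo_env(self) -> (&'static str, &'static str) {
        let value = match self {
            Self::Off => "off",
            Self::Thin => "thin",
            Self::Fat => "fat",
        };
        ("CARGO_PROFILE_RELEASE_LTO", value)
    }
}
```

A build script could call `LtoMode::parse` on the user's option and set the returned variable on the child Cargo process, leaving LTO at the profile's default when the option is absent.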

StevenJiang1110, Apr 17 '24 03:04

Thanks for the interest!

> I understand that PGO first instruments the program to gather runtime statistics and then recompiles it based on those statistics.

Almost right, yes. There are two major kinds of PGO: instrumentation-based and sampling-based. Instrumentation PGO is the most common and requires a two-phase compilation model (an instrumentation build followed by an optimization build); for rustc, this approach is described here. Sampling PGO collects runtime statistics without instrumentation, using an external profiler such as Linux perf, so you don't need to compile your application twice. This approach is also known as AutoFDO and is described quite well in the Clang documentation.
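To make the two instrumentation phases concrete, here is a minimal sketch that only builds the compiler flags for each phase. `-Cprofile-generate` and `-Cprofile-use` are real rustc codegen options; the `PgoPhase` type, the function name, and any paths passed in are made up for illustration. A kernel would additionally need some way to get the raw profile data out of the guest, which this sketch does not address.

```rust
/// The two compilation phases of instrumentation-based PGO.
/// (Type and field names are hypothetical.)
pub enum PgoPhase<'a> {
    /// Phase 1: build an instrumented binary that records raw profiles
    /// into `profile_dir` while the workload runs.
    Instrument { profile_dir: &'a str },
    /// Phase 2: rebuild using a merged profile, typically produced with
    /// `llvm-profdata merge -o <merged_profile> <profile_dir>`.
    Optimize { merged_profile: &'a str },
}

/// Returns the rustc flag for the given phase, suitable for `RUSTFLAGS`.
pub fn pgo_rustflags(phase: &PgoPhase) -> String {
    match phase {
        PgoPhase::Instrument { profile_dir } => {
            format!("-Cprofile-generate={profile_dir}")
        }
        PgoPhase::Optimize { merged_profile } => {
            format!("-Cprofile-use={merged_profile}")
        }
    }
}
```

Between the two builds, the instrumented binary is run under a representative workload and the resulting raw profiles are merged into a single file.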

> Will PGO generate different binaries for different benchmarks? For instance, if we run the Redis benchmark, will it generate one binary, and if we run the Nginx benchmark, will it generate another? It is worth noting that different benchmarks may elicit varying paths with different frequencies.

Yes, PGO will generate different binaries for different workloads, for exactly the reason you mentioned: different workloads execute different paths in the operating system.

> However, an operating system is designed to handle all types of workloads, so generating a specific binary for each benchmark might not be ideal.

That's true. However, there are still use cases for applying PGO to operating systems. First, you can collect PGO profiles from all the different workloads, merge them into one, and then use this "generic" profile during the optimization phase. In this case, you optimize the kernel for multiple workloads at once. Second, there are Application Specific Operating Systems (ASOS): you can optimize a kernel specifically for one application to extract as much performance as possible for that concrete workload. Is it helpful in practice? Yes, because in reality we often have servers for databases, servers for regular web backends, servers for ML workloads, etc., and for each of them we can prepare a dedicated kernel version. The idea is described in detail in this paper.
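As a conceptual illustration of merging per-workload profiles into one "generic" profile, the toy model below treats a profile as a map from a code location to an execution count and sums the counts, optionally weighting some workloads more heavily. This is not how the real tooling stores profiles (in practice `llvm-profdata merge` does this job on profile files); the `Profile` type and the site names used with it are hypothetical.

```rust
use std::collections::BTreeMap;

/// Toy stand-in for a PGO profile: execution counts keyed by code location.
pub type Profile = BTreeMap<String, u64>;

/// Merges several workload profiles into one by summing weighted counts.
/// Each input is a `(weight, profile)` pair; a weight of 2 makes that
/// workload count twice as much as a workload with weight 1.
pub fn merge_profiles(inputs: &[(u64, &Profile)]) -> Profile {
    let mut merged = Profile::new();
    for (weight, profile) in inputs {
        for (site, count) in profile.iter() {
            let entry = merged.entry(site.clone()).or_insert(0);
            // Saturating arithmetic keeps very hot counters from overflowing.
            *entry = entry.saturating_add(count.saturating_mul(*weight));
        }
    }
    merged
}
```

A location that is hot in only one workload still appears in the merged profile, just with a smaller share of the total, which is why a merged profile tends to give a compromise rather than the best result for any single workload.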

I understand that prebuilding for such cases is not a viable option for you. Instead, I suggest providing a way to build Asterinas with PGO (such as a dedicated build script switch). In this case, users will be able to perform PGO according to their own workloads.

> If we were to include all the benchmarks we can think of for collecting statistics (assuming these benchmarks represent common workloads), would it significantly increase the compilation time?

If we are talking about instrumentation PGO, the total build time will at least double: an instrumentation build, then running the benchmarks, then the release build. However, this should not be critical, since you usually don't want to perform PGO-optimized builds for each commit; running them per release is OK. You can also cache PGO profiles in the repo and reuse them later, so you don't need to spend time generating profiles every time. But be careful: if you go this way, you need to ensure that your saved PGO profiles are not stale and that they remain compatible with newer compiler versions (this is almost impossible to guarantee in practice, so you will need to regenerate the profiles after each compiler update anyway).

So enabling PGO-optimized builds only for releases should be OK.
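If profiles are cached in the repo, the staleness check mentioned above could be sketched roughly as follows. Everything here is hypothetical: the metadata fields, the commit-drift threshold, and the policy itself are assumptions for illustration, not an existing Asterinas or rustc mechanism.

```rust
/// Metadata a build script might store next to a cached PGO profile.
/// All field names are hypothetical.
pub struct CachedProfileMeta {
    /// Compiler version string recorded when the profile was collected.
    pub rustc_version: String,
    /// Number of commits made since the profile was collected.
    pub commits_since_collected: u32,
}

/// Outcome of checking a cached profile before a release build.
#[derive(Debug, PartialEq, Eq)]
pub enum ProfileStatus {
    Usable,
    CompilerChanged,
    TooOld,
}

/// Decides whether a cached profile can be reused.
/// A compiler change always forces regeneration; otherwise the profile is
/// accepted until the code has drifted by more than `max_commit_drift` commits.
pub fn check_profile(
    meta: &CachedProfileMeta,
    current_rustc_version: &str,
    max_commit_drift: u32,
) -> ProfileStatus {
    if meta.rustc_version != current_rustc_version {
        ProfileStatus::CompilerChanged
    } else if meta.commits_since_collected > max_commit_drift {
        ProfileStatus::TooOld
    } else {
        ProfileStatus::Usable
    }
}
```

A release script could run this check first and fall back to the full instrumentation build plus benchmark run whenever the result is not `Usable`.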

> Considering that LTO can potentially impact compilation time, I suggest adding an option to our build script that allows users to enable or disable LTO according to their preferences. This way, users can have control over whether LTO is utilized during compilation.

+1. That would be the best way.

zamazan4ik, Apr 17 '24 09:04

For your reference, we attempted months ago to support LLVM instrumentation (aimed at coverage at the time). That attempt was closed because:

  1. We haven't got scripts to dump data from the guest to host using network/vsock/9pfs, although we can access the instrumentation result in the Asterinas guest user space;
  2. Getting coverage stats was not a main goal for us at that time.

If you want to tackle the above issues, #382 may serve as a reference design (but the build system has since been replaced by OSDK, so the scripts in that PR won't work anymore). I am happy to provide further assistance.

junyang-zh, Apr 19 '24 07:04

> 1. We haven't got scripts to dump data from the guest to host using network/vsock/9pfs, although we can access the instrumentation result in the Asterinas guest user space;

The vsock PR is being reviewed. So this obstacle is about to be resolved.

tatetian, Apr 22 '24 02:04

@zamazan4ik Thanks for bringing PGO (and LTO as well) to our attention. This is a good suggestion.

Currently, our focus is to identify performance bottlenecks at the design and implementation levels as they are more fundamental to an OS kernel. But as the kernel grows in maturity, applying more advanced optimizations at the compiler level would definitely be helpful.

Let's keep this issue open until someone from the community takes on the job of applying PGO to Asterinas.

tatetian, Apr 22 '24 02:04