klayout Idea: try parallelisation with C++17 `std::execution`

C++17 contains execution policies with easy parallelisation supported by methods like std::sort, see https://en.cppreference.com/w/cpp/algorithm/execution_policy_tag_t. Additionally, for-loops that are refactorable to std::for_each without data races are easily parallelised.

As C++17 appears to be already quite supported by gcc and clang, would it make sense to test if parts of KLayout could be parallelised using this?

This would of course increase the required version of g++ and clang.

Jul 13 '22 09:07 nikosavola

Thanks, but do you have a proof of benfits (e.g. runtime improvement on critical execution paths), or is that a hypothetical idea? One suggestion is that you try that in the EdgeProcessor domain (std::sort is used there) and report runtime/parallelisation/memory effects.

I am not in favor of switching to C++17 myself. EDA users are conservative. Old compilers are widely used (e.g. the ones shipped with CentOS7).

Jul 13 '22 19:07 klayoutmatthias

Thanks, but do you have a proof of benfits (e.g. runtime improvement on critical execution paths), or is that a hypothetical idea? One suggestion is that you try that in the EdgeProcessor domain (std::sort is used there) and report runtime/parallelisation/memory effects.

I am not in favor of switching to C++17 myself. EDA users are conservative. Old compilers are widely used (e.g. the ones shipped with CentOS7).

Yes, I tried initial profiling with Very Sleepy for a typical use case:

I can try adding the parallel (and/or vectorisation) policy to some of the places here and benchmarking this at some point.

Jul 14 '22 05:07 nikosavola

I also wonder if this matrix.to_string call is somehow inefficient or is it just called a lot of times https://github.com/KLayout/klayout/blob/f7ef538f343a208603288ed3925b561b00d2336a/src/db/db/dbMatrix.cc#L353-L360

Jul 14 '22 07:07 nikosavola

I cannot believe that this profile is real.

"matrix.to_string" is hardly ever called in a normal use case - like once when a session containing an image is persisted. I guess something is wrong with the symbols.

I usually use cachegrind for profiling. It will not only show the times but also how many times a function is called. However, as it implements instrumentation by virtual execution, the run times are many times slower while running in cachegrind.

In general, profiling is one, but not the best way for optimization. The by far largest optimization potential is through algorithmic improvements and profiling gives you some hints where to start with, but not much more.

The EdgeProcessor is a good candidate for optimization as it is a core component (so enhancements have a significant effect in many places) and is not too complex. A similar component is the box scanner which is also used in many places.

Matthias

Jul 14 '22 21:07 klayoutmatthias

With regard to profiling, here’s a tool with an interesting approach. https://github.com/plasma-umass/coz

Jul 15 '22 06:07 stefanottili

Any updates? I'd close this issue otherwise. I don't see a way to implement that request. Matthias

Dec 06 '22 23:12 klayoutmatthias

I haven't had time to experiment with this so I guess this can be closed until further notice.

Dec 07 '22 11:12 nikosavola

klayout klayout copied to clipboard

Idea: try parallelisation with C++17 `std::execution`

klayout
klayout copied to clipboard