
Idea: try parallelisation with C++17 `std::execution`

Open nikosavola opened this issue 3 years ago • 5 comments

C++17 introduced execution policies that make it easy to parallelise standard algorithms such as `std::sort`, see https://en.cppreference.com/w/cpp/algorithm/execution_policy_tag_t. Additionally, for-loops that can be refactored to `std::for_each` without data races are easy to parallelise.

As C++17 already appears to be well supported by GCC and Clang, would it make sense to test whether parts of KLayout could be parallelised this way?

This would of course increase the required version of g++ and clang.
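For illustration only, here is a minimal sketch of what such a refactoring could look like. Nothing in it is KLayout code: the function names and the `SKETCH_USE_PARALLEL_STL` macro are hypothetical. The parallel policy is opt-in at compile time, so the default build stays serial. With GCC's libstdc++, the parallel policies typically also need linking against Intel TBB.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical opt-in switch (an assumption for this sketch, not a KLayout
// build option). Without it, the serial overloads are used.
#if defined(SKETCH_USE_PARALLEL_STL)
#  include <execution>
#  define SKETCH_POLICY std::execution::par,
#else
#  define SKETCH_POLICY
#endif

// A plain index loop such as
//   for (size_t i = 0; i < v.size(); ++i) v[i] = std::sqrt(v[i]);
// rewritten as std::for_each. Each invocation touches only its own element,
// so there is no shared mutable state and no data race.
inline void sqrt_in_place(std::vector<double> &v)
{
  std::for_each(SKETCH_POLICY v.begin(), v.end(),
                [](double &x) { x = std::sqrt(x); });
}

// std::sort takes the execution policy as an optional first argument.
inline void sort_values(std::vector<double> &v)
{
  std::sort(SKETCH_POLICY v.begin(), v.end());
  assert(std::is_sorted(v.begin(), v.end()));
}
```

Calling `sqrt_in_place(v)` followed by `sort_values(v)` gives the same result in both build modes; only the scheduling differs. Whether the parallel build is actually faster depends on input size and backend, so it would have to be measured.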

nikosavola avatar Jul 13 '22 09:07 nikosavola

Thanks, but do you have proof of the benefits (e.g. runtime improvement on critical execution paths), or is this a hypothetical idea? One suggestion is that you try it in the EdgeProcessor domain (std::sort is used there) and report the runtime/parallelisation/memory effects.

I am not in favor of switching to C++17 myself. EDA users are conservative. Old compilers are widely used (e.g. the ones shipped with CentOS7).

klayoutmatthias avatar Jul 13 '22 19:07 klayoutmatthias

> Thanks, but do you have proof of the benefits (e.g. runtime improvement on critical execution paths), or is this a hypothetical idea? One suggestion is that you try it in the EdgeProcessor domain (std::sort is used there) and report the runtime/parallelisation/memory effects.
>
> I am not in favor of switching to C++17 myself. EDA users are conservative. Old compilers are widely used (e.g. the ones shipped with CentOS7).

Yes, I tried initial profiling with Very Sleepy for a typical use case:

[image: Very Sleepy profile of a typical use case]

I can try adding the parallel (and/or vectorisation) policy to some of the places here and benchmarking this at some point.
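A rough sketch of how such a benchmark could be set up is below. It is a generic timing harness around `std::sort`, not a measurement of KLayout's EdgeProcessor. The helper names, the `SKETCH_USE_PARALLEL_STL` macro and the random input are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical opt-in switch; the parallel variant is only compiled on request.
#if defined(SKETCH_USE_PARALLEL_STL)
#  include <execution>
#endif

// Deterministic pseudo-random input, a stand-in for real layout data.
inline std::vector<std::int64_t> make_input(std::size_t n, unsigned seed = 42)
{
  std::mt19937_64 rng(seed);
  std::vector<std::int64_t> v(n);
  for (auto &x : v) x = static_cast<std::int64_t>(rng() % 1000000);
  return v;
}

// Sorts a private copy (taken by value) and returns the elapsed milliseconds.
template <class Sorter>
double time_sort_ms(std::vector<std::int64_t> data, Sorter sorter)
{
  auto t0 = std::chrono::steady_clock::now();
  sorter(data);
  auto t1 = std::chrono::steady_clock::now();
  assert(std::is_sorted(data.begin(), data.end()));
  return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

inline double time_serial_ms(const std::vector<std::int64_t> &input)
{
  return time_sort_ms(input, [](std::vector<std::int64_t> &v) {
    std::sort(v.begin(), v.end());
  });
}

#if defined(SKETCH_USE_PARALLEL_STL)
inline double time_parallel_ms(const std::vector<std::int64_t> &input)
{
  return time_sort_ms(input, [](std::vector<std::int64_t> &v) {
    std::sort(std::execution::par, v.begin(), v.end());
  });
}
#endif
```

For meaningful numbers, one would build with optimisation enabled, repeat each measurement several times, and use input sizes and distributions that resemble real layouts, since small inputs can easily be slower with the parallel policy.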

nikosavola avatar Jul 14 '22 05:07 nikosavola

I also wonder whether this `matrix.to_string` call is somehow inefficient or is just called many times: https://github.com/KLayout/klayout/blob/f7ef538f343a208603288ed3925b561b00d2336a/src/db/db/dbMatrix.cc#L353-L360

nikosavola avatar Jul 14 '22 07:07 nikosavola

I cannot believe that this profile is real.

"matrix.to_string" is hardly ever called in a normal use case - like once when a session containing an image is persisted. I guess something is wrong with the symbols.

I usually use cachegrind for profiling. It shows not only the times but also how many times a function is called. However, because it instruments the program by virtual execution, programs run many times slower under cachegrind.

In general, profiling is one way to approach optimization, but not the best one. By far the largest optimization potential lies in algorithmic improvements. Profiling gives you some hints about where to start, but not much more.

The EdgeProcessor is a good candidate for optimization as it is a core component (so enhancements have a significant effect in many places) and is not too complex. A similar component is the box scanner which is also used in many places.

Matthias

klayoutmatthias avatar Jul 14 '22 21:07 klayoutmatthias

With regard to profiling, here’s a tool with an interesting approach. https://github.com/plasma-umass/coz

stefanottili avatar Jul 15 '22 06:07 stefanottili

Any updates? I'd close this issue otherwise. I don't see a way to implement that request.

Matthias

klayoutmatthias avatar Dec 06 '22 23:12 klayoutmatthias

I haven't had time to experiment with this so I guess this can be closed until further notice.

nikosavola avatar Dec 07 '22 11:12 nikosavola