
Possible hang with OpenMP

Open aulwes opened this issue 1 year ago • 4 comments

Hi, are there known issues with using OpenMP inside MALT? I attempted to modify the code in SymbolSolver::solveNames() to parallelize the loop over the theCommands list to see if I could speed up the addr2line phase. However, it seems to hang. As a simple reproducer, I added this loop

    #pragma omp parallel for
    for ( int c = 0; c < 8; ++c ) {
        int tid = omp_get_thread_num();
        std::cerr << " thread " << tid << " going to sleep..." << std::endl;
        usleep(50000*(tid+1) + c*2000);
    }

right before the run loop to execute addr2line. I tried both Intel and GCC using '-fopenmp', but the above loop hangs as well. I set OMP_NUM_THREADS=4.

Thanks, Rob

aulwes avatar May 02 '24 18:05 aulwes

Hi Rob, thanks for reporting this attempt.

I remember trying this briefly a couple of years ago and seeing what you describe (for exactly that part, to speed up the symbol solving).

I would say there is probably an issue to fix at that stage (on exit) so that we do not enter the library again if it does a malloc. Or possibly it is because we are unloaded very late due to LD_PRELOAD and some OpenMP state is already closed by then.

In principle this should already be the case, but with what you describe (and what I remember of my own attempt) there is certainly a re-entrance problem.
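To make the re-entrance idea concrete, here is a minimal sketch of a per-thread guard. It is not taken from the MALT sources: `gInsideProfiler`, `ReentranceGuard` and `trackedAlloc` are hypothetical names, and the sketch only shows how a nested allocation made by the tool itself can be skipped. It does not claim to explain or fix the hang reported above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical names: none of these come from the MALT sources.
namespace sketch {

// Per-thread flag: true while this thread is already inside the profiler.
thread_local bool gInsideProfiler = false;

// Number of allocations actually recorded on this thread (illustration only).
thread_local std::size_t gRecorded = 0;

// RAII guard that marks the current thread as "inside" for its lifetime
// and restores the previous state when it goes out of scope.
class ReentranceGuard {
public:
    ReentranceGuard() : wasInside_(gInsideProfiler) { gInsideProfiler = true; }
    ~ReentranceGuard() { gInsideProfiler = wasInside_; }
    // True only for the outermost guard on this thread.
    bool isOutermost() const { return !wasInside_; }
private:
    bool wasInside_;
};

// Stand-in for a wrapped malloc: only outermost calls are recorded.
void* trackedAlloc(std::size_t size) {
    ReentranceGuard guard;
    if (guard.isOutermost()) {
        ++gRecorded;
        // Bookkeeping that allocates re-enters trackedAlloc here; the nested
        // call sees the flag already set and skips the recording step.
        void* bookkeeping = trackedAlloc(16);
        std::free(bookkeeping);
    }
    return std::malloc(size);
}

} // namespace sketch
```

Because the flag is `thread_local`, each worker thread (OpenMP or otherwise) gets its own copy, so the guard only protects against a thread re-entering through its own call stack.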

Anyway, I wanted to clean up that part of the code because of some patches added by others this year, so that I can make a new release soon. I can try to focus on that when I return in mid-May and see if I can also look at parallelism. I would probably just use pthread at this layer instead of OpenMP to limit interaction with extra components. But I can first check what the current status is.
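As a rough illustration of running the per-file commands on plain threads rather than OpenMP, the sketch below uses `std::thread` (which sits on top of pthreads on Linux) instead of the raw pthread API. `solveInParallel` and its `solve` callback are hypothetical names; in a real setting `solve` would wrap the call to addr2line or nm, and none of this reflects how MALT is actually structured.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <thread>
#include <vector>

// Hypothetical helper, not part of MALT.
namespace sketch {

// Apply `solve` to every entry of `commands` using `workerCount` threads.
// Results are stored at the same index as their command.
// Assumption: `solve` is thread-safe and does not throw.
std::vector<std::string> solveInParallel(
    const std::vector<std::string>& commands,
    const std::function<std::string(const std::string&)>& solve,
    unsigned workerCount)
{
    std::vector<std::string> results(commands.size());
    std::atomic<std::size_t> next{0};

    // Each worker repeatedly claims the next unprocessed index.
    auto worker = [&]() {
        for (;;) {
            const std::size_t i = next.fetch_add(1);
            if (i >= commands.size()) return;
            results[i] = solve(commands[i]);
        }
    };

    if (workerCount == 0) workerCount = 1;
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < workerCount; ++t) threads.emplace_back(worker);
    for (auto& th : threads) th.join();
    return results;
}

} // namespace sketch
```

Claiming indices from a shared atomic counter balances the load when one file takes far longer than the others, which matters if a single large library dominates the run time.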

Someone is giving me feedback to see whether MALT can optionally run on MacOSX as well, and we will certainly have problems in that part of the code, so I have a reason to look at it again at the end of the month anyway.

Just a question to understand: I looked into making symbol solving parallel myself when applying MALT to a very large C++ code at CERN, because it was taking a long time at the end of the run. I suppose the problem is similar for you?

svalat avatar May 03 '24 09:05 svalat

Thank you for checking! What my colleague found is that the slowdown we're seeing is from using the nm tool to get the global variables, and not from addr2line. For one of the executables we're profiling, this nm step took over 20 minutes. What we discussed is whether this nm step could be done separately and cached ahead of time until MALT needed it. Or, could we use objdump instead?

aulwes avatar May 03 '24 13:05 aulwes

Hmm, interesting to know. From a quick look, I can probably use readelf directly if the debug symbols are there (at least for the parts that have them). This would also add something missing in MALT: the source origin of the global variables.

https://stackoverflow.com/questions/11003376/extract-global-variables-from-a-out-file/11056685#11056685

Can you measure, on your problematic case, how much time readelf -s, readelf -w and nm --print-size -l -n -P --no-demangle take on the various libs and the executable taken together?
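The shell's `time` builtin is enough for this measurement. If it is more convenient to script it from C++, a small helper along these lines could be used; `timeCommand` is a hypothetical name, and the sketch assumes a POSIX shell behind `std::system` and discards the command output so that printing does not dominate the timing.

```cpp
#include <cassert>
#include <chrono>
#include <cstdlib>
#include <string>

// Hypothetical helper, not part of MALT.
namespace sketch {

// Run `command` through the shell and return the wall-clock time in seconds.
// Output is redirected to /dev/null (assumes a POSIX shell).
double timeCommand(const std::string& command) {
    const std::string silenced = command + " > /dev/null 2>&1";
    const auto start = std::chrono::steady_clock::now();
    const int status = std::system(silenced.c_str());
    const auto stop = std::chrono::steady_clock::now();
    (void)status; // The exit status is ignored in this sketch.
    return std::chrono::duration<double>(stop - start).count();
}

} // namespace sketch
```

For example, `sketch::timeCommand("nm --print-size -l -n -P --no-demangle ./my_app")` would time the nm step on a placeholder binary `./my_app`; calling it once per library and once for the executable shows whether a single file dominates.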

I suppose it comes from one big lib or from the executable itself? Or would parallelism reduce it a lot because multiple files are involved?

What you propose looks interesting. For the globals we can also offer options to either:

  1. Not track global variables (in principle this is needed once, in a first study; after that, most of the time we don't want to look at it anymore)
  2. As you propose, use a cache, at least for the fixed libs which are not recompiled, based on the MD5SUM of the object. But for the executable itself, if the cost comes from it, I think this is harder to decide, even if we can offer an option.
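The cache option could look roughly like the sketch below: the symbol listing is keyed by a hash of the object's content, so an unchanged library is never re-scanned. This is only an illustration under assumptions: `SymbolCache` and `contentHash` are hypothetical names, the cache lives in memory rather than on disk, and 64-bit FNV-1a stands in for the MD5 sum because the C++ standard library has no MD5.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>

// Hypothetical names, not part of MALT.
namespace sketch {

// 64-bit FNV-1a, used here only as a stand-in for the MD5 sum of the object.
std::uint64_t contentHash(const std::string& bytes) {
    std::uint64_t h = 14695981039346656037ULL; // FNV offset basis
    for (unsigned char c : bytes) {
        h ^= c;
        h *= 1099511628211ULL;                 // FNV prime
    }
    return h;
}

// Maps the hash of an object's content to its extracted symbol listing.
// Hash collisions are ignored in this sketch.
class SymbolCache {
public:
    // Return the listing for `objectBytes`, calling `extract` (for example
    // a wrapper around nm) only when the content has not been seen before.
    const std::string& lookup(const std::string& objectBytes,
                              const std::function<std::string()>& extract) {
        const std::uint64_t key = contentHash(objectBytes);
        auto it = entries_.find(key);
        if (it == entries_.end()) {
            ++misses_;
            it = entries_.emplace(key, extract()).first;
        }
        return it->second;
    }

    // Number of times `extract` had to be run.
    std::size_t misses() const { return misses_; }

private:
    std::unordered_map<std::uint64_t, std::string> entries_;
    std::size_t misses_ = 0;
};

} // namespace sketch
```

Keying on content rather than on path or timestamp means a recompiled executable simply misses the cache, which is the hard case mentioned in point 2: the cache helps for stable libraries but not for a binary that changes on every build.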

svalat avatar May 03 '24 16:05 svalat