gatb-core icon indicating copy to clipboard operation
gatb-core copied to clipboard

A different parallelisation granularity in GATB

Open leoisl opened this issue 3 years ago • 2 comments

Dear GATB team,

In pandora, we use GATB to make a local assembly of de-novo variants. We are facing a performance issue where we have to perform several thousands local assemblies using GATB. Removing some outliers that we are currently dealing with, usually these local assemblies are made on small graphs, and we can process each one of them very fast (usually in less than 1 second, in some cases more, but 5 seconds is the upper limit removing some outliers). This is all using 1 thread (giving -nb-cores 1 when building the graph). Our performance problem arises from the fact that we have to perform this small assembly several thousand times, adding up to a considerable runtime.

A natural way to speed this up is multithreading the processing of all these local assemblies, but we are facing issues with GATB in this case. I think it is reasonable that GATB was not designed to have several graphs built by different threads simultaneously in memory, as the general use case I guess is a huge graph built from NGS reads, instead of thousands of small graphs as is our particular use case. Anyway, a minimum working example where we can reproduce the issues we are having can be found here, where we simply start 8 threads, and each one tries to build the same graph, but we get several types of runtime errors. If we comment the #pragma omp line, everything runs single-threadedly and well.

I think we already identified and solved one issue with this type of multithreading. Running strace -y -t -e trace=open,close, we could identify that GATB created several temporary (trashme*) files using the process name (see https://github.com/GATB/gatb-core/blob/7cb8a48cc958f4266c74db57f02e937d583ba9b7/gatb-core/src/gatb/system/impl/FileSystemCommon.cpp#L185). This means that all threads would read/write to the same files simultaneously, which would certainly cause crashes. We tried to fix this with these changes, and this part seems to have worked, as strace shows that temporary files are now created with different names.

However, even with these changes, we still get several errors. We looked at these errors by running our multithreading example in debug mode and looking at the stack frames when a segmentation fault happened. In several cases, the errors were within the HDF5 library, which I think was not compiled with the threadsafe parameter turned on, but this answer on SO seems to show that is hard to have multiple threads accessing HDF5 library in C++. Other errors were memory corruption errors (e.g. double frees, destructors called twice, etc), but it seems that all these errors are related with some GATB singleton objects, like https://github.com/GATB/gatb-core/blob/7cb8a48cc958f4266c74db57f02e937d583ba9b7/gatb-core/src/gatb/tools/misc/impl/HostInfo.hpp#L54

which makes sense.

This short investigation made us realise that enabling this type of parallelisation in GATB might be more complicated than we thought, and might need a considerable time investment, which we can't afford to do it. We would like to consult you whether you think this is indeed a complicated issue to solve that needs considerable development, or if it is only a few changes here and there.

Thanks for reaching until here! I owe you a cookie when we meet again :p !

leoisl avatar Oct 28 '20 00:10 leoisl

I think this issue is likely related to only one thread being able to use the HDF5 library at a time. According to https://support.hdfgroup.org/HDF5/faq/threadsafe.html HDF5 can be built to be threadsafe, but a couple of flags need to be turned on.

@rchikhi do you know if this option is turned on for GATB?

mbhall88 avatar Nov 03 '20 06:11 mbhall88

Hi @leoisl, @mbhall88, sorry for dropping the ball on this. Coming back to it, I can comment on a couple of points:

  • we did look at threadsafe some years ago, and the way to activate it is to uncomment this line: https://github.com/GATB/gatb-core/blob/7cb8a48cc958f4266c74db57f02e937d583ba9b7/gatb-core/thirdparty/CMakeLists.txt#L6
  • otherwise nice investigation, it seems that the "thrashme" files would need to have some randomized identifier that doesn't just depend on the process Id. From my opinion, it's probably a few changes here and there regarding those files that are dumped on disk. My mental model of GATB is that everything regarding the Graph class is packaged into a neatly isolated object, so we shouldn't run into issues having 2 or more graphs at the same time.

Rayan

rchikhi avatar Feb 05 '21 16:02 rchikhi