
Multi-threading improvements

Open bricoletc opened this issue 6 years ago • 4 comments

Following #127, here is the speedup obtained with quasimap at different thread counts:

[Figure: Fold_speedup]

The black line is the theoretical best speedup (= number of threads).

Dataset: 250,000 reads (so 500,000 including reverse complements) from /nfs/leia/research/iqbal/bletcher/Pf_benchmark/all_reads/_original_/PG0496-C.bam, mapped to the pf3k PRG built from /nfs/research1/zi/bletcher/Pf_benchmark/ref_data (yoda cluster).

Command (commit 8b46a86):

bsub -R select[mem>60000] rusage[mem=60000] -M60000 -J threads_4 -n 4 \
  -o /nfs/leia/research/iqbal/bletcher/Pf_benchmark/tests/threads/logs/t4_k8.o \
  -e /nfs/leia/research/iqbal/bletcher/Pf_benchmark/tests/threads/logs/t4_k8.e \
  singularity exec /nfs/leia/research/iqbal/bletcher/Singularity_Images/8b46a86_gramtools.img \
  gramtools quasimap \
  --gram-dir /nfs/leia/research/iqbal/bletcher/Pf_benchmark/tests/gram_k8 \
  --run-dir /nfs/leia/research/iqbal/bletcher/Pf_benchmark/tests/threads/run_t4 \
  --reads /nfs/leia/research/iqbal/bletcher/Pf_benchmark/tests/subsetted_reads/PG0496-C.trim.fq.1.1MSubset.gz \
  --max-threads 4

bricoletc avatar Apr 11 '19 14:04 bricoletc

@leoisl has noticed one probable bug and one possible improvement:

  1. [Bug] At https://github.com/iqbal-lab-org/gramtools/blob/master/libgramtools/src/quasimap/coverage/grouped_allele_counts.cpp#L52 : when an allele combination has not been seen before, an entry is created in the site_coverage map, assigning counts to allele 'equivalence classes' for a given variant site. If two different threads (processing two different read mappings to the graph) both try to do this, the program can crash (a simplified sketch of this pattern follows the list below).

Pre-allocating entries with a count of zero does not seem reasonable, since the possible equivalence classes form the power set of the alleles at a variant site: a site with n alleles has 2^n − 1 possible non-empty allele combinations, so a single site with 20 alleles would already need over a million entries.

  2. [Improvement] Scheduling. Right now, each thread is handed a single read at a time. We could give each thread a chunk of reads (e.g. 1,000), which should be more efficient than constantly feeding threads individual reads (see the scheduling sketch below).
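Below is a minimal, self-contained sketch of the race described in item 1, using simplified stand-in types (the real code uses gram::sequence_hash and the grouped allele counts structures; the names here are illustrative only). std::unordered_map::operator[] inserts a new node, and may rehash, when the key is absent, so two threads calling it concurrently on the same map is a data race.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <vector>

// Simplified stand-ins for the real gramtools types (illustrative names only).
using AlleleIds = std::vector<std::uint32_t>;  // one allele 'equivalence class'

struct AlleleIdsHash {
    std::size_t operator()(AlleleIds const &ids) const {
        std::size_t h = 0;
        for (auto id : ids) h = h * 31u + std::hash<std::uint32_t>{}(id);
        return h;
    }
};

using SiteCoverage = std::unordered_map<AlleleIds, std::uint64_t, AlleleIdsHash>;

// Called concurrently by mapping threads. operator[] inserts a new node (and may
// rehash) when the allele combination has not been seen before, so two threads
// doing this on the same site's map at once can corrupt the table and segfault.
void record_grouped_coverage_unsafe(SiteCoverage &site_coverage, AlleleIds const &ids) {
    site_coverage[ids] += 1;  // unsynchronised lookup/insert + increment
}
```

For item 2, a hedged sketch of chunked scheduling with OpenMP; process_read and map_all_reads are placeholders, not the real gramtools loop:

```cpp
#include <cstddef>
#include <string>
#include <vector>

void process_read(std::string const &read);  // stand-in for the per-read mapping work

void map_all_reads(std::vector<std::string> const &reads) {
    // Each thread claims chunks of 1000 reads rather than single reads,
    // reducing scheduling overhead on the OpenMP work queue.
    #pragma omp parallel for schedule(dynamic, 1000)
    for (std::size_t i = 0; i < reads.size(); ++i)
        process_read(reads[i]);
}
```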

bricoletc avatar Apr 15 '19 16:04 bricoletc

In response to the bug noted above (threads racing to create entries in the site_coverage map in grouped_allele_counts.cpp):

I have gotten two segfaults from ~100 separate runs of multi-threaded quasimapping to the Plasmodium PRG.

Here is the gdb stack trace from the core dump of one of them (obtained from inside the Singularity container /nfs/leia/research/iqbal/bletcher/Pf_benchmark/full_Cross/04ff58e_prod.img on yoda):

gdb /usr/local/lib/python3.6/dist-packages/gramtools/bin/gram core.30852

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00005647921b3ffe in std::_Hashtable<std::vector<unsigned int, std::allocator<unsigned int> >, std::pair<std::vector<unsigned int, std::allocator<unsigned int> > const, unsigned long>, std::allocator<std::pair<std::vector<unsigned int, std::allocator<unsigned int> > const, unsigned long> >, std::__detail::_Select1st, std::equal_to<std::vector<unsigned int, std::allocator<unsigned int> > >, gram::sequence_hash<std::vector<unsigned int, std::allocator<unsigned int> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_find_before_node (__code=3864292196, __k=..., __n=2, this=0x2b797278fcd8)
    at /usr/include/c++/7/bits/hashtable.h:1548
1548	      for (__node_type* __p = static_cast<__node_type*>(__prev_p->_M_nxt);;

The core is also at /nfs/leia/research/iqbal/bletcher/Pf_benchmark/full_Cross/

And then:

(gdb) where
#0  0x00005647921b3ffe in std::_Hashtable<std::vector<unsigned int, std::allocator<unsigned int> >, std::pair<std::vector<unsigned int, std::allocator<unsigned int> > const, unsigned long>, std::allocator<std::pair<std::vector<unsigned int, std::allocator<unsigned int> > const, unsigned long> >, std::__detail::_Select1st, std::equal_to<std::vector<unsigned int, std::allocator<unsigned int> > >, gram::sequence_hash<std::vector<unsigned int, std::allocator<unsigned int> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_find_before_node (__code=3864292196, __k=std::vector of length 1, capacity 1 = {...}, __n=2, this=0x2b797278fcd8)
    at /usr/include/c++/7/bits/hashtable.h:1548
#1  std::_Hashtable<std::vector<unsigned int, std::allocator<unsigned int> >, std::pair<std::vector<unsigned int, std::allocator<unsigned int> > const, unsigned long>, std::allocator<std::pair<std::vector<unsigned int, std::allocator<unsigned int> > const, unsigned long> >, std::__detail::_Select1st, std::equal_to<std::vector<unsigned int, std::allocator<unsigned int> > >, gram::sequence_hash<std::vector<unsigned int, std::allocator<unsigned int> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_find_node (__c=3864292196, __key=std::vector of length 1, capacity 1 = {...}, __bkt=2, this=0x2b797278fcd8)
    at /usr/include/c++/7/bits/hashtable.h:642
#2  std::__detail::_Map_base<std::vector<unsigned int, std::allocator<unsigned int> >, std::pair<std::vector<unsigned int, std::allocator<unsigned int> > const, unsigned long>, std::allocator<std::pair<std::vector<unsigned int, std::allocator<unsigned int> > const, unsigned long> >, std::__detail::_Select1st, std::equal_to<std::vector<unsigned int, std::allocator<unsigned int> > >, gram::sequence_hash<std::vector<unsigned int, std::allocator<unsigned int> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true>, true>::operator[] (this=0x2b797278fcd8, __k=std::vector of length 1, capacity 1 = {...})
    at /usr/include/c++/7/bits/hashtable_policy.h:721
#3  0x00005647921b3488 in std::unordered_map<std::vector<unsigned int, std::allocator<unsigned int> >, unsigned long, gram::sequence_hash<std::vector<unsigned int, std::allocator<unsigned int> > >, std::equal_to<std::vector<unsigned int, std::allocator<unsigned int> > >, std::allocator<std::pair<std::vector<unsigned int, std::allocator<unsigned int> > const, unsigned long> > >::operator[] (__k=std::vector of length 1, capacity 1 = {...}, this=<optimized out>) at /usr/include/c++/7/bits/unordered_map.h:973
#4  gram::coverage::record::grouped_allele_counts (coverage=..., search_states=std::__cxx11::list = {...})
    at /tmp/pip-44dd3srs-build/libgramtools/src/quasimap/coverage/grouped_allele_counts.cpp:54
#5  0x00005647921a6be6 in gram::coverage::record::search_states (coverage=..., search_states=..., read_length=@0x2b7973395d08: 100, prg_info=..., random_seed=<optimized out>)
    at /tmp/pip-44dd3srs-build/libgramtools/src/quasimap/coverage/common.cpp:212
#6  0x000056479219bfa4 in gram::quasimap_read (read=std::vector of length 100, capacity 100 = {...}, coverage=..., kmer_index=std::unordered_map with 3376442 elements = {...}, 
    prg_info=..., parameters=...) at /tmp/pip-44dd3srs-build/libgramtools/src/quasimap/quasimap.cpp:211
#7  0x000056479219c34d in handle_reads_buffer () at /tmp/pip-44dd3srs-build/libgramtools/src/quasimap/quasimap.cpp:181
#8  0x00002b792f4f995e in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#9  0x00002b792f9316db in start_thread (arg=0x2b7973396700) at pthread_create.c:463
#10 0x00002b792fc6a88f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

The crash in ::operator[] inside gram::coverage::record::grouped_allele_counts looks exactly like the bug @leoisl found.

bricoletc avatar May 08 '19 09:05 bricoletc

Here is what happens under https://github.com/bricoletc/gramtools/commit/e66865ebdf6a14b56ac4ed4e6759a6509ddc9835: the #pragma omp atomic directive is changed to #pragma omp critical in libgramtools/src/quasimap/coverage/grouped_allele_counts.cpp, so threads can no longer update grouped coverage simultaneously (a sketch of the difference follows).
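For reference, a minimal sketch of what the change amounts to, reusing the simplified SiteCoverage/AlleleIds types from the earlier sketch (so again not the real gramtools code): omp atomic only makes the scalar increment atomic, whereas omp critical serialises the whole statement, including any insertion done by operator[].

```cpp
// Before (simplified): atomic covers only the += on the counter; the
// lookup/insert performed by operator[] is still unsynchronised.
void record_atomic(SiteCoverage &site_coverage, AlleleIds const &ids) {
    #pragma omp atomic
    site_coverage[ids] += 1;
}

// After (simplified): the critical section serialises the whole update,
// including any node insertion or rehash inside the map.
void record_critical(SiteCoverage &site_coverage, AlleleIds const &ids) {
    #pragma omp critical
    site_coverage[ids] += 1;
}
```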

[Figure: Per_Sample_CPU_Time_bars]

Some samples take much more time (803, PG0459-C); this could be followed up. Sample PG0463-C is where a segfault occurred prior to the change, hence its low CPU run time.

[Figure: Time_diff]

There is an overall increase of ~5,000 CPU seconds, which amounts to roughly 1 hour of extra run time per quasimapping run, compared to ~15 hours overall. I have not seen segfaults since the change, so this might be a good solution. We could do better by locking on each site's coverage structure individually; right now any coverage write is locked across all sites (see the sketch below).
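A hedged sketch of that finer-grained locking (a hypothetical structure, not the current gramtools layout): give each variant site its own coverage map and mutex, so only writes to the same site serialise.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <memory>
#include <mutex>
#include <unordered_map>
#include <vector>

// Same simplified stand-ins as in the earlier sketches (illustrative names only).
using AlleleIds = std::vector<std::uint32_t>;

struct AlleleIdsHash {
    std::size_t operator()(AlleleIds const &ids) const {
        std::size_t h = 0;
        for (auto id : ids) h = h * 31u + std::hash<std::uint32_t>{}(id);
        return h;
    }
};

using SiteCoverage = std::unordered_map<AlleleIds, std::uint64_t, AlleleIdsHash>;

// One coverage map and one mutex per variant site: writes to different sites
// never contend, and only writes to the same site serialise.
struct PerSiteGroupedCoverage {
    std::vector<SiteCoverage> sites;
    std::vector<std::unique_ptr<std::mutex>> locks;

    explicit PerSiteGroupedCoverage(std::size_t n_sites) : sites(n_sites) {
        locks.reserve(n_sites);
        for (std::size_t i = 0; i < n_sites; ++i)
            locks.push_back(std::make_unique<std::mutex>());
    }

    void record(std::size_t site_index, AlleleIds const &ids) {
        std::lock_guard<std::mutex> guard(*locks[site_index]);
        sites[site_index][ids] += 1;  // insert + increment under this site's lock
    }
};
```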

bricoletc avatar May 10 '19 10:05 bricoletc

Hello,

Wow, pretty nice and detailed benchmarking! I guess this should be addressed properly in the future, but for now the important thing is that your experiments aren't blocked and the slowdown is small, so the tool remains usable. We should discuss quasimapping and how to improve mapping speed, together with this and possibly other things like scheduling policies, Bloom filter filtering, etc.

leoisl avatar May 10 '19 12:05 leoisl