
multithreads error in `sourmash sketch` - from numpy!?

Open · shenwei356 opened this issue 2 years ago · 3 comments

I'd like to compute and index MinHash sketches on the GTDB r202 representative genomes.

The sketching step (v4.2.1) is parallelized with 16 or 40 jobs on a 160-core machine, but some processes stopped unexpectedly with the errors below; with 8 jobs there was no problem.

OpenBLAS blas_thread_init: pthread_create failed for thread 1 of 128: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 4124051 max
OpenBLAS blas_thread_init: pthread_create failed for thread 2 of 128: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 4124051 max
OpenBLAS blas_thread_init: pthread_create failed for thread 3 of 128: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 4124051 max
....

OpenBLAS blas_thread_init: pthread_create failed for thread 125 of 128: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 4124051 max
OpenBLAS blas_thread_init: pthread_create failed for thread 126 of 128: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 4124051 max
OpenBLAS blas_thread_init: pthread_create failed for thread 127 of 128: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 4124051 max
...

OpenBLAS blas_thread_init: pthread_create failed for thread 18 of 128: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 4124051 max
Traceback (most recent call last):
  File "/home/shenwei/app/miniconda3/envs/kmcp/bin/sourmash", line 7, in <module>
    from sourmash.__main__ import main
  File "/home/shenwei/app/miniconda3/envs/kmcp/lib/python3.7/site-packages/sourmash/__init__.py", line 79, in <module>
    from .sbtmh import load_sbt_index as load_sbt_index_private
  File "/home/shenwei/app/miniconda3/envs/kmcp/lib/python3.7/site-packages/sourmash/sbtmh.py", line 4, in <module>
    from .sbt import Leaf, SBT, GraphFactory
  File "/home/shenwei/app/miniconda3/envs/kmcp/lib/python3.7/site-packages/sourmash/sbt.py", line 23, in <module>
    from .index import Index, IndexSearchResult, CollectionManifest
  File "/home/shenwei/app/miniconda3/envs/kmcp/lib/python3.7/site-packages/sourmash/index.py", line 43, in <module>
    from .search import make_jaccard_search_query, make_gather_query
  File "/home/shenwei/app/miniconda3/envs/kmcp/lib/python3.7/site-packages/sourmash/search.py", line 6, in <module>
    import numpy as np
  File "/home/shenwei/app/miniconda3/envs/kmcp/lib/python3.7/site-packages/numpy/__init__.py", line 145, in <module>
    from . import core
  File "/home/shenwei/app/miniconda3/envs/kmcp/lib/python3.7/site-packages/numpy/core/__init__.py", line 22, in <module>
    from . import multiarray
  File "/home/shenwei/app/miniconda3/envs/kmcp/lib/python3.7/site-packages/numpy/core/multiarray.py", line 12, in <module>
    from . import overrides
  File "/home/shenwei/app/miniconda3/envs/kmcp/lib/python3.7/site-packages/numpy/core/overrides.py", line 7, in <module>
    from numpy.core._multiarray_umath import (
KeyboardInterrupt
OpenBLAS blas_thread_init: pthread_create failed for thread 82 of 128: Resource temporarily unavailable
...
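The RLIMIT_NPROC lines above point at the cause: on Linux this limit counts threads as well as processes, so 16-40 concurrent jobs each trying to spawn 128 OpenBLAS threads can exhaust the 4096 limit. A diagnostic sketch (not from the original thread) to inspect the current limit:

```shell
# RLIMIT_NPROC, as seen by the current shell; this is the "4096 current"
# value reported in the blas_thread_init errors above.
ulimit -u
```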

Note that the KeyboardInterrupt may not have been triggered by me, since it occurred very early.

I notice that each sourmash process has a CPU usage of up to 300-800%.
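One way to confirm that the extra CPU usage comes from threads inside each process is the NLWP column of ps; a sketch, shown here against the current shell (substitute the sourmash PIDs):

```shell
# NLWP is the number of kernel threads in a process; a sourmash job
# driving a multithreaded OpenBLAS would show a value well above 1.
ps -o pid,nlwp,cmd -p $$
```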

But @luizirber said:

it is single-threaded; the parallel feature in the Rust core is not activated by default. Even when the parallel feature is activated, it is not very parallel, since it only adds each sequence in parallel to different sketches.

Command

# uncompress
mkdir -p gtdb202
tar -zxvf gtdb_genomes_reps_r202.tar.gz -C gtdb202

# rename
brename -R -p '^(\w{3}_\d{9}\.\d+).+' -r '$1.fa.gz' gtdb202    

ls gtdb202/ | head -n 2
# GCA_000007325.1.fa.gz
# GCA_000008085.1.fa.gz

seqs=gtdb202
db=gtdb
k=31
threads=8
scale=1000

dbSOURMASHtmp=gtdb-sourmash-k$k-D$scale
dbSOURMASH=gtdb-sourmash-k$k-D$scale/_db.sbt.json
dbKMCPtmp=gtdb-kmcp-k$k-D$scale
dbKMCP=gtdb-kmcp-k$k-D$scale.db


# --------------- sourmash ---------------    
mkdir -p $dbSOURMASHtmp
indexSourmash() {
    find $seqs -name "*.fa.gz" \
        | rush -j $threads -v d=$dbSOURMASHtmp -v s=$scale -v k=$k \
            'sourmash -q sketch dna -p k={k},scaled={s} {} -o {d}/{%}.sig'     
    sourmash -q index $dbSOURMASH --from-file <(find $dbSOURMASHtmp -name "*.sig")
}

{ time indexSourmash ; } 2> $dbSOURMASH.time

Where

  • brename is for batch renaming of files.
  • rush is for executing jobs in parallel.

shenwei356 avatar Aug 04 '21 05:08 shenwei356

ah, I was assuming it was the parallel feature that was causing issues, but it is numpy trying to parallelize... something. Of note, this happens during sourmash index, not sourmash sketch.

Can you check if setting these solve the problem?

export MKL_NUM_THREADS=1
export NUMEXPR_NUM_THREADS=1
export OMP_NUM_THREADS=1

(And even though the Rust parallel feature is not enabled, you can also set RAYON_NUM_THREADS=1 just to be safe.)
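Put together, and adding OPENBLAS_NUM_THREADS (not mentioned above, but it is the variable a pthread OpenBLAS build honors directly, and the errors here come from OpenBLAS), the setup before launching the jobs might look like:

```shell
# Cap every common threading backend to one thread per process.
export OPENBLAS_NUM_THREADS=1   # pthread OpenBLAS (source of the errors above)
export MKL_NUM_THREADS=1        # Intel MKL, if numpy links it instead
export NUMEXPR_NUM_THREADS=1    # numexpr
export OMP_NUM_THREADS=1        # OpenMP-based builds
export RAYON_NUM_THREADS=1      # sourmash's Rust core, if parallel is enabled
```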

luizirber avatar Aug 04 '21 15:08 luizirber

Sorry, I did not make it clear, the error reported above is from running sourmash sketch, not index.

The environment variables work, but they're kind of tricky for ordinary users. With them set, each process has a CPU usage of about 30%.

shenwei356 avatar Aug 05 '21 00:08 shenwei356

I agree, but I'm also not sure how to fix it on the Python side, since it is coming from one of the dependencies... There is probably some way to tell numpy to limit how many threads it uses, but I don't know it off the top of my head
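One way to do it from the Python side (a sketch, not sourmash's actual code) is to set the thread caps in the environment before numpy's first import, since the BLAS thread pool is sized when the library initializes:

```python
import os

# These must be set before numpy (and, through it, OpenBLAS/MKL) is
# first imported; once the BLAS library has initialized, changing the
# variables has no effect on the already-created thread pool.
for var in ("OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS",
            "NUMEXPR_NUM_THREADS", "OMP_NUM_THREADS"):
    os.environ[var] = "1"

import numpy as np  # OpenBLAS now starts with a single thread
```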

luizirber avatar Aug 10 '21 04:08 luizirber