Increase the number of threads for jackhmmer

Open Phage-structure-geek opened this issue 3 years ago • 5 comments

Hi,

I have just realized that I was, to a degree, barking up the wrong tree.

AlphaFold uses the system-default jackhmmer that becomes part of the Docker image during the build stage. In docker/Dockerfile we see this:

    RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y \
          build-essential \
          cmake \
          cuda-command-line-tools-${CUDA/./-} \
          git \
          hmmer \
          kalign \
          tzdata \
          wget \
        && rm -rf /var/lib/apt/lists/*

jackhmmer is a component of the hmmer package, which is maintained by the Eddy lab. They decide which version is best suited for distribution.

Nevertheless, I am going to leave this issue/request open. Perhaps one day AlphaFold will have its own version of jackhmmer better suited to high-I/O SSDs.

Thank you,

Petr Leiman

Phage-structure-geek avatar Apr 24 '22 21:04 Phage-structure-geek
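
Since the jackhmmer binary comes from whatever hmmer package apt installs at build time, it can help to confirm which release actually ended up in the image. The following is a minimal sketch, not part of AlphaFold: the function names are hypothetical, and it assumes the help banner printed by `jackhmmer -h` contains a line of the form "# HMMER 3.3.2 (Nov 2020); http://hmmer.org/".

```python
import re
import shutil
import subprocess
from typing import Optional

# Assumed banner shape: "# HMMER <version> (<date>); http://hmmer.org/".
_BANNER = re.compile(r"^#\s*HMMER\s+(\S+)", re.MULTILINE)


def parse_hmmer_version(help_text: str) -> Optional[str]:
    """Return the token that follows 'HMMER' in the help banner, or None."""
    match = _BANNER.search(help_text)
    return match.group(1) if match else None


def installed_jackhmmer_version(binary: str = "jackhmmer") -> Optional[str]:
    """Run `<binary> -h` and parse its banner; None if the binary is not on PATH."""
    path = shutil.which(binary)
    if path is None:
        return None
    result = subprocess.run(
        [path, "-h"], capture_output=True, text=True, check=False
    )
    return parse_hmmer_version(result.stdout)
```

Calling `installed_jackhmmer_version()` inside the container returns the packaged version string, or None when no jackhmmer is found on PATH.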

Hi, it's me again. The fact that jackhmmer uses so few threads does not make sense at all. I have just started two jobs in parallel: a 660-residue sequence and a 1400-residue sequence. jackhmmer used 4-5 threads for the shorter sequence and 8-9 for the longer one. The longer sequence finished in about 70%, if not 50%, of the time the shorter one took. I bet the difference would be even more dramatic for a 300-residue sequence, because jackhmmer would be using only 2 threads in that case. This does not make sense...

Phage-structure-geek avatar May 03 '22 02:05 Phage-structure-geek

You can change the number of threads used by jackhmmer by increasing the number of cores here: https://github.com/deepmind/alphafold/blob/c42a96f3a5b6179484b5f0b936e3dd0c9b08fde1/alphafold/data/tools/jackhmmer.py#L38 . The default is 8, and you can raise it.

Disclaimer: I am just a user

grandrea avatar May 25 '22 14:05 grandrea
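
To make the n_cpu suggestion concrete: the value ends up as jackhmmer's `--cpu` option, which sets the number of worker threads requested. Below is a minimal sketch of assembling such a command line. The function name `build_jackhmmer_cmd`, its defaults other than n_cpu = 8, and the file paths are hypothetical and do not reproduce AlphaFold's wrapper; the flags themselves (`-o`, `-A`, `--noali`, `--cpu`, `-N`) are standard jackhmmer options.

```python
from typing import List


def build_jackhmmer_cmd(
    query_fasta: str,
    database_fasta: str,
    output_sto: str,
    n_cpu: int = 8,
    n_iter: int = 1,
    binary: str = "jackhmmer",
) -> List[str]:
    """Assemble a jackhmmer command that requests n_cpu worker threads."""
    if n_cpu < 1:
        raise ValueError("n_cpu must be at least 1")
    return [
        binary,
        "-o", "/dev/null",      # discard the human-readable report
        "-A", output_sto,       # save the final alignment in Stockholm format
        "--noali",              # omit alignments from the report
        "--cpu", str(n_cpu),    # number of worker threads requested
        "-N", str(n_iter),      # maximum number of search iterations
        query_fasta,
        database_fasta,
    ]
```

For example, `build_jackhmmer_cmd("query.fasta", "uniref90.fasta", "out.sto", n_cpu=16)` yields a list that can be passed to `subprocess.run`. Note that `--cpu` is a request: as the rest of this thread reports, the number of threads observed to be busy can be lower.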

Hi, please read my second message, where the problem is described in greater detail. Of course, I am telling jackhmmer to use more than the default 8 threads. So far, I have seen jackhmmer use more than 8 threads only a couple of times, on sequences in the 4000+ residue range. On shorter sequences, jackhmmer uses 2-3 threads at most. Petr

Phage-structure-geek avatar May 25 '22 15:05 Phage-structure-geek

Hi grandrea, sorry for my brash response. You probably cannot see my original post, which has been replaced with what we see in the thread now. My jackhmmer is compiled with n_cpu: int = 8. This parameter is largely ignored if it is greater than 8. jackhmmer will follow the instruction if the number is lower, but it will not use more than 2-3 threads on medium-length sequences. Petr

Phage-structure-geek avatar Jun 20 '22 04:06 Phage-structure-geek

To the best of my knowledge, jackhmmer scans all of the sequences in the database, which means that more than 100 GB is read from storage for a single search. I assume that is why searches with shorter sequences use only 2-3 threads: the bottleneck in that case is storage I/O. With longer queries, the application spends more CPU time calculating alignments, so storage I/O holds the process back less.

yamule avatar Jun 20 '22 10:06 yamule
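
The I/O argument above can be sketched as a back-of-the-envelope model. This is a toy estimate under stated assumptions, not a measurement of jackhmmer: it assumes the whole database is read once at a fixed throughput, and that the compute cost grows in proportion to query length times database size. The function name and the cost constant are hypothetical placeholders.

```python
def estimate_busy_threads(
    db_bytes: float,
    read_bytes_per_s: float,
    query_len: int,
    cpu_s_per_residue_byte: float,
    n_cpu: int,
) -> float:
    """Average number of busy threads in a simple scan-and-align model."""
    # Time needed just to stream the database from storage.
    io_seconds = db_bytes / read_bytes_per_s
    # Total single-thread compute, assumed proportional to query length.
    cpu_seconds = db_bytes * query_len * cpu_s_per_residue_byte
    # Wall time is bounded by the slower of I/O and parallel compute.
    wall_seconds = max(io_seconds, cpu_seconds / n_cpu)
    return cpu_seconds / wall_seconds


if __name__ == "__main__":
    # Placeholder numbers for illustration only: 100 GB database,
    # 500 MB/s reads, and an invented per-residue, per-byte CPU cost.
    for length in (300, 660, 1400, 4000):
        busy = estimate_busy_threads(100e9, 500e6, length, 1e-11, n_cpu=8)
        print(f"{length:>5} residues: about {busy:.1f} busy threads")
```

In this model, short queries leave most threads waiting on the disk, while long queries become compute-bound and can occupy all n_cpu threads. Faster storage raises the number of busy threads for short queries in the same way.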