Increase the number of threads for jackhmmer
Hi,
I have just realized that I was, to a degree, barking up the wrong tree.
AlphaFold uses the system-default jackhmmer, which becomes part of the Docker image during the build stage. In docker/Dockerfile we see this:
RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y \
      build-essential \
      cmake \
      cuda-command-line-tools-${CUDA/./-} \
      git \
      hmmer \
      kalign \
      tzdata \
      wget \
    && rm -rf /var/lib/apt/lists/*
jackhmmer is a component of the hmmer package, which is maintained by the Eddy lab. They decide which version is best suited for distribution.
Nevertheless, I am going to leave this issue/request open here. Perhaps one day AlphaFold will have its own version of jackhmmer that is better suited to high-IO SSDs.
Thank you,
Petr Leiman
Hi, it's me again. The fact that jackhmmer uses so few threads does not make sense at all. I have just started two jobs in parallel: a 660-residue sequence and a 1400-residue sequence. jackhmmer used 4-5 threads for the shorter sequence and 8-9 for the longer one. The longer sequence finished in about 70%, if not 50%, of the time the shorter sequence took. I bet the difference would be even more dramatic for a 300-residue sequence, because jackhmmer would be using only 2 threads in that case. This does not make sense...
You can change the number of threads used by jackhmmer by increasing the n_cpu value here: https://github.com/deepmind/alphafold/blob/c42a96f3a5b6179484b5f0b936e3dd0c9b08fde1/alphafold/data/tools/jackhmmer.py#L38 . It is 8 by default; you can increase this number.
Disclaimer: I am just a user
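For readers following along, here is a minimal sketch of how an n_cpu value like the one mentioned above could end up on the jackhmmer command line. The helper name build_jackhmmer_cmd and its parameter names are hypothetical and are not AlphaFold's actual code; only the jackhmmer options themselves (--cpu, -N, -A, -o, --noali) are real HMMER flags. Note that --cpu is only a request: as reported in this thread, the number of threads actually kept busy can be lower.

```python
from typing import List


def build_jackhmmer_cmd(
    input_fasta: str,
    database_path: str,
    output_sto: str,
    binary_path: str = "jackhmmer",
    n_cpu: int = 8,
    n_iter: int = 1,
) -> List[str]:
    """Return an argument list for one jackhmmer search.

    Hypothetical helper for illustration; not AlphaFold's implementation.
    """
    if n_cpu < 1:
        raise ValueError("n_cpu must be at least 1")
    return [
        binary_path,
        "-o", "/dev/null",    # discard the human-readable report
        "-A", output_sto,     # save the alignment of hits in Stockholm format
        "--noali",            # omit alignments from the main report
        "--cpu", str(n_cpu),  # number of worker threads requested
        "-N", str(n_iter),    # maximum number of search iterations
        input_fasta,          # query sequence file
        database_path,        # sequence database to search
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so no HMMER install is needed.
    print(" ".join(build_jackhmmer_cmd("query.fasta", "db.fasta", "out.sto", n_cpu=16)))
```

The resulting list can be passed to subprocess.run on a machine where jackhmmer is installed.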
Hi, please read my second message, where the problem is described in greater detail. Of course I am telling jackhmmer to use more than the default 8 threads. So far, I have seen jackhmmer use more than 8 threads only a couple of times, on sequences in the 4000+ residue range. On shorter sequences, jackhmmer uses 2-3 threads at most. Petr
Hi grandrea, sorry for my brash response. You probably cannot see my original post, which has been replaced with what now appears in the thread. My jackhmmer is compiled with n_cpu: int = 8. This parameter is largely ignored if it is greater than 8. jackhmmer will follow the instruction if the number is lower, but it will not use more than 2-3 threads on medium-length sequences. Petr
To the best of my knowledge, jackhmmer scans all of the sequences in the database, which means that more than 100 GB is read from storage for a single search. I assume that is why searches with shorter sequences use only 2-3 threads: the bottleneck in that case is storage IO. With longer queries, the application spends more CPU time calculating alignments, so storage IO holds the process back much less.
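The reasoning above can be sketched as a toy model. It is an assumption-laden illustration, not a measurement: it treats a search as finishing when both the database read and the alignment work are done, assumes the two overlap perfectly, and uses made-up throughput and CPU-time figures. All function names are hypothetical.

```python
import math


def io_seconds(db_gb: float, read_gb_per_s: float) -> float:
    """Time needed just to stream the whole database from storage."""
    return db_gb / read_gb_per_s


def search_seconds(
    db_gb: float, read_gb_per_s: float, cpu_seconds: float, n_threads: int
) -> float:
    """Toy estimate: the slower of the IO time and the parallel CPU time."""
    return max(io_seconds(db_gb, read_gb_per_s), cpu_seconds / n_threads)


def useful_threads(db_gb: float, read_gb_per_s: float, cpu_seconds: float) -> int:
    """Thread count beyond which the toy model predicts no further speedup."""
    return max(1, math.ceil(cpu_seconds / io_seconds(db_gb, read_gb_per_s)))


if __name__ == "__main__":
    # Hypothetical figures: 100 GB database, 0.5 GB/s sustained read rate.
    # cpu_seconds is the assumed single-thread alignment time for each query.
    for label, cpu in [("short query", 500.0), ("long query", 1800.0)]:
        print(label, useful_threads(100.0, 0.5, cpu))
```

With these made-up numbers the model saturates at 3 threads for the short query and 9 for the long one, which is the qualitative pattern described in this thread: a query that needs less alignment work per byte read leaves extra threads waiting on storage.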