Jackhmmer / stockholm.c "No space left on device"
Related to https://github.com/deepmind/alphafold/issues/280.
We are getting a disk write error after the Jackhmmer query completes:
I0613 00:13:04.128991 139887425492800 utils.py:36] Started Jackhmmer (uniref90.fasta) query
I0613 01:04:43.969136 139887425492800 utils.py:40] Finished Jackhmmer (uniref90.fasta) query in 3099.840 seconds
Traceback (most recent call last):
File "/app/alphafold/run_alphafold.py", line 432, in <module>
app.run(main)
File "/opt/conda/lib/python3.8/site-packages/absl/app.py", line 312, in run
_run_main(main, args)
File "/opt/conda/lib/python3.8/site-packages/absl/app.py", line 258, in _run_main
sys.exit(main(argv))
File "/app/alphafold/run_alphafold.py", line 408, in main
predict_structure(
File "/app/alphafold/run_alphafold.py", line 172, in predict_structure
feature_dict = data_pipeline.process(
File "/app/alphafold/alphafold/data/pipeline.py", line 163, in process
jackhmmer_uniref90_result = run_msa_tool(
File "/app/alphafold/alphafold/data/pipeline.py", line 94, in run_msa_tool
result = msa_runner.query(input_fasta_path, max_sto_sequences)[0] # pytype: disable=wrong-arg-count
File "/app/alphafold/alphafold/data/tools/jackhmmer.py", line 170, in query
return self.query_multiple([input_fasta_path], max_sequences)[0]
File "/app/alphafold/alphafold/data/tools/jackhmmer.py", line 181, in query_multiple
single_chunk_results.append([self._query_chunk(
File "/app/alphafold/alphafold/data/tools/jackhmmer.py", line 142, in _query_chunk
raise RuntimeError(
RuntimeError: Jackhmmer failed
stderr:
Fatal exception (source file esl_msafile_stockholm.c, line 1278):
stockholm msa write failed
system error: No space left on device
After thousands of successful AF2 runs on our infrastructure, this has only occurred with the following protein input (2,273 AA):
>sequence_0
MGFVRQIQLLLWKNWTLRKRQKIRFVVELVWPLSLFLVLIWLRNANPLYSHHECHFPNKAMPSAGMLPWLQGIFCNVNNPCFQSPTPGESPGIVSNYNNSILARVYRDFQELLMNAPESQHLGRIWTELHILSQFMDTLRTHPERIAGRGIRIRDILKDEETLTLFLIKNIGLSDSVVYLLINSQVRPEQFAHGVPDLALKDIACSEALLERFIIFSQRRGAKTVRYALCSLSQGTLQWIEDTLYANVDFFKLFRVLPTLLDSRSQGINLRSWGGILSDMSPRIQEFIHRPSMQDLLWVTRPLMQNGGPETFTKLMGILSDLLCGYPEGGGSRVLSFNWYEDNNYKAFLGIDSTRKDPIYSYDRRTTSFCNALIQSLESNPLTKIAWRAAKPLLMGKILYTPDSPAARRILKNANSTFEELEHVRKLVKAWEEVGPQIWYFFDNSTQMNMIRDTLGNPTVKDFLNRQLGEEGITAEAILNFLYKGPRESQADDMANFDWRDIFNITDRTLRLVNQYLECLVLDKFESYNDETQLTQRALSLLEENMFWAGVVFPDMYPWTSSLPPHVKYKIRMDIDVVEKTNKIKDRYWDSGPRADPVEDFRYIWGGFAYLQDMVEQGITRSQVQAEAPVGIYLQQMPYPCFVDDSFMIILNRCFPIFMVLAWIYSVSMTVKSIVLEKELRLKETLKNQGVSNAVIWCTWFLDSFSIMSMSIFLLTIFIMHGRILHYSDPFILFLFLLAFSTATIMLCFLLSTFFSKASLAAACSGVIYFTLYLPHILCFAWQDRMTAELKKAVSLLSPVAFGFGTEYLVRFEEQGLGLQWSNIGNSPTEGDEFSFLLSMQMMLLDAAVYGLLAWYLDQVFPGDYGTPLPWYFLLQESYWLGGEGCSTREERALEKTEPLTEETEDPEHPEGIHDSFFEREHPGWVPGVCVKNLVKIFEPCGRPAVDRLNITFYENQITAFLGHNGAGKTTTLSILTGLLPPTSGTVLVGGRDIETSLDAVRQSLGMCPQHNILFHHLTVAEHMLFYAQLKGKSQEEAQLEMEAMLEDTGLHHKRNEEAQDLSGGMQRKLSVAIAFVGDAKVVILDEPTSGVDPYSRRSIWDLLLKYRSGRTIIMSTHHMDEADLLGDRIAIIAQGRLYCSGTPLFLKNCFGTGLYLTLVRKMKNIQSQRKGSEGTCSCSSKGFSTTCPAHVDDLTPEQVLDGDVNELMDVVLHHVPEAKLVECIGQELIFLLPNKNFKHRAYASLFRELEETLADLGLSSFGISDTPLEEIFLKVTEDSDSGPLFAGGAQQKRENVNPRHPCLGPREKAGQTPQDSNVCSPGAPAAHPEGQPPPEPECPGPQLNTGTQLVLQHVQALLVKRFQHTIRSHKDFLAQIVLPATFVFLALMLSIVIPPFGEYPALTLHPWIYGQQYTFFSMDEPGSEQFTVLADVLLNKPGFGNRCLKEGWLPEYPCGNSTPWKTPSVSPNITQLFQKQKWTQVNPSPSCRCSTREKLTMLPECPEGAGGLPPPQRTQRSTEILQDLTDRNISDFLVKTYPALIRSSLKSKFWVNEQRYGGISIGGKLPVVPITGEALVGFLSDLGRIMNVSGGPITREASKEIPDFLKHLETEDNIKVWFNNKGWHALVSFLNVAHNAILRASLPKDRSPEEYGITVISQPLNLTKEQLSEITVLTTSVDAVVAICVIFSMSFVPASFVLYLIQERVNKSKHLQFISGVSPTTYWVTNFLWDIMNYSVSAGLVVGIFIGFQKKAYTSPENLPALVALLLLYGWAVIPMMYPASFLFDVPSTAYVALSCANLFIGINSSAITFILELFENNRTLLRFNAVLRKLLIVFPHFCLGRGLIDLALSQAVTDVYARFGEEHSANPFHWDLIGKNLFAMVVEGVVYFLLTLLVQRHFFLSQWIAEPTKEPIVDEDDDVAEERQRIITGGNKTDILRLHELTKIYPGTSSPAVDRLCVGVRPGECFGLLGVNGAGKTTTFKMLTGDTTVTSGDATVAGKSILTNISEVHQNMGYCPQFDAIDELLTGREHLYLYARLRGVPAEEIEKVANWSIKSLGLTVYADCLAGTYSGGNKRKLSTAIALIGCPPLVLLDEPTTGMDPQARRMLWNVIVSIIREGRAVVLTSHSMEECEALCTRLAIMVKGAFRCMGTIQHLKSKFGDGYIVTMKIKSPKDDLLPDLNPVEQFFQGNFPGSVQRERHYNMLQFQVSSSSLARIFQLLLSHKDSLLIEEYSVTQTTLDQVFVNFAKQQTESHDLPLHPRAAGASRQAQD
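For reference, the reported length can be double-checked with a quick script (a minimal sketch; query.fasta is a placeholder for the actual input file):
# Count residues in the query FASTA; query.fasta is a placeholder path.
from pathlib import Path

lines = Path("query.fasta").read_text().splitlines()
seq = "".join(line.strip() for line in lines if line and not line.startswith(">"))
print(len(seq))  # reported above as 2273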
According to the Docker container, we have 86G of disk available at runtime for the /tmp directory:
ubuntu@ip-0A000007:~$ docker exec -it 99eaf4da5086 /bin/bash
I have no name!@99eaf4da5086:/mnt/pulsar/files/staging/6489372/working$ df -h /tmp
Filesystem      Size  Used Avail Use% Mounted on
overlay         124G   39G   86G  32% /
I have no name!@99eaf4da5086:/mnt/pulsar/files/staging/6489372/working$ exit
exit
ubuntu@ip-0A000007:~$ df -h /mnt/scratch/
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb1        63G   53M   60G   1% /mnt
Furthermore, I don't see any sign of the disk filling up when I run watch df -h on the host. There is plenty of disk available on the root partition, where both /var/lib/docker and /tmp are located.
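For completeness, watch df -h only refreshes every couple of seconds by default, so a very fast fill of /tmp could come and go between samples. A finer-grained poll (a rough sketch using Python's shutil.disk_usage; the 0.1 s interval is arbitrary) might catch such a spike:
# Poll free space on /tmp at sub-second intervals to catch a transient fill
# that a 2-second "watch df -h" refresh could miss.
import shutil
import time

min_free_gb = float("inf")  # lowest free space observed so far
while True:
    free_gb = shutil.disk_usage("/tmp").free / 1e9
    min_free_gb = min(min_free_gb, free_gb)
    print(f"free: {free_gb:7.1f} GB   min seen: {min_free_gb:7.1f} GB", end="\r")
    time.sleep(0.1)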
It is possible that this could be a bug in AlphaFold. Any help would be greatly appreciated!
I am getting this as well and have been unable to resolve it.
I have seen that for some protein sequences (zinc fingers) the raw MSA size (before truncation/filtering) can exceed 100 GB. The "no space left on device" error probably occurs when the sequences from an HHblits/Jackhmmer job are written from RAM to the temporary file. The disk usage probably grows very quickly (depending on the RAM-to-disk transfer rate) and may be difficult to catch with watch df. If the MSA size is the cause, the RAM usage of the HHblits/Jackhmmer process should already be very large (up to 100 GB).
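If the temporary Stockholm file is indeed what fills the disk, one possible workaround (a sketch only, assuming the MSA tools create their scratch files through Python's tempfile module, which honors the TMPDIR environment variable) is to point the temporary directory at a volume large enough to hold the raw MSA before the pipeline starts:
# Redirect Python's tempfile module (and anything else that respects TMPDIR)
# to a larger volume before the data pipeline runs.
# "/big/scratch" is a placeholder; the directory must exist and be writable
# inside the container.
import os
import tempfile

os.environ["TMPDIR"] = "/big/scratch"
tempfile.tempdir = None           # make tempfile re-read TMPDIR
print(tempfile.gettempdir())      # should now report /big/scratch
When running the Docker image, passing -e TMPDIR=/big/scratch together with a -v bind mount of a suitably large host directory should have the same effect.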