
CUDA out of memory for FASTA entries > 1022 residues

Magnushhoie opened this issue 1 year ago

The error occurs on GPU and on the Biolib server (https://dtu.biolib.com/NetSurfP-3/) for FASTA files containing sequences longer than the original ESM-1b limit of ~1024 residues. It is reproducible on Biolib with as few as 60 sequences of 1900 residues each. At the same time, a massive input of 4900 sequences of 1000 residues works fine.

Current work-around: if you get an out-of-memory error, remove all sequences above 1022 residues. Sequences above 1022 residues can be submitted up to 40 at a time. The bug occurs due to an unresolved PyTorch GPU memory handling issue. Alternatively, use the DTU Healthtech server, which runs on CPU only: https://services.healthtech.dtu.dk/service.php?NetSurfP-3.0

Input file tests (uploaded here: https://github.com/Eryk96/NetSurfP-3.0/tree/main/healthtech/input_tests)

  • 4900 seqs × 1000 residues: no problem (4.9M residues total)
  • 40 seqs × 1900 residues: no problem (76k residues total)
  • 60 seqs × 1900 residues: `RuntimeError: CUDA out of memory. Tried to allocate 1.95 GiB (GPU 0; 14.56 GiB total capacity; 10.13 GiB already allocated; 1.83 GiB free; 11.79 GiB reserved in total by PyTorch)`

I think the bug is related to this code: https://github.com/Eryk96/NetSurfP-3.0/blob/main/nsp3/nsp3/embeddings/esm1b.py#L76

We overcome ESM-1b's limit of 1024 residues per sequence by splitting longer sequences into chunks, predicting each chunk, then concatenating the predictions back to the original sequence length. My guess is that some CUDA object remains on the GPU between batches, eventually leading to out-of-memory errors. However, I cannot see that ANYTHING remains in the code.
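The chunk-and-concatenate scheme described above can be sketched as follows (a simplified illustration, not the actual nsp3 code; `predict_chunk` stands in for the per-chunk ESM-1b forward pass, and the suggestion to free tensors between chunks is a hypothesis about the leak, not a confirmed fix):

```python
def chunk_indices(seq_len, max_len=1022):
    """Split a sequence of seq_len residues into (start, end) windows
    of at most max_len residues, covering the sequence exactly once."""
    return [(i, min(i + max_len, seq_len)) for i in range(0, seq_len, max_len)]

def predict_long_sequence(seq, predict_chunk, max_len=1022):
    """Run predict_chunk on each window and concatenate per-residue outputs.

    If per-chunk GPU tensors accumulate between iterations, explicitly
    releasing them here (e.g. `del out; torch.cuda.empty_cache()`) might
    avoid the suspected memory build-up.
    """
    outputs = []
    for start, end in chunk_indices(len(seq), max_len):
        out = predict_chunk(seq[start:end])
        outputs.extend(out)
    return outputs
```

A 1900-residue sequence would thus be predicted as two windows, (0, 1022) and (1022, 1900), and the results joined back into a single 1900-element prediction.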

Magnushhoie avatar Aug 30 '22 09:08 Magnushhoie