ColabFold fails when using directory with multiple .a3m as input (works for single input)

Open apcamargo opened this issue 2 years ago • 1 comments

I'm using ColabFold via localcolabfold on a g4dn.12xlarge AWS EC2 instance. I have a directory with thousands of multiple sequence alignments (MSAs) in the .a3m format and used it as input for colabfold_batch. After a thousand or so MSAs were processed, I started to get the error below. The strange thing is that I don't get any errors if I process the MSAs individually, in multiple commands. Since colabfold_batch processes the MSAs sequentially, I don't think this is actually a memory issue.

colabfold_batch --zip --num-recycle 2 --num-models 1 --max-seq 512 --max-extra-seq 1024 MSAs ColabFold_prediction
2023-10-03 17:53:09,571 Running colabfold 1.5.2 (29cbf90390086c336afbe7c420eef9a3cf00451c)
2023-10-03 17:53:23,605 Running on GPU
2023-10-03 17:53:24,048 Found 4 citations for tools or databases
2023-10-03 17:53:24,048 Query 1/1665: NF006199 (length 335)
2023-10-03 17:53:31,299 Padding length to 345
2023-10-03 17:53:48.055148: E external/org_tensorflow/tensorflow/compiler/xla/pjrt/pjrt_stream_executor_client.cc:2153] Execution of replica 0 failed: INTERNAL: cublas error
2023-10-03 17:53:48,059 Could not predict NF006199. Not Enough GPU memory? INTERNAL: cublas error
2023-10-03 17:53:48,059 Query 2/1665: NF011736 (length 335)
2023-10-03 17:53:49,587 Padding length to 345
2023-10-03 17:53:49.592976: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:833] failed to record completion event; therefore, failed to create inter-stream dependency
2023-10-03 17:53:49.593136: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:1179] failed to enqueue async memcpy from host to device: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered; GPU dst: 0x7ef32d812a00; host src: 0x7f0307200000; size: 4=0x4
2023-10-03 17:53:49.593170: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/stream.cc:321] Error recording event in stream: Error recording CUDA event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered; not marking stream as bad, as the Event object may be at fault. Monitor for further errors.
2023-10-03 17:53:49.593190: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:614] unable to add host callback: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2023-10-03 17:53:49.593357: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:1179] failed to enqueue async memcpy from host to device: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered; GPU dst: 0x7ef32d812b00; host src: 0x7f0307200100; size: 4=0x4
2023-10-03 17:53:49.593368: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/stream.cc:321] Error recording event in stream: Error recording CUDA event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered; not marking stream as bad, as the Event object may be at fault. Monitor for further errors.
2023-10-03 17:53:49.593389: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:1032] could not wait stream on event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2023-10-03 17:53:49.593421: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:614] unable to add host callback: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
...
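
For anyone wanting to reproduce the per-file workaround mentioned above (one colabfold_batch call per MSA instead of one call for the whole directory), here is a minimal Python sketch. The flags are copied from the command in this report. The helper names build_commands and run_all, and the choice of one output subdirectory per MSA, are assumptions made for illustration and are not part of ColabFold.

```python
import subprocess
from pathlib import Path

# Flags copied from the command shown in this report.
BASE_FLAGS = ["--zip", "--num-recycle", "2", "--num-models", "1",
              "--max-seq", "512", "--max-extra-seq", "1024"]


def build_commands(msa_dir, out_dir):
    """Return one colabfold_batch command per .a3m file in msa_dir."""
    commands = []
    for a3m in sorted(Path(msa_dir).glob("*.a3m")):
        # Assumption: each query gets its own output subdirectory.
        target = Path(out_dir) / a3m.stem
        commands.append(["colabfold_batch", *BASE_FLAGS, str(a3m), str(target)])
    return commands


def run_all(msa_dir, out_dir):
    """Run each command in its own process, one after the other."""
    for cmd in build_commands(msa_dir, out_dir):
        # check=False: continue with the next MSA even if this one fails.
        result = subprocess.run(cmd, check=False)
        if result.returncode != 0:
            print(f"non-zero exit status for {cmd[-2]}")
```

Calling run_all("MSAs", "ColabFold_prediction") would start a fresh process for every MSA, so no GPU state is carried over from one query to the next. The cost is that each process pays the startup overhead again.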

Oct 03 '23 18:10 apcamargo

I ended up facing the same issue with an MSA that was further along in the queue (I fixed it by reducing --max-seq). This is not the same MSA that failed in the run that used the whole directory as input.
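
A rough Python sketch of how the failed queries could be collected from a saved log and retried with a smaller --max-seq. It relies only on the "Could not predict <name>." message format visible in the log above. The function names, the halving schedule, and the floor value are hypothetical choices for illustration; the fix described here only says that --max-seq was reduced, not by how much.

```python
import re

# Matches log lines such as:
# "... Could not predict NF006199. Not Enough GPU memory? INTERNAL: cublas error"
FAILED = re.compile(r"Could not predict (\S+)\. ")


def failed_queries(log_text):
    """Return the names of queries the log reports as failed, in log order."""
    return FAILED.findall(log_text)


def reduced_max_seq(max_seq, factor=2, floor=16):
    """Hypothetical schedule: divide --max-seq by factor, never going below floor."""
    return max(floor, max_seq // factor)
```

The names returned by failed_queries can then be mapped back to their .a3m files and rerun one at a time with the smaller value.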

Does colabfold_batch process MSAs in advance?

Oct 03 '23 19:10 apcamargo
