ColabFold fails when using directory with multiple .a3m as input (works for single input)

Open apcamargo opened this issue 2 years ago • 1 comments

I'm using ColabFold via localcolabfold on a g4dn.12xlarge AWS EC2 instance. I have a directory with thousands of multiple sequence alignments (MSAs) in the .a3m format and used it as input for colabfold_batch. After a thousand or so MSAs were processed, I started to get the error below. The strange thing is that I don't get any errors if I process the MSAs individually, in multiple commands. Since colabfold_batch processes the MSAs sequentially, I don't think this is actually a memory issue.

colabfold_batch --zip --num-recycle 2 --num-models 1 --max-seq 512 --max-extra-seq 1024 MSAs ColabFold_prediction
2023-10-03 17:53:09,571 Running colabfold 1.5.2 (29cbf90390086c336afbe7c420eef9a3cf00451c)
2023-10-03 17:53:23,605 Running on GPU
2023-10-03 17:53:24,048 Found 4 citations for tools or databases
2023-10-03 17:53:24,048 Query 1/1665: NF006199 (length 335)
2023-10-03 17:53:31,299 Padding length to 345
2023-10-03 17:53:48.055148: E external/org_tensorflow/tensorflow/compiler/xla/pjrt/pjrt_stream_executor_client.cc:2153] Execution of replica 0 failed: INTERNAL: cublas error
2023-10-03 17:53:48,059 Could not predict NF006199. Not Enough GPU memory? INTERNAL: cublas error
2023-10-03 17:53:48,059 Query 2/1665: NF011736 (length 335)
2023-10-03 17:53:49,587 Padding length to 345
2023-10-03 17:53:49.592976: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:833] failed to record completion event; therefore, failed to create inter-stream dependency
2023-10-03 17:53:49.593136: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:1179] failed to enqueue async memcpy from host to device: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered; GPU dst: 0x7ef32d812a00; host src: 0x7f0307200000; size: 4=0x4
2023-10-03 17:53:49.593170: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/stream.cc:321] Error recording event in stream: Error recording CUDA event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered; not marking stream as bad, as the Event object may be at fault. Monitor for further errors.
2023-10-03 17:53:49.593190: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:614] unable to add host callback: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2023-10-03 17:53:49.593357: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:1179] failed to enqueue async memcpy from host to device: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered; GPU dst: 0x7ef32d812b00; host src: 0x7f0307200100; size: 4=0x4
2023-10-03 17:53:49.593368: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/stream.cc:321] Error recording event in stream: Error recording CUDA event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered; not marking stream as bad, as the Event object may be at fault. Monitor for further errors.
2023-10-03 17:53:49.593389: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:1032] could not wait stream on event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2023-10-03 17:53:49.593421: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:614] unable to add host callback: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
...
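
For anyone who wants to reproduce the "one command per MSA" approach mentioned above, here is a minimal sketch of how it could be scripted. It is not part of ColabFold: `build_commands` and `run_all` are hypothetical helper names, the flags are simply copied from the command in this report, and writing each result to its own subdirectory is an assumption made for illustration. Starting a fresh process for every file would presumably keep a failed prediction from affecting the next one, but that is a guess based on the log, not something confirmed here.

```python
import subprocess
from pathlib import Path

# Flags copied verbatim from the command shown in this report.
BASE_FLAGS = [
    "--zip",
    "--num-recycle", "2",
    "--num-models", "1",
    "--max-seq", "512",
    "--max-extra-seq", "1024",
]


def build_commands(msa_dir, out_dir):
    """Return one colabfold_batch command per .a3m file found in msa_dir.

    Hypothetical helper: each MSA gets its own output subdirectory,
    named after the file stem (an assumption, not a ColabFold convention).
    """
    msa_dir, out_dir = Path(msa_dir), Path(out_dir)
    return [
        ["colabfold_batch", *BASE_FLAGS, str(a3m), str(out_dir / a3m.stem)]
        for a3m in sorted(msa_dir.glob("*.a3m"))
    ]


def run_all(msa_dir, out_dir):
    """Run every command in its own process, one after another."""
    for cmd in build_commands(msa_dir, out_dir):
        result = subprocess.run(cmd)
        # A non-zero exit code is reported, but the loop keeps going.
        if result.returncode != 0:
            print(f"colabfold_batch exited with {result.returncode} for {cmd[-2]}")
```

Calling `run_all("MSAs", "ColabFold_prediction")` would then mirror the directory run above, one file at a time. Note that the log shows colabfold_batch logging "Could not predict ..." and moving on to the next query, so the exit code alone may not reveal a failed prediction; checking the log output is probably still necessary.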

apcamargo, Oct 03 '23 18:10

I ended up hitting the same issue with an MSA that was further along in the queue (I fixed it by reducing --max-seq). This is not the same MSA that failed in the run that used the whole directory as input.

Does colabfold_batch process MSAs in advance?

apcamargo, Oct 03 '23 19:10