ParallelFold

failed to alloc 2147483648 bytes on host: CUDA_ERROR_OPERATING_SYSTEM: OS call failed or operation not supported on this OS

Open yanchenmochen opened this issue 2 years ago • 3 comments

When I run the code on T1050.fasta, which has 700 residues, the command line prints the error in the title. The environment is an A100 GPU on Ubuntu, but I am using a newer version of jax and jaxlib. Could that be what is causing the problem?

(parafold) root@node33-a100:~# pip list | grep jax
jax 0.3.15
jaxlib 0.3.15+cuda11.cudnn82

yanchenmochen Aug 16 '22 08:08

2022-08-17 11:26:20.226278: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:796] failed to alloc 12524123136 bytes on host: CUDA_ERROR_OPERATING_SYSTEM: OS call failed or operation not supported on this OS
2022-08-17 11:26:20.226316: W external/org_tensorflow/tensorflow/core/common_runtime/device/device_host_allocator.h:46] could not allocate pinned host memory of size: 12524123136
2022-08-17 11:26:23.693074: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:796] failed to alloc 11271710720 bytes on host: CUDA_ERROR_OPERATING_SYSTEM: OS call failed or operation not supported on this OS
2022-08-17 11:26:23.693112: W external/org_tensorflow/tensorflow/core/common_runtime/device/device_host_allocator.h:46] could not allocate pinned host memory of size: 11271710720
2022-08-17 11:26:28.900144: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:796] failed to alloc 17179869184 bytes on host: CUDA_ERROR_OPERATING_SYSTEM: OS call failed or operation not supported on this OS
2022-08-17 11:26:28.900185: W external/org_tensorflow/tensorflow/core/common_runtime/device/device_host_allocator.h:46] could not allocate pinned host memory of size: 17179869184
2022-08-17 11:26:44.115027: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:796] failed to alloc 17179869184 bytes on host: CUDA_ERROR_OPERATING_SYSTEM: OS call failed or operation not supported on this OS
2022-08-17 11:26:44.115072: W external/org_tensorflow/tensorflow/core/common_runtime/device/device_host_allocator.h:46] could not allocate pinned host memory of size: 17179869184

yanchenmochen Aug 17 '22 03:08

I'm not sure about this. Maybe it's a jax version issue, as you said, but I haven't run into this before.

Zuricho Aug 19 '22 06:08

I switched to another machine to run the protein prediction, and it works correctly now. Maybe jaxlib was causing the problem, but the original Linux machine is also shared by many staff members.

yanchenmochen Aug 24 '22 05:08
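
The thread never confirmed a root cause. Since the failing machine was shared, one thing that might be worth ruling out is that the host simply did not have enough free RAM for the pinned allocations in the log. The sketch below is only a diagnostic aid under that assumption, not a confirmed fix: it compares the request sizes copied from the log against `MemAvailable` in `/proc/meminfo` on Linux. The helper names `parse_meminfo` and `fits_in_host_memory` are hypothetical and not part of ParallelFold or JAX.

```python
import os

# Request sizes (bytes) copied from the error messages in this thread.
FAILED_REQUESTS = [2147483648, 12524123136, 11271710720, 17179869184]


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: value}.

    Values with a "kB" unit are converted to bytes; unitless values
    (for example HugePages_Total) are kept as plain integers.
    """
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if not parts:
            continue
        value = int(parts[0])
        if len(parts) > 1 and parts[1] == "kB":
            value *= 1024
        info[key.strip()] = value
    return info


def fits_in_host_memory(request_bytes, available_bytes):
    """Return True if a request is no larger than the available host memory."""
    return request_bytes <= available_bytes


if __name__ == "__main__":
    path = "/proc/meminfo"  # Linux only
    if os.path.exists(path):
        with open(path) as fh:
            available = parse_meminfo(fh.read()).get("MemAvailable", 0)
        print(f"MemAvailable: {available / 2**30:.1f} GiB")
        for size in FAILED_REQUESTS:
            verdict = "fits" if fits_in_host_memory(size, available) else "too large"
            print(f"{size / 2**30:6.1f} GiB request: {verdict}")
    else:
        print("No /proc/meminfo found; this check only works on Linux.")
```

If every request fits comfortably while the machine is idle but the error still appears, free host RAM is probably not the explanation, and the jax/jaxlib version mismatch suggested above remains the more likely suspect.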