alphafold
alphafold copied to clipboard
Weird Inference Behavior with GPU: Sometimes it works or Segmentation Fault/Filesystem Space Error
Hello all, I am encountering inconsistent behavior during GPU inference. Sometimes the inference runs successfully, but other times it fails with either:
- Segmentation fault
or
- Non-OK-status: has_executable.status() status: INTERNAL: ptxas exited with non-zero error code 132, output: : If the error message indicates that a file could not be written, please verify that sufficient filesystem space is provided.Failure occured when compiling fusion gemm_fusion_dot.65403 with config '{block_m:32,block_n:64,block_k:32,split_k:1,num_stages:4,num_warps:4,num_ctas:1}'
Environment:
- Docker image as provided or local install
- GPU: RTX A6000 48GB CUDA, Driver Version: 535.183.01 CUDA Version: 12.2, Default run mode
- Ubuntu 24.03
- 128 GB RAM
- 6 TB Disk
- CPU: Intel i9-14900K (32) @ 5.700GHz
Steps to Reproduce:
Flags used:
'TF_FORCE_UNIFIED_MEMORY': '1',
'XLA_PYTHON_CLIENT_MEM_FRACTION': '4.0'
-
Build the Docker image: build -f docker/Dockerfile -t alphafold
-
Run the Docker container:
python3 docker/run_docker.py \
--fasta_paths=/home/juliocesar/DataZ/Models/WP_277336079.1.fasta \
--max_template_date=2020-05-14 \
--model_preset=monomer \
--data_dir=/home/juliocesar/DataZ/Models/alphafold_all_data \
--output_dir=/home/juliocesar/DataZ/Models/output \
--enable_gpu_relax=false
- The run may succeed and only use about 2GB of vRAM, and the results look fine.
- If I run another inference, I encounter either:
Segmentation Fault:
I0802 00:55:45.666194 127875129517888 run_docker.py:262] Fatal Python error: Segmentation fault
I0802 00:55:45.666548 127875129517888 run_docker.py:262]
I0802 00:55:45.666624 127875129517888 run_docker.py:262] Thread 0x000070bc94f6b280 (most recent call first):
I0802 00:55:45.666688 127875129517888 run_docker.py:262] File "/opt/conda/lib/python3.11/site-packages/jax/_src/compiler.py", line 238 in backend_compile
I0802 00:55:45.666738 127875129517888 run_docker.py:262] File "/opt/conda/lib/python3.11/site-packages/jax/_src/profiler.py", line 335 in wrapper
I0802 00:55:45.667106 127875129517888 run_docker.py:262] File "/opt/conda/lib/python3.11/site-packages/jax/_src/compiler.py", line 500 in _compile_and_write_cache
I0802 00:55:45.667136 127875129517888 run_docker.py:262] File "/opt/conda/lib/python3.11/site-packages/jax/_src/compiler.py", line 333 in compile_or_get_cached
I0802 00:55:45.667161 127875129517888 run_docker.py:262] File "/opt/conda/lib/python3.11/site-packages/jax/_src/interpreters/pxla.py", line 2718 in _cached_compilation
I0802 00:55:45.667187 127875129517888 run_docker.py:262] File "/opt/conda/lib/python3.11/site-packages/jax/_src/interpreters/pxla.py", line 2908 in from_hlo
I0802 00:55:45.667212 127875129517888 run_docker.py:262] File "/opt/conda/lib/python3.11/site-packages/jax/_src/interpreters/pxla.py", line 2369 in compile
I0802 00:55:45.667233 127875129517888 run_docker.py:262] File "/opt/conda/lib/python3.11/site-packages/jax/_src/pjit.py", line 1406 in _pjit_call_impl_python
I0802 00:55:45.667258 127875129517888 run_docker.py:262] File "/opt/conda/lib/python3.11/site-packages/jax/_src/pjit.py", line 1471 in call_impl_cache_miss
I0802 00:55:45.667283 127875129517888 run_docker.py:262] File "/opt/conda/lib/python3.11/site-packages/jax/_src/pjit.py", line 1488 in _pjit_call_impl
I0802 00:55:45.667304 127875129517888 run_docker.py:262] File "/opt/conda/lib/python3.11/site-packages/jax/_src/core.py", line 913 in process_primitive
I0802 00:55:45.667324 127875129517888 run_docker.py:262] File "/opt/conda/lib/python3.11/site-packages/jax/_src/core.py", line 425 in bind_with_trace
I0802 00:55:45.667344 127875129517888 run_docker.py:262] File "/opt/conda/lib/python3.11/site-packages/jax/_src/core.py", line 2788 in bind
I0802 00:55:45.667364 127875129517888 run_docker.py:262] File "/opt/conda/lib/python3.11/site-packages/jax/_src/pjit.py", line 176 in _python_pjit_helper
I0802 00:55:45.667383 127875129517888 run_docker.py:262] File "/opt/conda/lib/python3.11/site-packages/jax/_src/pjit.py", line 298 in cache_miss
I0802 00:55:45.667402 127875129517888 run_docker.py:262] File "/opt/conda/lib/python3.11/site-packages/jax/_src/traceback_util.py", line 179 in reraise_with_filtered_traceback
I0802 00:55:45.667421 127875129517888 run_docker.py:262] File "/app/alphafold/alphafold/model/model.py", line 167 in predict
I0802 00:55:45.667440 127875129517888 run_docker.py:262] File "/app/alphafold/run_alphafold.py", line 284 in predict_structure
I0802 00:55:45.667459 127875129517888 run_docker.py:262] File "/app/alphafold/run_alphafold.py", line 543 in main
I0802 00:55:45.667478 127875129517888 run_docker.py:262] File "/opt/conda/lib/python3.11/site-packages/absl/app.py", line 258 in _run_main
I0802 00:55:45.667497 127875129517888 run_docker.py:262] File "/opt/conda/lib/python3.11/site-packages/absl/app.py", line 312 in run
I0802 00:55:45.667516 127875129517888 run_docker.py:262] File "/app/alphafold/run_alphafold.py", line 570 in <module>
Fatal Python error: Aborted: If the error message indicates that a file could not be written, please verify that sufficient filesystem space is provided
I0802 01:34:50.745555 132592778102592 run_docker.py:263] 2024-08-02 01:34:50.745063: F external/xla/xla/service/gpu/gemm_fusion_autotuner.cc:780] Non-OK-status: has_executable.status() status: INTERNAL: ptxas exited with non-zero error code 139, output: : If the error message indicates that a file could not be written, please verify that sufficient filesystem space is provided.Failure occured when compiling fusion gemm_fusion_dot.52354 with config '{block_m:16,block_n:16,block_k:256,split_k:1,num_stages:1,num_warps:4,num_ctas:1}'
I0802 01:34:50.745778 132592778102592 run_docker.py:263] Fused HLO computation:
I0802 01:34:50.745835 132592778102592 run_docker.py:263] %gemm_fusion_dot.52354_computation (parameter_0.92: f32[17,384], parameter_1.92: f32[384], parameter_2.28: f32[384,384]) -> f32[17,384] {
I0802 01:34:50.745885 132592778102592 run_docker.py:263] %parameter_0.92 = f32[17,384]{1,0} parameter(0)
I0802 01:34:50.745933 132592778102592 run_docker.py:263] %parameter_1.92 = f32[384]{0} parameter(1)
I0802 01:34:50.745979 132592778102592 run_docker.py:263] %broadcast.15023 = f32[17,384]{1,0} broadcast(f32[384]{0} %parameter_1.92), dimensions={1}, metadata={op_name="jit(apply_fn)/jit(main)/alphafold/alphafold_iteration/structure_module/single_layer_norm/single_layer_norm/add" source_file="/app/alphafold/alphafold/model/common_modules.py" source_line=185}
I0802 01:34:50.746032 132592778102592 run_docker.py:263] %add.12065 = f32[17,384]{1,0} add(f32[17,384]{1,0} %parameter_0.92, f32[17,384]{1,0} %broadcast.15023), metadata={op_name="jit(apply_fn)/jit(main)/alphafold/alphafold_iteration/structure_module/single_layer_norm/single_layer_norm/add" source_file="/app/alphafold/alphafold/model/common_modules.py" source_line=185}
I0802 01:34:50.746080 132592778102592 run_docker.py:263] %parameter_2.28 = f32[384,384]{1,0} parameter(2)
I0802 01:34:50.746122 132592778102592 run_docker.py:263] ROOT %dot.3542 = f32[17,384]{1,0} dot(f32[17,384]{1,0} %add.12065, f32[384,384]{1,0} %parameter_2.28), lhs_contracting_dims={1}, rhs_contracting_dims={0}, metadata={op_name="jit(apply_fn)/jit(main)/alphafold/alphafold_iteration/structure_module/initial_projection/...a, ah->...h/dot_general[dimension_numbers=(((1,), (0,)), ((), ())) precision=None preferred_element_type=float32]" source_file="/app/alphafold/alphafold/model/common_modules.py" source_line=122}
I0802 01:34:50.746166 132592778102592 run_docker.py:263] }
I0802 01:34:50.746207 132592778102592 run_docker.py:263] Fatal Python error: Aborted
I0802 01:34:50.746250 132592778102592 run_docker.py:263]
I0802 01:34:50.746290 132592778102592 run_docker.py:263] Thread 0x00007874dcb35280 (most recent call first):
I0802 01:34:50.746330 132592778102592 run_docker.py:263] File "/opt/conda/lib/python3.11/site-packages/jax/_src/compiler.py", line 238 in backend_compile
I0802 01:34:50.746370 132592778102592 run_docker.py:263] File "/opt/conda/lib/python3.11/site-packages/jax/_src/profiler.py", line 335 in wrapper
I0802 01:34:50.746411 132592778102592 run_docker.py:263] File "/opt/conda/lib/python3.11/site-packages/jax/_src/compiler.py", line 500 in _compile_and_write_cache
I0802 01:34:50.746460 132592778102592 run_docker.py:263] File "/opt/conda/lib/python3.11/site-packages/jax/_src/compiler.py", line 333 in compile_or_get_cached
I0802 01:34:50.746501 132592778102592 run_docker.py:263] File "/opt/conda/lib/python3.11/site-packages/jax/_src/interpreters/pxla.py", line 2718 in _cached_compilation
I0802 01:34:50.746541 132592778102592 run_docker.py:263] File "/opt/conda/lib/python3.11/site-packages/jax/_src/interpreters/pxla.py", line 2908 in from_hlo
I0802 01:34:50.746581 132592778102592 run_docker.py:263] File "/opt/conda/lib/python3.11/site-packages/jax/_src/interpreters/pxla.py", line 2369 in compile
I0802 01:34:50.746620 132592778102592 run_docker.py:263] File "/opt/conda/lib/python3.11/site-packages/jax/_src/pjit.py", line 1406 in _pjit_call_impl_python
I0802 01:34:50.746675 132592778102592 run_docker.py:263] File "/opt/conda/lib/python3.11/site-packages/jax/_src/pjit.py", line 1471 in call_impl_cache_miss
I0802 01:34:50.746716 132592778102592 run_docker.py:263] File "/opt/conda/lib/python3.11/site-packages/jax/_src/pjit.py", line 1488 in _pjit_call_impl
I0802 01:34:50.746762 132592778102592 run_docker.py:263] File "/opt/conda/lib/python3.11/site-packages/jax/_src/core.py", line 913 in process_primitive
I0802 01:34:50.747026 132592778102592 run_docker.py:263] File "/opt/conda/lib/python3.11/site-packages/jax/_src/core.py", line 425 in bind_with_trace
I0802 01:34:50.747194 132592778102592 run_docker.py:263] File "/opt/conda/lib/python3.11/site-packages/jax/_src/core.py", line 2788 in bind
I0802 01:34:50.747266 132592778102592 run_docker.py:263] File "/opt/conda/lib/python3.11/site-packages/jax/_src/pjit.py", line 176 in _python_pjit_helper
I0802 01:34:50.747329 132592778102592 run_docker.py:263] File "/opt/conda/lib/python3.11/site-packages/jax/_src/pjit.py", line 298 in cache_miss
I0802 01:34:50.747386 132592778102592 run_docker.py:263] File "/opt/conda/lib/python3.11/site-packages/jax/_src/traceback_util.py", line 179 in reraise_with_filtered_traceback
I0802 01:34:50.747431 132592778102592 run_docker.py:263] File "/app/alphafold/alphafold/model/model.py", line 167 in predict
I0802 01:34:50.747478 132592778102592 run_docker.py:263] File "/app/alphafold/run_alphafold.py", line 284 in predict_structure
I0802 01:34:50.747540 132592778102592 run_docker.py:263] File "/app/alphafold/run_alphafold.py", line 543 in main
I0802 01:34:50.747584 132592778102592 run_docker.py:263] File "/opt/conda/lib/python3.11/site-packages/absl/app.py", line 258 in _run_main
I0802 01:34:50.747641 132592778102592 run_docker.py:263] File "/opt/conda/lib/python3.11/site-packages/absl/app.py", line 312 in run
I0802 01:34:50.747692 132592778102592 run_docker.py:263] File "/app/alphafold/run_alphafold.py", line 570 in <module>
Successful inference:
I0802 12:20:21.280953 130830089832256 run_docker.py:262] I0802 12:20:21.280649 126517922906752 run_alphafold.py:276] Running model model_1_pred_0 on WP_277325438.1
I0802 12:20:22.182663 130830089832256 run_docker.py:262] 2024-08-02 12:20:22.182248: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:388] MLIR V1 optimization pass is not enabled
I0802 12:20:22.435071 130830089832256 run_docker.py:262] I0802 12:20:22.434748 126517922906752 model.py:165] Running predict with shape(feat) = {'aatype': (4, 57), 'residue_index': (4, 57), 'seq_length': (4,), 'template_aatype': (4, 4, 57), 'template_all_atom_masks': (4, 4, 57, 37), 'template_all_atom_positions': (4, 4, 57, 37, 3), 'template_sum_probs': (4, 4, 1), 'is_distillation': (4,), 'seq_mask': (4, 57), 'msa_mask': (4, 508, 57), 'msa_row_mask': (4, 508), 'random_crop_to_size_seed': (4, 2), 'template_mask': (4, 4), 'template_pseudo_beta': (4, 4, 57, 3), 'template_pseudo_beta_mask': (4, 4, 57), 'atom14_atom_exists': (4, 57, 14), 'residx_atom14_to_atom37': (4, 57, 14), 'residx_atom37_to_atom14': (4, 57, 37), 'atom37_atom_exists': (4, 57, 37), 'extra_msa': (4, 5120, 57), 'extra_msa_mask': (4, 5120, 57), 'extra_msa_row_mask': (4, 5120), 'bert_mask': (4, 508, 57), 'true_msa': (4, 508, 57), 'extra_has_deletion': (4, 5120, 57), 'extra_deletion_value': (4, 5120, 57), 'msa_feat': (4, 508, 57, 49), 'target_feat': (4, 57, 22)}
I0802 12:21:06.842577 130830089832256 run_docker.py:262] I0802 12:21:06.842006 126517922906752 model.py:175] Output shape was {'distogram': {'bin_edges': (63,), 'logits': (57, 57, 64)}, 'experimentally_resolved': {'logits': (57, 37)}, 'masked_msa': {'logits': (508, 57, 23)}, 'predicted_lddt': {'logits': (57, 50)}, 'structure_module': {'final_atom_mask': (57, 37), 'final_atom_positions': (57, 37, 3)}, 'plddt': (57,), 'ranking_confidence': ()}
I0802 12:21:06.842761 130830089832256 run_docker.py:262] I0802 12:21:06.842111 126517922906752 run_alphafold.py:288] Total JAX model model_1_pred_0 on WP_277325438.1 predict time (includes compilation time, see --benchmark): 44.4s
I0802 12:21:06.867249 130830089832256 run_docker.py:262] I0802 12:21:06.866905 126517922906752 run_alphafold.py:276] Running model model_2_pred_0 on WP_277325438.1
I0802 12:21:07.528902 130830089832256 run_docker.py:262] I0802 12:21:07.528476 126517922906752 model.py:165] Running predict with shape(feat) = {'aatype': (4, 57), 'residue_index': (4, 57), 'seq_length': (4,), 'template_aatype': (4, 4, 57), 'template_all_atom_masks': (4, 4, 57, 37), 'template_all_atom_positions': (4, 4, 57, 37, 3), 'template_sum_probs': (4, 4, 1), 'is_distillation': (4,), 'seq_mask': (4, 57), 'msa_mask': (4, 508, 57), 'msa_row_mask': (4, 508), 'random_crop_to_size_seed': (4, 2), 'template_mask': (4, 4), 'template_pseudo_beta': (4, 4, 57, 3), 'template_pseudo_beta_mask': (4, 4, 57), 'atom14_atom_exists': (4, 57, 14), 'residx_atom14_to_atom37': (4, 57, 14), 'residx_atom37_to_atom14': (4, 57, 37), 'atom37_atom_exists': (4, 57, 37), 'extra_msa': (4, 1024, 57), 'extra_msa_mask': (4, 1024, 57), 'extra_msa_row_mask': (4, 1024), 'bert_mask': (4, 508, 57), 'true_msa': (4, 508, 57), 'extra_has_deletion': (4, 1024, 57), 'extra_deletion_value': (4, 1024, 57), 'msa_feat': (4, 508, 57, 49), 'target_feat': (4, 57, 22)}
I0802 12:21:45.562296 130830089832256 run_docker.py:262] I0802 12:21:45.561828 126517922906752 model.py:175] Output shape was {'distogram': {'bin_edges': (63,), 'logits': (57, 57, 64)}, 'experimentally_resolved': {'logits': (57, 37)}, 'masked_msa': {'logits': (508, 57, 23)}, 'predicted_lddt': {'logits': (57, 50)}, 'structure_module': {'final_atom_mask': (57, 37), 'final_atom_positions': (57, 37, 3)}, 'plddt': (57,), 'ranking_confidence': ()}
I0802 12:21:45.562419 130830089832256 run_docker.py:262] I0802 12:21:45.561898 126517922906752 run_alphafold.py:288] Total JAX model model_2_pred_0 on WP_277325438.1 predict time (includes compilation time, see --benchmark): 38.0s
I0802 12:21:45.576066 130830089832256 run_docker.py:262] I0802 12:21:45.575660 126517922906752 run_alphafold.py:276] Running model model_3_pred_0 on WP_277325438.1
I0802 12:21:46.514541 130830089832256 run_docker.py:262] I0802 12:21:46.514255 126517922906752 model.py:165] Running predict with shape(feat) = {'aatype': (4, 57), 'residue_index': (4, 57), 'seq_length': (4,), 'is_distillation': (4,), 'seq_mask': (4, 57), 'msa_mask': (4, 512, 57), 'msa_row_mask': (4, 512), 'random_crop_to_size_seed': (4, 2), 'atom14_atom_exists': (4, 57, 14), 'residx_atom14_to_atom37': (4, 57, 14), 'residx_atom37_to_atom14': (4, 57, 37), 'atom37_atom_exists': (4, 57, 37), 'extra_msa': (4, 5120, 57), 'extra_msa_mask': (4, 5120, 57), 'extra_msa_row_mask': (4, 5120), 'bert_mask': (4, 512, 57), 'true_msa': (4, 512, 57), 'extra_has_deletion': (4, 5120, 57), 'extra_deletion_value': (4, 5120, 57), 'msa_feat': (4, 512, 57, 49), 'target_feat': (4, 57, 22)}
I0802 12:22:18.854233 130830089832256 run_docker.py:262] I0802 12:22:18.853717 126517922906752 model.py:175] Output shape was {'distogram': {'bin_edges': (63,), 'logits': (57, 57, 64)}, 'experimentally_resolved': {'logits': (57, 37)}, 'masked_msa': {'logits': (512, 57, 23)}, 'predicted_lddt': {'logits': (57, 50)}, 'structure_module': {'final_atom_mask': (57, 37), 'final_atom_positions': (57, 37, 3)}, 'plddt': (57,), 'ranking_confidence': ()}
I0802 12:22:18.854382 130830089832256 run_docker.py:262] I0802 12:22:18.853795 126517922906752 run_alphafold.py:288] Total JAX model model_3_pred_0 on WP_277325438.1 predict time (includes compilation time, see --benchmark): 32.3s
I0802 12:22:18.868918 130830089832256 run_docker.py:262] I0802 12:22:18.868620 126517922906752 run_alphafold.py:276] Running model model_4_pred_0 on WP_277325438.1
I0802 12:22:19.415913 130830089832256 run_docker.py:262] I0802 12:22:19.415531 126517922906752 model.py:165] Running predict with shape(feat) = {'aatype': (4, 57), 'residue_index': (4, 57), 'seq_length': (4,), 'is_distillation': (4,), 'seq_mask': (4, 57), 'msa_mask': (4, 512, 57), 'msa_row_mask': (4, 512), 'random_crop_to_size_seed': (4, 2), 'atom14_atom_exists': (4, 57, 14), 'residx_atom14_to_atom37': (4, 57, 14), 'residx_atom37_to_atom14': (4, 57, 37), 'atom37_atom_exists': (4, 57, 37), 'extra_msa': (4, 5120, 57), 'extra_msa_mask': (4, 5120, 57), 'extra_msa_row_mask': (4, 5120), 'bert_mask': (4, 512, 57), 'true_msa': (4, 512, 57), 'extra_has_deletion': (4, 5120, 57), 'extra_deletion_value': (4, 5120, 57), 'msa_feat': (4, 512, 57, 49), 'target_feat': (4, 57, 22)}
I0802 12:22:49.981101 130830089832256 run_docker.py:262] I0802 12:22:49.980713 126517922906752 model.py:175] Output shape was {'distogram': {'bin_edges': (63,), 'logits': (57, 57, 64)}, 'experimentally_resolved': {'logits': (57, 37)}, 'masked_msa': {'logits': (512, 57, 23)}, 'predicted_lddt': {'logits': (57, 50)}, 'structure_module': {'final_atom_mask': (57, 37), 'final_atom_positions': (57, 37, 3)}, 'plddt': (57,), 'ranking_confidence': ()}
I0802 12:22:49.981184 130830089832256 run_docker.py:262] I0802 12:22:49.980788 126517922906752 run_alphafold.py:288] Total JAX model model_4_pred_0 on WP_277325438.1 predict time (includes compilation time, see --benchmark): 30.6s
I0802 12:22:49.994777 130830089832256 run_docker.py:262] I0802 12:22:49.994589 126517922906752 run_alphafold.py:276] Running model model_5_pred_0 on WP_277325438.1
I0802 12:22:50.547898 130830089832256 run_docker.py:262] I0802 12:22:50.547599 126517922906752 model.py:165] Running predict with shape(feat) = {'aatype': (4, 57), 'residue_index': (4, 57), 'seq_length': (4,), 'is_distillation': (4,), 'seq_mask': (4, 57), 'msa_mask': (4, 512, 57), 'msa_row_mask': (4, 512), 'random_crop_to_size_seed': (4, 2), 'atom14_atom_exists': (4, 57, 14), 'residx_atom14_to_atom37': (4, 57, 14), 'residx_atom37_to_atom14': (4, 57, 37), 'atom37_atom_exists': (4, 57, 37), 'extra_msa': (4, 1024, 57), 'extra_msa_mask': (4, 1024, 57), 'extra_msa_row_mask': (4, 1024), 'bert_mask': (4, 512, 57), 'true_msa': (4, 512, 57), 'extra_has_deletion': (4, 1024, 57), 'extra_deletion_value': (4, 1024, 57), 'msa_feat': (4, 512, 57, 49), 'target_feat': (4, 57, 22)}
I0802 12:23:20.811322 130830089832256 run_docker.py:262] I0802 12:23:20.810725 126517922906752 model.py:175] Output shape was {'distogram': {'bin_edges': (63,), 'logits': (57, 57, 64)}, 'experimentally_resolved': {'logits': (57, 37)}, 'masked_msa': {'logits': (512, 57, 23)}, 'predicted_lddt': {'logits': (57, 50)}, 'structure_module': {'final_atom_mask': (57, 37), 'final_atom_positions': (57, 37, 3)}, 'plddt': (57,), 'ranking_confidence': ()}
I0802 12:23:20.811408 130830089832256 run_docker.py:262] I0802 12:23:20.810804 126517922906752 run_alphafold.py:288] Total JAX model model_5_pred_0 on WP_277325438.1 predict time (includes compilation time, see --benchmark): 30.3s
I0802 12:23:21.161775 130830089832256 run_docker.py:262] I0802 12:23:21.161459 126517922906752 amber_minimize.py:178] alterations info: {'nonstandard_residues': [], 'removed_heterogens': set(), 'missing_residues': {}, 'missing_heavy_atoms': {}, 'missing_terminals': {<Residue 56 (GLY) of chain 0>: ['OXT']}, 'Se_in_MET': [], 'removed_chains': {0: []}}
I0802 12:23:21.182899 130830089832256 run_docker.py:262] I0802 12:23:21.182727 126517922906752 amber_minimize.py:408] Minimizing protein, attempt 1 of 100.
I0802 12:23:21.218353 130830089832256 run_docker.py:262] I0802 12:23:21.217987 126517922906752 amber_minimize.py:69] Restraining 384 / 761 particles.
I0802 12:23:21.510061 130830089832256 run_docker.py:262] I0802 12:23:21.509803 126517922906752 amber_minimize.py:178] alterations info: {'nonstandard_residues': [], 'removed_heterogens': set(), 'missing_residues': {}, 'missing_heavy_atoms': {}, 'missing_terminals': {}, 'Se_in_MET': [], 'removed_chains': {0: []}}
I0802 12:23:23.018579 130830089832256 run_docker.py:262] I0802 12:23:23.018234 126517922906752 amber_minimize.py:500] Iteration completed: Einit 25378.80 Efinal -497.05 Time 0.13 s num residue violations 0 num residue exclusions 0
I0802 12:23:23.084537 130830089832256 run_docker.py:262] I0802 12:23:23.084163 126517922906752 run_alphafold.py:414] Final timings for WP_277325438.1: {'features': 9.396395921707153, 'process_features_model_1_pred_0': 1.1538565158843994, 'predict_and_compile_model_1_pred_0': 44.40755605697632, 'process_features_model_2_pred_0': 0.6614584922790527, 'predict_and_compile_model_2_pred_0': 38.0334849357605, 'process_features_model_3_pred_0': 0.938460111618042, 'predict_and_compile_model_3_pred_0': 32.33959984779358, 'process_features_model_4_pred_0': 0.5467894077301025, 'predict_and_compile_model_4_pred_0': 30.56532311439514, 'process_features_model_5_pred_0': 0.5529048442840576, 'predict_and_compile_model_5_pred_0': 30.263261079788208, 'relax_model_1_pred_0': 2.210644483566284}
Expected Behavior:
- Inference should complete without errors.
- No segmentation faults or filesystem space errors as there is enough memory on the host and Docker has access to all resources.
Troubleshooting Steps Taken:
- Verified sufficient GPU memory is available.
- Checked Docker container for sufficient filesystem space.
- Monitored system memory usage to ensure it does not exceed limits.
- The problem persists even with reduced batch sizes on the model/config.py file
- Installed the model in the local machine using a Conda env to see if Docker was the issue but no luck, same behaviour.
- Tried using these flags, which has a effect on the amount of vRAM used but without apparent effect on the inference inconsistency:
'TF_FORCE_GPU_ALLOW_GROWTH': 'true',
'XLA_PYTHON_CLIENT_PREALLOCATE': 'false',
'XLA_PYTHON_CLIENT_MEM_FRACTION': '0.5',
Any guidance or suggestions for resolving these issues would be greatly appreciated.