
Unusual predicted structures from pretrained OpenFold on Pascal GPU

epenning opened this issue 2 years ago · 10 comments

This is most likely some kind of local configuration error, but I haven't been able to pin down the cause. If anyone has encountered this behavior before or has an idea of what might be wrong based on these output structures, any hints would be greatly appreciated!

Expected behavior:

run_pretrained_openfold.py outputs predicted structures comparable to AlphaFold or OpenFold Colab output.

I expected a structure similar to this unrelaxed prediction from OpenFold Colab model_1 with finetuning_1.pt:
[image]

Actual behavior:

My run_pretrained_openfold.py predicted structures are not similar to AlphaFold or OpenFold Colab output.

Predictions from model_1 with finetuning_1.pt (unrelaxed in tan, relaxed in blue):
[image]

Predictions from model_1 with params_model_1.npz:
[image]

Predictions from model_1 with params_model_1.npz using alignments from ColabFold MMseqs2 (ColabFold had predicted a reasonable expected structure):
[image]

Context:

4 x NVIDIA GTX 1080 Ti GPUs, using CUDA 11.3 (I can provide other system details if relevant).

input/short.fasta

>query
MAAHKGAEHHHKAAEHHEQAAKHHHAAAEHHEKGEHEQAAHHADTAYAHHKHAEEHAAQAAKHDAEHHAPKPH

Run command:

python3 run_pretrained_openfold.py \
    input \
    data/pdb_mmcif/mmcif_files/ \
    --output_dir output \
    --cpus 16 \
    --preset reduced_dbs \
    --uniref90_database_path data/uniref90/uniref90.fasta \
    --mgnify_database_path data/mgnify/mgy_clusters_2018_12.fa \
    --pdb70_database_path data/pdb70/pdb70 \
    --uniclust30_database_path data/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
    --bfd_database_path data/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
    --model_device "cuda:0" \
    --jackhmmer_binary_path $venv_bin_dir/jackhmmer \
    --hhblits_binary_path $venv_bin_dir/hhblits \
    --hhsearch_binary_path $venv_bin_dir/hhsearch \
    --kalign_binary_path $venv_bin_dir/kalign \
    --config_preset "model_1" \
    --openfold_checkpoint_path openfold/resources/openfold_params/finetuning_1.pt

Other configurations I tried, which produced similarly strange outputs:

  • Removing --openfold_checkpoint_path to just use the AlphaFold weights
  • Using --config_preset "model_1_ptm" with finetuning_ptm_2.pt
  • Using --use_precomputed_alignments with alignment results from a previous OpenFold output (an example invocation is sketched after this list)
  • Using --use_precomputed_alignments with .a3m results from ColabFold
  • Using full_dbs instead of reduced_dbs
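For reference, a precomputed-alignments invocation is just the command above with one extra flag pointing at the alignment directory from an earlier run. A sketch (the alignment path is a placeholder for wherever that run wrote its per-sequence alignment folders; the database flags from the full command can be kept as-is):

python3 run_pretrained_openfold.py \
    input \
    data/pdb_mmcif/mmcif_files/ \
    --output_dir output \
    --use_precomputed_alignments output/alignments \
    --model_device "cuda:0" \
    --config_preset "model_1" \
    --openfold_checkpoint_path openfold/resources/openfold_params/finetuning_1.pt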

epenning · Jul 15 '22 18:07

Could you verify the git commit hash of your installation, and that git diff shows no modifications to the model code?

gahdritz · Jul 15 '22 19:07

I should also note that I installed OpenFold on another system, which did produce the expected prediction! The working system has 4 x NVIDIA Quadro RTX 5000 GPUs. I'm going to investigate further to find out what difference between these systems (or installations) is causing one of them to behave incorrectly.

I'm at commit hash: 6da2cdafc902d423cf7c136d66fbe81484d2cd0a

Git diff is:

diff --git a/openfold/data/parsers.py b/openfold/data/parsers.py
index 5f0574c..b3ccc53 100644
--- a/openfold/data/parsers.py
+++ b/openfold/data/parsers.py
@@ -60,6 +60,8 @@ def parse_fasta(fasta_string: str) -> Tuple[Sequence[str], Sequence[str]]:
             descriptions.append(line[1:])  # Remove the '>' at the beginning.
             sequences.append("")
             continue
+        elif line.startswith("#"):
+            continue
         elif not line:
             continue  # Skip blank lines.
         sequences[index] += line
diff --git a/tests/test_data/sample_feats.pickle.gz b/tests/test_data/sample_feats.pickle.gz
deleted file mode 100755
index 68a0c00..0000000
Binary files a/tests/test_data/sample_feats.pickle.gz and /dev/null differ

If it makes a difference, I'm also using the AlphaFold databases I already had from my AlphaFold2 installation instead of re-downloading them with the provided scripts. I'd expect them to be identical, but I suppose it's possible that's a problem.

epenning · Jul 18 '22 15:07

To narrow down the issue, I ran OpenFold predictions on both the bad-prediction and good-prediction systems using the .a3m file from MMseqs2 as the precomputed alignment. Here is what I discovered so far:

  • The GTX 1080 Ti system still produced bad predictions after reinstalling and re-downloading the parameters (including reinstalling Miniconda), so the problem doesn't seem to be an incidental issue with the downloaded OpenFold params.
  • The Quadro RTX 5000 system still produced good predictions even when using the exact installation and files used on the GTX 1080 Ti system (the filesystem is shared).

So, at this point it looks like the problem is related to CUDA or the GPU itself.

nvidia-smi output from the GTX 1080 Ti system (produces bad OpenFold predictions):

Mon Jul 18 12:53:47 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  On   | 00000000:02:00.0 Off |                  N/A |
|  0%   33C    P8     8W / 250W |      0MiB / 11178MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  On   | 00000000:03:00.0 Off |                  N/A |
|  0%   32C    P8     8W / 250W |      0MiB / 11178MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 108...  On   | 00000000:82:00.0 Off |                  N/A |
|  0%   30C    P8     8W / 250W |      0MiB / 11178MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 108...  On   | 00000000:83:00.0 Off |                  N/A |
|  0%   29C    P8     8W / 250W |      0MiB / 11178MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

nvidia-smi output from the Quadro RTX 5000 system (produces good OpenFold predictions):

Mon Jul 18 12:54:21 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.86       Driver Version: 470.86       CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro RTX 5000     On   | 00000000:02:00.0 Off |                  Off |
|  0%   32C    P8     9W / 230W |      0MiB / 16125MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Quadro RTX 5000     On   | 00000000:03:00.0 Off |                  Off |
|  0%   32C    P8     7W / 230W |      0MiB / 16125MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Quadro RTX 5000     On   | 00000000:82:00.0 Off |                  Off |
|  0%   31C    P8     4W / 230W |      0MiB / 16125MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Quadro RTX 5000     On   | 00000000:83:00.0 Off |                  Off |
|  0%   32C    P8     7W / 230W |      0MiB / 16125MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

epenning · Jul 18 '22 18:07

Could you try disabling the custom CUDA kernels on the bad system? Specifically, make sure that all occurrences of use_memory_efficient_kernel and use_flash are set to False in the config.
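Something along these lines would do it (a rough sketch, not code from the repo; it just walks the whole config, since the key locations differ between versions), dropped in wherever the config is constructed, before the model is built:

# Sketch: recursively force any custom-kernel flags in the loaded config to False.
import ml_collections
from openfold.config import model_config

def disable_custom_kernels(cfg):
    for key, value in cfg.items():
        if isinstance(value, ml_collections.ConfigDict):
            disable_custom_kernels(value)
        elif key in ("use_memory_efficient_kernel", "use_flash"):
            cfg[key] = False  # fall back to stock PyTorch attention

config = model_config("model_1")
disable_custom_kernels(config)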

gahdritz · Jul 18 '22 20:07

Since this is happening on the release tag v1.0.0, use_flash isn't present there yet.

Where in the config should I set use_memory_efficient_kernel, since it's not assigned there already?

epenning · Jul 18 '22 22:07

Ah, my bad: I never added it to the config. You'll have to disable use_memory_efficient_kernel manually in openfold/model/evoformer.py. There should only be one occurrence of it there; change its value from not use_lma to False.
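Concretely, the edit should look roughly like this (the surrounding context in your checkout may differ):

-    use_memory_efficient_kernel=not use_lma,
+    use_memory_efficient_kernel=False,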

gahdritz · Jul 18 '22 23:07

Thanks! I changed that line from not use_lma to False and ran it again, but it still produced another bad prediction.

epenning · Jul 19 '22 15:07

Weird. I'm not really sure what to make of this. Do you think it's worth trying to install OF from scratch on the 1080 system as a sanity check?

gahdritz · Jul 19 '22 16:07

I tried a couple of different reinstallations on the 1080 system; here's an update on that:

  • I fully removed OF from the 1080 system and installed it from scratch, but the predictions were still bad.
  • From the working system, I installed OF on the shared filesystem. Running that installation from the working system gave good predictions, but running the same installation (with the same parameters and inputs) from the 1080 system still produced bad structures.

Since then I've also created an additional installation on the shared filesystem, from a commit newer than v1.0.0 (236c68650a2ffbd846e51498d8b719c2b834ca38). That installation works on the good system (Quadro RTX 5000), but on the 1080 system, instead of producing bad output, it fails with an error: RuntimeError: CUDA error: no kernel image is available for execution on the device. I'm not sure what the cause is, or whether it's related to the bad structure predictions on v1.0.0.

Here's the full stack trace from that error on the 1080 system with the newer installation:

INFO:/.../shared/openfold/run_pretrained_openfold.py:Loaded OpenFold parameters at /.../shared/openfold/openfold/resources/openfold_params/finetuning_ptm_2.pt...
INFO:/.../shared/openfold/run_pretrained_openfold.py:Using precomputed alignments for query at /.../msa_input...
INFO:/.../shared/openfold/run_pretrained_openfold.py:Running inference for query...
Traceback (most recent call last):
  File "/.../shared/openfold/run_pretrained_openfold.py", line 591, in <module>
    main(args)
  File "/.../shared/openfold/run_pretrained_openfold.py", line 439, in main
    out = run_model(model, processed_feature_dict, tag, args)
  File "/.../shared/openfold/run_pretrained_openfold.py", line 119, in run_model
    out = model(batch)
  File "/.../shared/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/.../shared/openfold/openfold/model/model.py", line 510, in forward
    _recycle=(num_iters > 1)
  File "/.../shared/openfold/openfold/model/model.py", line 367, in iteration
    _mask_trans=self.config._mask_trans,
  File "/.../shared/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/.../shared/openfold/openfold/model/evoformer.py", line 998, in forward
    m, z = b(m, z)
  File "/.../shared/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/.../shared/openfold/openfold/model/evoformer.py", line 519, in forward
    self.ckpt if torch.is_grad_enabled() else False,
  File "/.../shared/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/.../shared/openfold/openfold/model/msa.py", line 273, in forward
    flash_mask=mask,
  File "/.../shared/openfold/openfold/model/msa.py", line 125, in _chunk
    no_batch_dims=len(m.shape[:-2])
  File "/.../shared/openfold/openfold/utils/chunk_utils.py", line 299, in chunk_layer
    output_chunk = layer(**chunks)
  File "/.../shared/openfold/openfold/model/msa.py", line 108, in fn
    flash_mask=flash_mask,
  File "/.../shared/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/.../shared/openfold/openfold/model/primitives.py", line 488, in forward
    o = attention_core(q, k, v, *((biases + [None] * 2)[:2]))
  File "/.../shared/openfold/openfold/utils/kernel/attention_core.py", line 53, in forward
    o = torch.matmul(attention_logits, v) 
RuntimeError: CUDA error: no kernel image is available for execution on the device

Since I do have access to a system that is working, this isn't a high priority for me, but I figured this extra information could be useful in the future. There might just be something subtly wrong with the environment in the 1080 system that I wasn't able to identify.
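For future reference, my understanding is that a "no kernel image is available" error usually means the binary wasn't compiled for the GPU's compute capability (the GTX 1080 Ti is a Pascal card, sm_61, while the Quadro RTX 5000 is Turing, sm_75). A rough sketch of a check comparing the installed PyTorch build against the local GPUs:

# Sketch: compare compiled CUDA architectures against the GPUs in this machine.
import torch

print("PyTorch:", torch.__version__, "CUDA:", torch.version.cuda)
print("Compiled arch list:", torch.cuda.get_arch_list())
for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    print(f"GPU {i}: {torch.cuda.get_device_name(i)} (sm_{major}{minor})")

If sm_61 (or a compatible architecture) isn't in the compiled list, that would explain why only the Pascal cards are affected.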

epenning · Jul 27 '22 20:07

Thanks for the diagnostics. Weird that it's crashing on a stock PyTorch matmul and not the custom kernel immediately thereafter... Equally weird is that this is happening halfway through the network, after many other stock PyTorch matmuls. Is there something wrong with the way I install the custom kernel perhaps?

This isn't the only Pascal GPU that's given OpenFold trouble, so it doesn't look like the issue is specific to your setup. Since I can't reproduce this at the moment, I'll let it sit, but I'll leave the issue open in case the problem resurfaces elsewhere. I'll also update the issue title so future discussion of this is consolidated here.
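If anyone with a Pascal card wants to poke at this, a minimal stock-PyTorch check along these lines (just a sketch) would at least show whether plain matmuls work there at all outside of OpenFold:

# Sketch: bare matmuls on the GPU in fp32 and fp16, independent of OpenFold,
# to see whether the failure is specific to OpenFold's attention path.
import torch

dev = torch.device("cuda:0")
for dtype in (torch.float32, torch.float16):
    a = torch.randn(64, 64, device=dev, dtype=dtype)
    b = torch.randn(64, 64, device=dev, dtype=dtype)
    print(dtype, torch.matmul(a, b).sum().item())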

gahdritz · Jul 28 '22 19:07

I've confirmed that this was resolved with the fix for https://github.com/aqlaboratory/openfold/issues/172! 👏

epenning · Aug 29 '22 22:08