Training Runtime Error: StopIteration
Hi,
I'm using the released training data on AWS and the latest main branch to train the model.
- The directory structure of the released data is not recognized by the code.
- After restructuring the directories and putting all the .hhr and .a3m files under the alignment directory, the code crashes at
File "/global/u2/b/bz186/openfold/openfold/data/data_modules.py", line 377, in reroll datapoint_idx = next(samples)
with default settings.
Any ideas on how to solve this?
Thanks,
Bo
The full traceback is below:
Traceback (most recent call last):
File "train_openfold.py", line 548, in <module>
main(args)
File "train_openfold.py", line 341, in main
ckpt_path=ckpt_path,
File "/global/homes/b/bz186/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 741, in fit
self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
File "/global/homes/b/bz186/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 685, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/global/homes/b/bz186/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 777, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/global/homes/b/bz186/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1199, in _run
self._dispatch()
File "/global/homes/b/bz186/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1279, in _dispatch
self.training_type_plugin.start_training(self)
File "/global/homes/b/bz186/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 202, in start_training
self._results = trainer.run_stage()
File "/global/homes/b/bz186/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1289, in run_stage
return self._run_train()
File "/global/homes/b/bz186/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1319, in _run_train
self.fit_loop.run()
File "/global/homes/b/bz186/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 140, in run
self.on_run_start(*args, **kwargs)
File "/global/homes/b/bz186/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/pytorch_lightning/loops/fit_loop.py", line 197, in on_run_start
self.trainer.reset_train_val_dataloaders(self.trainer.lightning_module)
File "/global/homes/b/bz186/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/pytorch_lightning/trainer/data_loading.py", line 595, in reset_train_val_dataloaders
self.reset_train_dataloader(model=model)
File "/global/homes/b/bz186/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/pytorch_lightning/trainer/data_loading.py", line 365, in reset_train_dataloader
self.train_dataloader = self.request_dataloader(RunningStage.TRAINING, model=model)
File "/global/homes/b/bz186/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/pytorch_lightning/trainer/data_loading.py", line 611, in request_dataloader
dataloader = source.dataloader()
File "/global/homes/b/bz186/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 300, in dataloader
return method()
File "/global/u2/b/bz186/openfold/openfold/data/data_modules.py", line 694, in train_dataloader
return self._gen_dataloader("train")
File "/global/u2/b/bz186/openfold/openfold/data/data_modules.py", line 671, in _gen_dataloader
dataset.reroll()
File "/global/u2/b/bz186/openfold/openfold/data/data_modules.py", line 377, in reroll
datapoint_idx = next(samples)
StopIteration
srun: error: nid001680: task 0: Exited with exit code 1
srun: launch/slurm: _step_signal: Terminating StepId=2466693.0
Hi,
I dug a little deeper into this bug by adding some exception handling:
try:
    datapoint_idx = next(samples)
except StopIteration:
    print("samples.length = {}, idx = {}".format(sum(1 for _ in samples), dataset_idx))
The result shows that samples.length = 0 and idx = 0. Just FYI.
Bests,
Bo
Try increasing the length of the training epoch.
So this means setting --train_epoch_len to a larger value?
Yes. As for the format of the AWS data, you'll need to reformat it slightly. For each chain, you'll want to flatten the hhr and a3m directories, such that the directory for each chain contains both the hhr files and the a3m files for that chain.
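For example, if your raw download has per-chain a3m and hhr subdirectories, something along these lines should flatten them (just a sketch; alignment_root and the subdirectory names are assumptions, so adjust them to whatever your copy of the data actually uses):

import glob
import os
import shutil

alignment_root = "/path/to/alignment_openfold"  # placeholder; point at your alignment dir

for chain in os.listdir(alignment_root):
    chain_dir = os.path.join(alignment_root, chain)
    for sub in ("a3m", "hhr"):  # assumed per-chain subdirectory names
        sub_dir = os.path.join(chain_dir, sub)
        if not os.path.isdir(sub_dir):
            continue
        # Move every file up into the chain directory, then drop the empty subdir
        for f in glob.glob(os.path.join(sub_dir, "*")):
            shutil.move(f, chain_dir)
        os.rmdir(sub_dir)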
Yeah, I've done that. But I found that some chains don't have all 4 files (3 .a3m and 1 .hhr).
By the way, I've set --train_epoch_len to 80000 but it's still not working. If I set it to <= 1200, it works, but that doesn't make sense.
Could you share your training command? Also, have you modified your config at all?
srun python3 train_openfold.py \
/pscratch/sd/b/bz186/openfold/data/pdb_mmcif/mmcif_files \
/pscratch/sd/b/bz186/openfold/data/alignment_openfold \
/pscratch/sd/b/bz186/openfold/data/pdb_mmcif/mmcif_files \
/pscratch/sd/b/bz186/openfold/data/train_full_output \
2021-10-10 \
--template_release_dates_cache_path=/pscratch/sd/b/bz186/openfold/data/mmcif_cache.json \
--precision=32 \
--gpus=4 \
--replace_sampler_ddp=True \
--seed=42 \
--deepspeed_config_path=/global/homes/b/bz186/openfold/deepspeed_config.json \
--checkpoint_every_epoch \
--obsolete_pdbs_file_path=/pscratch/sd/b/bz186/openfold/data/pdb_mmcif/obsolete.dat \
--train_chain_data_cache_path=/pscratch/sd/b/bz186/openfold/data/chain_data_cache.json \
--train_epoch_len=80000
I didn't modify anything compared to the latest main branch.
Is this being run on a single SLURM node, or multiple?
A single node with 4 GPUs; I ran it in interactive mode.
Hm, not able to reproduce. Could you share a sample of the directory structure of /pscratch/sd/b/bz186/openfold/data/alignment_openfold? Also, try placing a print statement in the if block starting on line 333 of openfold/data/data_modules.py, where it filters out certain chains using the chain data cache. Is every protein in your dataset getting filtered?
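For example, something like this (a rough sketch; the deterministic_train_filter call is what that block looks like in the copy of data_modules.py I'm referencing, and may differ slightly in yours):

# In looped_samples() in openfold/data/data_modules.py, inside the if block
# (around line 333) that filters chains using the chain data cache:
if not deterministic_train_filter(chain_data_cache_entry):
    # Debug: log every chain that gets dropped here
    print(f"Filtered out {chain_id}: {chain_data_cache_entry}")
    continue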
/pscratch/sd/b/bz186/openfold/data/alignment_openfold
- 11as_A
  - bfd_uniclust_hits.a3m
  - mgnify_hits.a3m
  - pdb70_hits.hhr
  - uniref90_hits.a3m
- 11ba_A
  - bfd_uniclust_hits.a3m
  - mgnify_hits.a3m
  - pdb70_hits.hhr
  - uniref90_hits.a3m
- 11ba_B
- 11bg_A
- 11bg_B
- 11gs_A
The directory tree is similar for all chains, except that some don't have all 4 files.
How about the filter thing (I edited my previous message)?
It prints some of the chain_data_cache_entry values like this:
{'release_date': '2011-02-09', 'seq': 'MSAGKLPEGWVIAPVSTVTTLIRGVTYKKEQAINYLKDDYLPLIRANNIQNGKFDTTDLVFVPKNLVKESQKISPEDIVIAMSSGSKSVVGKSAHQHLPFECSFGAFCGVLRPEKLIFSGFIAHFTKSSLYRNKISSLSAGANINNIKPASFDLINIPIPPLAEQKIIAEKLDTLLAQVDSTKARFEQIPQILKRFRQAVLGGAVNGKLTEKWRNFEPQHSVFKKLNFESILTELRNGLSSKPNESGVGHPILRISSVRAGHVDQNDIRFLECSESELNRHKLQDGDLLFTRYNGSLEFVGVCGLLKKLQHQNLLYPDKLIRARLTKDALPEYIEIFFSSPSARNAMMNCVKTTSGQKGISGKDIKSQVVLLPPVKEQAEIVRRVEQLFAYADTIEKQVNNALARVNNLTQSILAKAFRGELTAQWRAENPDLISGENSAAALLEKIKAERAASGGKKASRKKS', 'resolution': 18.0, 'cluster_size': -1}
The count for the if block is 26 before it crashes.
Could you count how many of your proteins are getting filtered out by that function? I see, for example, that the resolution of this protein is very high (18.0); this one gets filtered for that reason.
The count for the if block is 26 before it crashes.
How about the probabilities that come out of the stochastic filter? Are those really small?
I printed p right after
p = get_stochastic_train_filter_prob(
    chain_data_cache_entry,
)
Almost all of them are > 0.5:
p = 1.0
p = 0.609375
p = 0.880859375
p = 0.5
p = 0.5
p = 0.5
p = 0.693359375
p = 0.5
p = 0.5
p = 0.5
p = 0.5
p = 0.6640625
p = 0.822265625
p = 0.001953125
p = 0.5
p = 0.5
p = 0.9140625
p = 0.5
p = 0.5390625
p = 0.52734375
p = 0.5
p = 0.5
p = 0.5
p = 0.51953125
p = 1.0
p = 0.53125
p = 0.5
p = 0.5
p = 0.5
p = 0.5
p = 0.5
Hm so that's not it either. Whatever the issue is, it's probably happening somewhere in that function. Would you mind pinpointing where the samples are disappearing?
I'm trying to do that, but we're having some difficulty understanding the code, since we're computer science researchers and don't have the required protein knowledge.
In the reroll() function, could you verify that len(self._samples) == 1 and also that torch.sum(dataset_choices) == 0?
Yes, they are: 1 and tensor(0).
Since self._samples only has 1 element inside it, I tried to investigate self._samples[0] by doing print(sum(1 for _ in self._samples[0])). This time, it tells me:
Traceback (most recent call last):
File "train_openfold.py", line 548, in <module>
main(args)
File "train_openfold.py", line 262, in main
data_module.setup()
File "/global/homes/b/bz186/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/pytorch_lightning/core/datamodule.py", line 474, in wrapped_fn
fn(*args, **kwargs)
File "/global/u2/b/bz186/openfold/openfold/data/data_modules.py", line 654, in setup
_roll_at_init=False,
File "/global/u2/b/bz186/openfold/openfold/data/data_modules.py", line 358, in __init__
print(sum(1 for _ in self._samples[0]))
File "/global/u2/b/bz186/openfold/openfold/data/data_modules.py", line 358, in <genexpr>
print(sum(1 for _ in self._samples[0]))
File "/global/u2/b/bz186/openfold/openfold/data/data_modules.py", line 334, in looped_samples
chain_data_cache_entry = chain_data_cache[chain_id]
KeyError: '3wwy_A'
But when I check the alignment_dir, this chain does exist:
(openfold_venv) bz186@nid001520:~/openfold> ls /pscratch/sd/b/bz186/openfold/data/alignment_openfold | grep 3wwy_A
3wwy_A
(openfold_venv) bz186@nid001520:~/openfold> ls /pscratch/sd/b/bz186/openfold/data/alignment_openfold/3wwy_A
bfd_uniclust_hits.a3m mgnify_hits.a3m pdb70_hits.hhr uniref90_hits.a3m
I'm assuming there is something wrong with the file passed by --train_chain_data_cache_path. But I did that pre-processing step using the full mmcif dataset downloaded by the script and the PDB40 file linked in the README. Is there any potential error here, given that the AWS dataset is claimed to have ~130,000 chains while the full mmcif dataset has ~190,000 .cif files?
To clarify, self._samples contains one infinite iterator for each dataset you're using. Each entry thereof feeds you samples from the corresponding dataset, shuffled and filtered in various ways, forever.
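Conceptually, it's something like the following (illustration only, not the actual implementation; the real code also applies the chain filters and weighting discussed above):

import random

def looped_shuffled_indices(n):
    # Yield dataset indices forever, reshuffling after each full pass
    while True:
        order = list(range(n))
        random.shuffle(order)
        yield from order

# Stand-ins for the real dataset objects, just to make the example runnable
datasets = [list(range(10)), list(range(5))]

# One infinite iterator per dataset, analogous to self._samples
samples = [looped_shuffled_indices(len(d)) for d in datasets]
print(next(samples[0]))  # draws the next (shuffled) index from the first dataset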
Yes, it's looking like something is wrong with your train chain data cache. It should contain an entry for every chain for which you have alignments. The total number of .cif files doesn't matter; what controls the size of the dataset objects is the alignment_dir. I'll revisit the chain data cache script and see if I can find any errors. In the meantime, you can add a check in that function we were discussing earlier that skips chains that do not appear in your chain data cache.
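Something along these lines, for instance (a sketch; the exact surrounding loop in looped_samples() may look a little different in your checkout):

# In looped_samples(), before the cache lookup that currently raises the KeyError:
if chain_id not in chain_data_cache:
    # Alignments exist for this chain, but the chain data cache has no entry;
    # skip it rather than crashing.
    continue
chain_data_cache_entry = chain_data_cache[chain_id]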
I also ran into the same problem. I found 395 unique proteins in my msa_dir that weren't in my mmcif_files dir or in my chain_data_cache.json. I removed all the chains for those 395 proteins, and training works for me. I'm guessing there is a difference in the data used to generate the MSAs.
I'm also encountering this issue: a StopIteration exception at line 383, which is now the same line as what was line 380 in @llwx593's screenshot over in #113.
My mmcif_dir looks like this:
100d.cif 101d.cif 101m.cif 102d.cif 102l.cif [...] 9rsa.cif 9rub.cif 9wga.cif 9xia.cif 9xim.cif
My alignment_dir looks like this:
101m_A 102l_A 102m_A 103l_A 103m_A [...] 9msi_A 9pai_A 9pai_B 9pcy_A 9rub_A
and, for example, 101m_A's directory looks like this:
bfd_uniclust_hits.a3m mgnify_hits.a3m pdb70_hits.hhr uniref90_hits.a3m
My chain_data_cache.json looks like this:
{ "2v8c_A": { "release_date": "2007-12-18", "seq": "MAGWQSYVDNLMCDGCCQEAAIVGYCDAKYVWAATAGGVFQSITPVEIDMIVGKDREGFFTNGLTLGAKKCSVIRDSLYVDGDCTMDIRTKSQGGEPTYNVAVGRAGRVLVFVMGKEGVHGGGLNKKAYSMAKYLRDSGF", "resolution": 1.98, "cluster_size": -1 }, [next chain]
and my mmcif_cache.json looks like this:
{ "5d75": { "release_date": "2016-04-06", "chain_ids": [ "A" ], "seqs": [ "PKYTKSVLKKGDKTNFPKKGDVVHCWYTGTLQDGTVFDTNIQTSAKKKKNAKPLSFKVGVGKVIRGWDEALLTMSKGEKARLEIEPEWAYGKKGQPDAKIPPNAKLTFEVELVDIDLEHHHHHH" ], "no_chains": 1, "resolution": 1.83 }, [next chain]
I'm trying to train OpenFold using the RODA-provided alignments, on a SLURM cluster with 4 nodes each with 8 A100s. This error also pops up with a single node with 8 A100s. The command is the following:
python3 /fsx/openbioml/openfold/train_openfold.py /fsx/openbioml/openfold_data/pdb_mmcif/mmcif_files/ /fsx/openbioml/openfold_roda/pdb/ /fsx/openbioml/openfold_data/pdb_mmcif/mmcif_files /fsx/openbioml/openfold_out/ 2021-10-10 --template_release_dates_cache_path /fsx/openbioml/openfold_data/mmcif_cache.json --precision 16 --num_nodes 1 --gpus 8 --replace_sampler_ddp=True --seed 42 --deepspeed_config_path /fsx/openbioml/openfold/deepspeed_config.json --checkpoint_every_epoch --train_chain_data_cache_path /fsx/openbioml/openfold_data/chain_data_cache.json --obsolete_pdbs_file_path /fsx/openbioml/openfold_data/pdb_mmcif/obsolete.dat
This is all on the latest commit. Do you have any recommendation? I'm struggling to fix it.
Could you verify programmatically that every single chain in your alignment_dir has a corresponding .mmcif file in the data_dir? Take all chain names in the former, split on _, and search for an mmcif file matching the PDB code.
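For example, a quick check like this (a sketch; alignment_dir and mmcif_dir below are placeholders for your own paths):

import os

alignment_dir = "/path/to/alignment_dir"       # your alignments directory
mmcif_dir = "/path/to/pdb_mmcif/mmcif_files"   # the data_dir holding the .cif files

# PDB codes for which an mmCIF file exists
mmcif_ids = {f[:-len(".cif")] for f in os.listdir(mmcif_dir) if f.endswith(".cif")}

# Chains whose PDB code has no matching mmCIF file
missing = sorted(
    chain for chain in os.listdir(alignment_dir)
    if chain.split("_")[0] not in mmcif_ids
)

print(f"{len(missing)} chain(s) without a matching mmCIF file:")
print(" ".join(missing))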
Thank you so much for the quick response, Gustaf. You're right: 57 chains appear to be missing. I'm unsure what caused it.
The missing chains are 1erk 3d6s 3rvw 3rvw 3wwy 3wwz 4acq 4ftp 4gbo 4kyu 4pp1 4pp1 4pp1 4roq 4ror 5b7o 5fim 5hbg 5i4g 5l0x 6dfz 6iwl 6l7f 6nuy 6nuy 6nuz 6nuz 6nv0 6nv0 6ors 6qwj 6qwj 6qwj 6qwj 6qwj 6qwj 6qwj 6qwj 6qwj 6qwj 6qwj 6ubg 6ubg 6vig 6vj1 6vmu 6x8v 6xbp 6xdl 6xoe 7ay4 7ayl 7elo 7jnq 7kcd 7ra6 7v7y.
How do you recommend I proceed? Should I just delete their subdirectories in the template directory?
I doubt there were errors during the download and extraction of the PDB mmcif files. It's also weird that the user above reported 395 missing chains while in my case it's only 57...
Delete their alignment_dirs and rerun. I'll look into what's causing this.
Thank you again Gustaf for the quick response.
I moved all the subdirectories to a separate "missing_mmcif" directory so as to make sure no data was lost. I tried rerunning it (same command as before), but I'm now getting a KeyError at line 338 of data_modules.py:
File "/fsx/openbioml/openfold/openfold/data/data_modules.py", line 338, in looped_samples chain_data_cache_entry = chain_data_cache[chain_id] KeyError: '6tif_AAA'
6tif.cif is present in the mmcif directory, and 6tif_AAA is present in the alignment/templates directory with what appear to be all the necessary files: bfd_uniclust_hits.a3m mgnify_hits.a3m pdb70_hits.hhr uniref90_hits.a3m. However, only 6tif_A and 6tif_B are present in chain_data_cache.json. I'll check which other chains are missing there and move their subdirectories away too.
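Roughly this is what I have in mind for that check (a sketch; the three paths are placeholders for my local ones):

import json
import os
import shutil

alignment_dir = "/path/to/alignments"                     # placeholder
chain_data_cache_path = "/path/to/chain_data_cache.json"  # placeholder
quarantine_dir = "/path/to/missing_in_cache"              # where to move offending chains

with open(chain_data_cache_path) as f:
    cached_chains = set(json.load(f).keys())

os.makedirs(quarantine_dir, exist_ok=True)
for chain in os.listdir(alignment_dir):
    if chain not in cached_chains:
        # Chain has alignments but no chain data cache entry; move it aside
        shutil.move(os.path.join(alignment_dir, chain), quarantine_dir)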
Will report back tomorrow.
My chain_data_cache has that chain, and I tried parsing my copy of 6tif.cif and found it there too. Could you verify whether this file matches the copy of 6tif.cif that the installation scripts downloaded for you?