
Training Runtime Error: StopIteration

bozhang-hpc opened this issue 2 years ago · 36 comments

Hi,

I'm using the released training data on AWS and the latest main branch to train the model.

  1. The directory structure of the released data is not recognized by the code.
  2. After re-structuring the directories and putting all the .hhr and .a3m files under the alignment directory, the code crashes with default settings at datapoint_idx = next(samples) (File "/global/u2/b/bz186/openfold/openfold/data/data_modules.py", line 377, in reroll).

Any ideas on how to solve this?

Thanks,

Bo

The full traceback is below:

Traceback (most recent call last):
  File "train_openfold.py", line 548, in <module>
    main(args)
  File "train_openfold.py", line 341, in main
    ckpt_path=ckpt_path,
  File "/global/homes/b/bz186/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 741, in fit
    self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
  File "/global/homes/b/bz186/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 685, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/global/homes/b/bz186/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 777, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/global/homes/b/bz186/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1199, in _run
    self._dispatch()
  File "/global/homes/b/bz186/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1279, in _dispatch
    self.training_type_plugin.start_training(self)
  File "/global/homes/b/bz186/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 202, in start_training
    self._results = trainer.run_stage()
  File "/global/homes/b/bz186/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1289, in run_stage
    return self._run_train()
  File "/global/homes/b/bz186/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1319, in _run_train
    self.fit_loop.run()
  File "/global/homes/b/bz186/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 140, in run
    self.on_run_start(*args, **kwargs)
  File "/global/homes/b/bz186/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/pytorch_lightning/loops/fit_loop.py", line 197, in on_run_start
    self.trainer.reset_train_val_dataloaders(self.trainer.lightning_module)
  File "/global/homes/b/bz186/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/pytorch_lightning/trainer/data_loading.py", line 595, in reset_train_val_dataloaders
    self.reset_train_dataloader(model=model)
  File "/global/homes/b/bz186/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/pytorch_lightning/trainer/data_loading.py", line 365, in reset_train_dataloader
    self.train_dataloader = self.request_dataloader(RunningStage.TRAINING, model=model)
  File "/global/homes/b/bz186/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/pytorch_lightning/trainer/data_loading.py", line 611, in request_dataloader
    dataloader = source.dataloader()
  File "/global/homes/b/bz186/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 300, in dataloader
    return method()
  File "/global/u2/b/bz186/openfold/openfold/data/data_modules.py", line 694, in train_dataloader
    return self._gen_dataloader("train") 
  File "/global/u2/b/bz186/openfold/openfold/data/data_modules.py", line 671, in _gen_dataloader
    dataset.reroll()
  File "/global/u2/b/bz186/openfold/openfold/data/data_modules.py", line 377, in reroll
    datapoint_idx = next(samples)
StopIteration
srun: error: nid001680: task 0: Exited with exit code 1
srun: launch/slurm: _step_signal: Terminating StepId=2466693.0

bozhang-hpc avatar Jun 24 '22 20:06 bozhang-hpc

Hi,

I dug a little deeper into this bug by adding exception handling:

try:
    datapoint_idx = next(samples)
except StopIteration:
    print("samples.length = {}, idx = {}".format(sum(1 for _ in samples), dataset_idx))

but the result shows that samples.length = 0, idx = 0
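As a plain-Python aside (a sketch, not OpenFold code): next() on an already-exhausted generator raises StopIteration, and sum(1 for _ in samples) itself consumes the iterator, so inside the except handler it can only report whatever was left over at that point:

```python
# Minimal sketch, plain Python (not OpenFold code): next() on an
# exhausted generator raises StopIteration, and counting items with
# sum(1 for _ in g) consumes whatever remains in the generator.
def finite_samples(n):
    for i in range(n):
        yield i

samples = finite_samples(0)  # yields nothing, like an empty dataset
try:
    datapoint_idx = next(samples)
except StopIteration:
    datapoint_idx = None

remaining = sum(1 for _ in samples)  # 0: the generator is spent

# Counting also drains a partially-consumed generator:
g = finite_samples(3)
next(g)
left_after_one = sum(1 for _ in g)  # 2, and g is now exhausted
```

So samples.length = 0 here confirms the iterator was already empty by the time reroll() called next() on it.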

just FYI.

Bests,

Bo

bozhang-hpc avatar Jun 24 '22 20:06 bozhang-hpc

Try increasing the length of the training epoch.

gahdritz avatar Jun 24 '22 21:06 gahdritz

Try increasing the length of the training epoch.

So this means set --train_epoch_len to a larger value?

bozhang-hpc avatar Jun 24 '22 21:06 bozhang-hpc

Yes. As for the format of the AWS data, you'll need to reformat it slightly. For each chain, you'll want to flatten the hhr and a3m directories, such that the directory for each chain contains both the hhr files and the a3m files for that chain.
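A rough shell sketch of that flattening step (paths and layout are illustrative; it is demonstrated here on a throwaway tree under mktemp, so point ROOT at your real alignment directory instead):

```shell
# Sketch: flatten per-chain hhr/ and a3m/ subdirectories so each chain
# directory holds its .hhr and .a3m files directly. Demo tree only.
ROOT=$(mktemp -d)
mkdir -p "$ROOT/11as_A/a3m" "$ROOT/11as_A/hhr"
touch "$ROOT/11as_A/a3m/uniref90_hits.a3m" "$ROOT/11as_A/hhr/pdb70_hits.hhr"

for chain in "$ROOT"/*/; do            # each per-chain directory
    for sub in a3m hhr; do
        if [ -d "$chain$sub" ]; then
            mv "$chain$sub"/* "$chain" # hoist files up one level
            rmdir "$chain$sub"         # drop the now-empty subdir
        fi
    done
done
```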

gahdritz avatar Jun 24 '22 21:06 gahdritz

Yeah, I've done that. But I found that some chains don't have all 4 files (3 .a3m and 1 .hhr).

By the way, I've set --train_epoch_len to 80000 but it still fails. If I set it to <= 1200 it works, but that doesn't make sense.

bozhang-hpc avatar Jun 24 '22 21:06 bozhang-hpc

Could you share your training command? Also, have you modified your config at all?

gahdritz avatar Jun 24 '22 21:06 gahdritz

srun python3 train_openfold.py \
/pscratch/sd/b/bz186/openfold/data/pdb_mmcif/mmcif_files \ 
/pscratch/sd/b/bz186/openfold/data/alignment_openfold \ 
/pscratch/sd/b/bz186/openfold/data/pdb_mmcif/mmcif_files \
/pscratch/sd/b/bz186/openfold/data/train_full_output \
2021-10-10 \
--template_release_dates_cache_path=/pscratch/sd/b/bz186/openfold/data/mmcif_cache.json \
--precision=32 \
--gpus=4 \
--replace_sampler_ddp=True \
--seed=42 \
--deepspeed_config_path=/global/homes/b/bz186/openfold/deepspeed_config.json \
--checkpoint_every_epoch \
--obsolete_pdbs_file_path=/pscratch/sd/b/bz186/openfold/data/pdb_mmcif/obsolete.dat \
--train_chain_data_cache_path=/pscratch/sd/b/bz186/openfold/data/chain_data_cache.json \
--train_epoch_len=80000

I didn't modify anything compared to the latest main branch.

bozhang-hpc avatar Jun 24 '22 21:06 bozhang-hpc

Is this being run on a single SLURM node, or multiple?

gahdritz avatar Jun 24 '22 21:06 gahdritz

A single node with 4 GPUs; I ran it in interactive mode.

bozhang-hpc avatar Jun 24 '22 21:06 bozhang-hpc

Hm not able to reproduce. Could you share a sample of the directory structure of /pscratch/sd/b/bz186/openfold/data/alignment_openfold? Also, try placing a print statement in the if block starting on line 333 of openfold/data/data_modules.py, where it filters out certain chains using the chain data cache. Is every protein in your dataset getting filtered?

gahdritz avatar Jun 24 '22 21:06 gahdritz

/pscratch/sd/b/bz186/openfold/data/alignment_openfold
    - 11as_A
        - bfd_uniclust_hits.a3m
        - mgnify_hits.a3m
        - pdb70_hits.hhr
        - uniref90_hits.a3m
    - 11ba_A
        - bfd_uniclust_hits.a3m
        - mgnify_hits.a3m
        - pdb70_hits.hhr
        - uniref90_hits.a3m
    - 11ba_B
    - 11bg_A
    - 11bg_B
    - 11gs_A

The directory tree is similar to the above for all chains, except that some don't have all 4 files.

bozhang-hpc avatar Jun 24 '22 21:06 bozhang-hpc

How about the filter thing (I edited my previous message)?

gahdritz avatar Jun 24 '22 21:06 gahdritz

It prints some of the chain_data_cache_entry like this: {'release_date': '2011-02-09', 'seq': 'MSAGKLPEGWVIAPVSTVTTLIRGVTYKKEQAINYLKDDYLPLIRANNIQNGKFDTTDLVFVPKNLVKESQKISPEDIVIAMSSGSKSVVGKSAHQHLPFECSFGAFCGVLRPEKLIFSGFIAHFTKSSLYRNKISSLSAGANINNIKPASFDLINIPIPPLAEQKIIAEKLDTLLAQVDSTKARFEQIPQILKRFRQAVLGGAVNGKLTEKWRNFEPQHSVFKKLNFESILTELRNGLSSKPNESGVGHPILRISSVRAGHVDQNDIRFLECSESELNRHKLQDGDLLFTRYNGSLEFVGVCGLLKKLQHQNLLYPDKLIRARLTKDALPEYIEIFFSSPSARNAMMNCVKTTSGQKGISGKDIKSQVVLLPPVKEQAEIVRRVEQLFAYADTIEKQVNNALARVNNLTQSILAKAFRGELTAQWRAENPDLISGENSAAALLEKIKAERAASGGKKASRKKS', 'resolution': 18.0, 'cluster_size': -1}

The counter I put in the if block reaches 26 before the crash.

bozhang-hpc avatar Jun 24 '22 21:06 bozhang-hpc

Could you count how many of your proteins are getting filtered out by that function? I see, for example, that this protein's resolution value is very high (18.0, i.e. very poor resolution), so this one gets filtered for that reason.

gahdritz avatar Jun 24 '22 22:06 gahdritz

As noted above, the counter in the if block reaches 26 before the crash.

bozhang-hpc avatar Jun 24 '22 22:06 bozhang-hpc

How about the probabilities that come out of the stochastic filter? Are those really small?

gahdritz avatar Jun 24 '22 22:06 gahdritz

I printed p right after

p = get_stochastic_train_filter_prob(
    chain_data_cache_entry,
)

Almost all of the values are > 0.5:

p = 1.0
p = 0.609375
p = 0.880859375
p = 0.5
p = 0.5
p = 0.5
p = 0.693359375
p = 0.5
p = 0.5
p = 0.5
p = 0.5
p = 0.6640625
p = 0.822265625
p = 0.001953125
p = 0.5
p = 0.5
p = 0.9140625
p = 0.5
p = 0.5390625
p = 0.52734375
p = 0.5
p = 0.5
p = 0.5
p = 0.51953125
p = 1.0
p = 0.53125
p = 0.5
p = 0.5
p = 0.5
p = 0.5
p = 0.5

bozhang-hpc avatar Jun 24 '22 22:06 bozhang-hpc

Hm so that's not it either. Whatever the issue is, it's probably happening somewhere in that function. Would you mind pinpointing where the samples are disappearing?

gahdritz avatar Jun 24 '22 22:06 gahdritz

I'm trying to do that, but we're having some difficulty understanding the code, since we're computer science researchers and don't have the required protein-domain knowledge.

bozhang-hpc avatar Jun 24 '22 22:06 bozhang-hpc

In the reroll() function, could you verify that len(self._samples) == 1 and also that torch.sum(dataset_choices) == 0?

gahdritz avatar Jun 24 '22 23:06 gahdritz

Yes, they are: 1 and tensor(0).

bozhang-hpc avatar Jun 24 '22 23:06 bozhang-hpc

Since self._samples only has one element, I tried to investigate self._samples[0] by adding print(sum(1 for _ in self._samples[0])). This time, it tells me:

Traceback (most recent call last):
  File "train_openfold.py", line 548, in <module>
    main(args)
  File "train_openfold.py", line 262, in main
    data_module.setup()
  File "/global/homes/b/bz186/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/pytorch_lightning/core/datamodule.py", line 474, in wrapped_fn
    fn(*args, **kwargs)
  File "/global/u2/b/bz186/openfold/openfold/data/data_modules.py", line 654, in setup
    _roll_at_init=False,
  File "/global/u2/b/bz186/openfold/openfold/data/data_modules.py", line 358, in __init__
    print(sum(1 for _ in self._samples[0]))
  File "/global/u2/b/bz186/openfold/openfold/data/data_modules.py", line 358, in <genexpr>
    print(sum(1 for _ in self._samples[0]))
  File "/global/u2/b/bz186/openfold/openfold/data/data_modules.py", line 334, in looped_samples
    chain_data_cache_entry = chain_data_cache[chain_id]
KeyError: '3wwy_A'

But when I check the alignment_dir, this chain does exist:

(openfold_venv) bz186@nid001520:~/openfold> ls /pscratch/sd/b/bz186/openfold/data/alignment_openfold | grep 3wwy_A
3wwy_A
(openfold_venv) bz186@nid001520:~/openfold> ls /pscratch/sd/b/bz186/openfold/data/alignment_openfold/3wwy_A
bfd_uniclust_hits.a3m  mgnify_hits.a3m	pdb70_hits.hhr	uniref90_hits.a3m

I'm assuming there is something wrong with the file passed via --train_chain_data_cache_path. But I ran that pre-processing step using the full mmCIF dataset downloaded by the script and the PDB40 in the README link. Could the problem come from the fact that the AWS dataset is claimed to have ~130,000 chains, while the full mmCIF dataset has ~190,000 .cif files?
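A small script along these lines can confirm the mismatch (a sketch; the flat alignment-dir layout and the function name are assumptions based on this thread):

```python
# Sketch: list chain IDs that have an alignment subdirectory but no
# entry in the chain data cache JSON. Any chain it reports would hit
# the KeyError above.
import json
import os

def chains_missing_from_cache(alignment_dir, cache_path):
    with open(cache_path) as f:
        cache_keys = set(json.load(f))
    alignment_chains = {
        name for name in os.listdir(alignment_dir)
        if os.path.isdir(os.path.join(alignment_dir, name))
    }
    return sorted(alignment_chains - cache_keys)
```

Run it with your alignment_openfold path and the file you pass to --train_chain_data_cache_path; a non-empty result means the cache does not cover your alignments.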

bozhang-hpc avatar Jun 24 '22 23:06 bozhang-hpc

To clarify, self._samples contains one infinite iterator for each dataset you're using. Each entry thereof feeds you samples from the corresponding dataset shuffled + filtered in various ways forever.

Yes, it's looking like something is wrong with your train chain data cache. It should contain an entry for every chain for which you have alignments. The total number of .cif files doesn't matter; what controls the size of the dataset objects is the alignment_dir. I'll revisit the chain data cache script and see if I can find any errors. In the meantime, you can add a check in that function we were discussing earlier that skips chains that do not appear in your chain data cache.
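The suggested guard might look roughly like this (a sketch, not verbatim OpenFold code; the looped_samples internals are simplified):

```python
# Sketch of the suggested guard (simplified, not verbatim OpenFold
# code): skip chains that have alignments but no cache entry, instead
# of letting chain_data_cache[chain_id] raise KeyError.
def iter_cached_chains(chain_ids, chain_data_cache):
    for chain_id in chain_ids:
        entry = chain_data_cache.get(chain_id)
        if entry is None:
            continue  # no cache entry for this chain; skip it
        yield chain_id, entry
```

In looped_samples, this pattern would replace the direct chain_data_cache[chain_id] lookup.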

gahdritz avatar Jun 24 '22 23:06 gahdritz

I also ran into the same problem. I found 395 unique proteins in my msa_dir that weren't in my mmcif_files dir or in my chain_data_cache.json. I removed all the chains for those 395 proteins, and training now works for me. I'm guessing there is a difference in the data used to generate the MSAs.

sharish24 avatar Jul 05 '22 19:07 sharish24

I'm also encountering this issue: a StopIteration exception at line 383, which is now the same line as what was line 380 in @llwx593's screenshot over in #113.

My mmcif_dir looks like this: 100d.cif 101d.cif 101m.cif 102d.cif 102l.cif [...] 9rsa.cif 9rub.cif 9wga.cif 9xia.cif 9xim.cif

My alignment_dir looks like this: 101m_A 102l_A 102m_A 103l_A 103m_A [...] 9msi_A 9pai_A 9pai_B 9pcy_A 9rub_A and, for example 101m_A's directory looks like this: bfd_uniclust_hits.a3m mgnify_hits.a3m pdb70_hits.hhr uniref90_hits.a3m

My chain_data_cache.json looks like this:

{ "2v8c_A": { "release_date": "2007-12-18", "seq": "MAGWQSYVDNLMCDGCCQEAAIVGYCDAKYVWAATAGGVFQSITPVEIDMIVGKDREGFFTNGLTLGAKKCSVIRDSLYVDGDCTMDIRTKSQGGEPTYNVAVGRAGRVLVFVMGKEGVHGGGLNKKAYSMAKYLRDSGF", "resolution": 1.98, "cluster_size": -1 }, [next chain]

and my mmcif_cache.json looks like this:

{ "5d75": { "release_date": "2016-04-06", "chain_ids": [ "A" ], "seqs": [ "PKYTKSVLKKGDKTNFPKKGDVVHCWYTGTLQDGTVFDTNIQTSAKKKKNAKPLSFKVGVGKVIRGWDEALLTMSKGEKARLEIEPEWAYGKKGQPDAKIPPNAKLTFEVELVDIDLEHHHHHH" ], "no_chains": 1, "resolution": 1.83 }, [next chain]

I'm trying to train OpenFold using the RODA-provided alignments, on a SLURM cluster with 4 nodes each with 8 A100s. This error also pops up with a single node with 8 A100s. The command is the following: python3 /fsx/openbioml/openfold/train_openfold.py /fsx/openbioml/openfold_data/pdb_mmcif/mmcif_files/ /fsx/openbioml/openfold_roda/pdb/ /fsx/openbioml/openfold_data/pdb_mmcif/mmcif_files /fsx/openbioml/openfold_out/ 2021-10-10 --template_release_dates_cache_path /fsx/openbioml/openfold_data/mmcif_cache.json --precision 16 --num_nodes 1 --gpus 8 --replace_sampler_ddp=True --seed 42 --deepspeed_config_path /fsx/openbioml/openfold/deepspeed_config.json --checkpoint_every_epoch --train_chain_data_cache_path /fsx/openbioml/openfold_data/chain_data_cache.json --obsolete_pdbs_file_path /fsx/openbioml/openfold_data/pdb_mmcif/obsolete.dat

This is all on the latest commit. Do you have any recommendation? I'm struggling to fix it.

NZ99 avatar Jul 28 '22 16:07 NZ99

Could you verify programmatically that every single chain in your alignment_dir has a corresponding .mmcif file in the data_dir? Take all chain names in the former, split on _, and search for an mmcif file matching the PDB code.
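A sketch of that check (paths assumed; chain directories named like 101m_A, mmCIF files named like 101m.cif):

```python
# Sketch: report every chain directory in alignment_dir whose PDB code
# (the part before "_") has no matching .cif file in data_dir.
import os

def chains_without_mmcif(alignment_dir, data_dir):
    missing = []
    for chain in sorted(os.listdir(alignment_dir)):
        pdb_id = chain.split("_")[0]
        if not os.path.isfile(os.path.join(data_dir, pdb_id + ".cif")):
            missing.append(chain)
    return missing
```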

gahdritz avatar Jul 28 '22 18:07 gahdritz

Thank you so much for the quick response Gustaf. You're right -- 57 chains appear to be missing. I'm unsure what caused it.

The missing chains are 1erk 3d6s 3rvw 3rvw 3wwy 3wwz 4acq 4ftp 4gbo 4kyu 4pp1 4pp1 4pp1 4roq 4ror 5b7o 5fim 5hbg 5i4g 5l0x 6dfz 6iwl 6l7f 6nuy 6nuy 6nuz 6nuz 6nv0 6nv0 6ors 6qwj 6qwj 6qwj 6qwj 6qwj 6qwj 6qwj 6qwj 6qwj 6qwj 6qwj 6ubg 6ubg 6vig 6vj1 6vmu 6x8v 6xbp 6xdl 6xoe 7ay4 7ayl 7elo 7jnq 7kcd 7ra6 7v7y.

How do you recommend I proceed? Should I just delete their subdirectories in the template directory?

I doubt there were errors during the download and extraction of the PDB mmcif files. It's also weird that the user above reported 395 missing chains while in my case it's only 57...

NZ99 avatar Jul 28 '22 19:07 NZ99

Delete their alignment_dirs and rerun. I'll look into what's causing this.

gahdritz avatar Jul 28 '22 20:07 gahdritz

Thank you again Gustaf for the quick response.

I moved all the subdirectories to a separate "missing_mmcif" directory to make sure no data was lost. I tried rerunning (same command as before), but I'm now getting a KeyError at line 338 of data_modules.py:

File "/fsx/openbioml/openfold/openfold/data/data_modules.py", line 338, in looped_samples
    chain_data_cache_entry = chain_data_cache[chain_id]
KeyError: '6tif_AAA'

6tif.cif is present in the mmcif directory, and 6tif_AAA is present in the alignment/templates directory with what appear to be all the necessary files: bfd_uniclust_hits.a3m, mgnify_hits.a3m, pdb70_hits.hhr, uniref90_hits.a3m. However, only 6tif_A and 6tif_B are present in chain_data_cache.json; I'll check which other chains are missing there and move their subdirectories away too.

Will report back tomorrow.

NZ99 avatar Jul 28 '22 22:07 NZ99

(screenshot omitted) My chain_data_cache has that chain, and I tried parsing my copy of 6tif.cif and found the chains there too.

Could you verify that this file does not match the copy of 6tif.cif the installation scripts downloaded for you?

gahdritz avatar Jul 28 '22 23:07 gahdritz