openfold: A question about the parameter cluster_file in generate_chain_data_cache.py

generate_chain_data_cache.py requires "clusters-by-entity-40.txt" as its cluster_file parameter, but I can't find this file anywhere on my system. How is "clusters-by-entity-40.txt" generated?

Thanks so much!

Feb 13 '22 08:02 Wenjian-Ma

The cluster file isn't downloaded by default by the installation script, as the cluster file you'll want to use depends on your choice of training databases. For the default AlphaFold databases, you'll want to download the cluster file linked in the screenshot you sent. Hope this clears things up.

Feb 14 '22 18:02 gahdritz

Thanks so much! Another question: I followed the procedure DeepMind used and ran "python3 scripts/precompute_alignments.py" to generate MSAs, but it computes MSAs for ~180,000 protein chains, which takes a lot of time and computing resources. Can I compute just ~400 MSAs so that I can get the code running, regardless of model quality?

Feb 15 '22 12:02 Wenjian-Ma

Just move 400 mmCIF files to a new directory and run the script on those.

Feb 15 '22 17:02 gahdritz
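
A minimal sketch of the suggestion above, not taken from the thread: it copies (rather than moves) a fixed number of mmCIF files into a fresh directory so the full download stays intact. The function name, the `*.cif` pattern, and the directory names in the usage note are assumptions for illustration.

```python
import shutil
import tempfile
from pathlib import Path


def copy_mmcif_subset(src_dir, dst_dir, limit=400):
    """Copy the first `limit` .cif files (sorted by name) from src_dir to dst_dir."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(src.glob("*.cif"))[:limit]:
        shutil.copy2(path, dst / path.name)  # copy, so the source set is untouched
        copied.append(path.name)
    return copied


if __name__ == "__main__":
    # Self-contained demo on throwaway files; real use would point at actual data.
    with tempfile.TemporaryDirectory() as tmp:
        full = Path(tmp) / "mmcif_full"
        full.mkdir()
        for name in ("1abc", "2def", "3ghi"):
            (full / f"{name}.cif").write_text("data_placeholder\n")
        print(copy_mmcif_subset(full, Path(tmp) / "mmcif_small", limit=2))
```

In practice one would call something like `copy_mmcif_subset("mmcif_dir", "mmcif_subset", limit=400)` and then run the alignment script on the smaller directory.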

Thanks for your answer! I have one more question: is the value of the first path parameter, "mmcif_dir/", the same as the value of the third path parameter, "template_mmcif_dir/"?

Feb 17 '22 08:02 Wenjian-Ma

mmcif_dir contains training data in the form of .cif mmCIF files or PDB files with all of the columns output by OpenFold (the latter format is intended to be used for self-distillation chains). template_mmcif_dir contains mmCIF files for template chains. Sometimes, these can refer to the same directory (during the first phase of AlphaFold 2 training, for example).

Feb 17 '22 18:02 gahdritz

Thanks! Also, can OpenFold be run by following readme.txt as written? "train_openfold.py" requires a parameter "max_template_date", but readme.txt doesn't seem to set it, as shown in the figure.

Feb 22 '22 03:02 Wenjian-Ma

It's a positional parameter on the second line ("2021-10-10")

Feb 22 '22 03:02 gahdritz
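
For readers unfamiliar with positional parameters, here is a toy argparse sketch of why a value such as "2021-10-10" can appear on the command line without a flag name in front of it. The parser below is a simplified stand-in, not the real interface of train_openfold.py: the argument order is illustrative and the `--seed` flag is hypothetical.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Toy stand-in for a training script")
    # Positional parameters: identified purely by their order on the command line.
    parser.add_argument("mmcif_dir")
    parser.add_argument("template_mmcif_dir")
    parser.add_argument("max_template_date")
    # Optional parameter (hypothetical): must be introduced by its flag name.
    parser.add_argument("--seed", type=int, default=None)
    return parser


# The third bare value is bound to max_template_date; no "--max_template_date" is needed.
args = build_parser().parse_args(["mmcif_dir/", "template_mmcif_dir/", "2021-10-10"])
print(args.max_template_date)  # prints 2021-10-10
```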

Yep... sorry, I was careless. I have a new question: I computed 370 alignments covering 151 proteins, but when I ran "generate_chain_data_cache.py", the output file "chain_data_cache.json" contained only "{}". Then, when I ran "train_openfold.py", I got the error "chain_data_cache_entry = chain_data_cache[chain_id] KeyError: '5g2e_A'". How can I fix this? Do you have any suggestions?

Feb 22 '22 04:02 Wenjian-Ma

I just pushed a commit that changed the default behavior of the chain data cache file. Previously, chains that didn't appear in the chain cluster file were excluded entirely. Now, they're set to -1. Try the same experiment again.

Feb 22 '22 22:02 gahdritz
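
The behavior change described above can be illustrated with a small sketch. This is not the code from generate_chain_data_cache.py: the function name, the `keep_unclustered` switch, and the "cluster_size" field name are assumptions, chosen only to show why an all-excluded cache comes out as "{}" and why a -1 placeholder avoids the KeyError.

```python
def build_chain_cache(chain_ids, cluster_sizes, keep_unclustered=True):
    """Build a per-chain cache; chains absent from the cluster data get -1 or are dropped."""
    cache = {}
    for chain_id in chain_ids:
        if chain_id in cluster_sizes:
            cache[chain_id] = {"cluster_size": cluster_sizes[chain_id]}
        elif keep_unclustered:
            # New behavior: keep the chain, mark its cluster as unknown.
            cache[chain_id] = {"cluster_size": -1}
        # Old behavior (keep_unclustered=False): the chain is silently excluded.
    return cache


chains = ["5g2e_A", "5g2e_B"]
no_matches = {}  # none of the chains appear in the cluster file

old_cache = build_chain_cache(chains, no_matches, keep_unclustered=False)
new_cache = build_chain_cache(chains, no_matches, keep_unclustered=True)

print(old_cache)            # prints {}
print(new_cache["5g2e_A"])  # prints {'cluster_size': -1}
```

With the old behavior, a later lookup such as `old_cache["5g2e_A"]` raises KeyError, which matches the error reported above.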

The base class of "OpenFoldWrapper" is "LightningModule", which also has no attribute '_compute_validation_metrics'. Is this a bug?

Mar 01 '22 06:03 Wenjian-Ma

This has since been fixed. Try pulling one more time.

Mar 01 '22 19:03 gahdritz

Thanks! But there is a new error. Should I change the "num_workers" parameter in config.py? The error is shown in the following figure:

Mar 02 '22 12:03 Wenjian-Ma

That's odd---I've never seen this one before. Could you share more information about what you might be doing to trigger it? Possibly share minimal reproduction code?

Mar 02 '22 19:03 gahdritz

I didn't change your code; I just ran it following your readme.txt.

Mar 03 '22 05:03 Wenjian-Ma

The training data is 370 alignments covering 151 proteins. The error information is as follows: error information.txt

Mar 03 '22 05:03 Wenjian-Ma

I tried reproducing this earlier but it worked normally. What versions of each package are you running?

Mar 03 '22 05:03 gahdritz

Package versions: package version.txt

Mar 03 '22 05:03 Wenjian-Ma

I resolved this problem by inserting the line "torch.multiprocessing.set_sharing_strategy('file_system')".

Mar 13 '22 19:03 Wenjian-Ma
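
A sketch of one way to apply the workaround above. The thread does not say where the line was inserted, so the placement (early in the script, before any DataLoader workers are started) is an assumption, and the helper name is hypothetical. The import guard exists only so the snippet can be loaded on machines without PyTorch.

```python
try:
    import torch.multiprocessing as mp
except ImportError:  # PyTorch not installed; the helper then does nothing
    mp = None


def use_file_system_sharing():
    """Switch PyTorch's tensor-sharing strategy to 'file_system' and report the result."""
    if mp is None:
        return None
    if "file_system" in mp.get_all_sharing_strategies():
        mp.set_sharing_strategy("file_system")
    return mp.get_sharing_strategy()


if __name__ == "__main__":
    # Assumed placement: call this once at startup, before data loading begins.
    print(use_file_system_sharing())
```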

But now I get the same problem as #50, i.e., the loss is NaN.

Mar 13 '22 19:03 Wenjian-Ma

I can't see the download link for PDB40 in the current main branch. I found one here, though: https://www.rcsb.org/docs/programmatic-access/file-download-services

Jun 07 '23 05:06 mooninrain
