
A question about the parameter cluster_file in generate_chain_data_cache.py.

Open Wenjian-Ma opened this issue 3 years ago • 20 comments

image "clusters-by-entity-40.txt" is required as the cluster_file parameter of generate_chain_data_cache.py, but I can't find this file anywhere on my machine. How is "clusters-by-entity-40.txt" generated?

Thanks so much!

Wenjian-Ma avatar Feb 13 '22 08:02 Wenjian-Ma

The cluster file isn't downloaded by default by the installation script, as the cluster file you'll want to use depends on your choice of training databases. For the default AlphaFold databases, you'll want to download the cluster file linked in the screenshot you sent. Hope this clears things up.

gahdritz avatar Feb 14 '22 18:02 gahdritz

Thanks so much! I have another question: I followed the same procedure DeepMind used and ran "python3 scripts/precompute_alignments.py" to generate MSAs, but this computes MSAs for ~180,000 protein chains and takes a lot of time and computing resources. Can I compute only ~400 MSAs, just to check that the code runs successfully, regardless of model quality?

Wenjian-Ma avatar Feb 15 '22 12:02 Wenjian-Ma

Just move 400 mmCIF files to a new directory and run the script on those.
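As an illustration of this suggestion (not code from the OpenFold repository), here is a minimal sketch that selects a small subset of mmCIF files. It copies rather than moves the files so the full download stays intact. The function name, the directory names, and the `*.cif` file pattern are all assumptions.

```python
import random
import shutil
from pathlib import Path


def copy_mmcif_subset(src_dir, dst_dir, n=400, seed=0):
    """Copy up to `n` randomly chosen .cif files from src_dir into dst_dir."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    files = sorted(src.glob("*.cif"))
    # A fixed seed keeps the chosen subset reproducible between runs.
    chosen = random.Random(seed).sample(files, min(n, len(files)))
    for path in chosen:
        shutil.copy2(path, dst / path.name)
    return [p.name for p in chosen]


# Example (hypothetical paths):
# copy_mmcif_subset("mmcif_dir", "mmcif_subset", n=400)
```

The new directory can then be passed to the alignment script in place of the full mmCIF directory.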

gahdritz avatar Feb 15 '22 17:02 gahdritz

image

Thanks for your answer! I have one more question: is the value of the first path parameter, "mmcif_dir/", the same as the value of the third path parameter, "template_mmcif_dir/"?

Wenjian-Ma avatar Feb 17 '22 08:02 Wenjian-Ma

mmcif_dir contains training data in the form of .cif mmCIF files or PDB files with all of the columns output by OpenFold (the latter format is intended to be used for self-distillation chains). template_mmcif_dir contains mmCIF files for template chains. Sometimes, these can refer to the same directory (during the first phase of AlphaFold 2 training, for example).

gahdritz avatar Feb 17 '22 18:02 gahdritz

Thanks! Also, can OpenFold be run exactly as described in readme.txt? "train_openfold.py" requires a "max_template_date" parameter, but the readme command doesn't appear to set it, as shown in the figure. image

Wenjian-Ma avatar Feb 22 '22 03:02 Wenjian-Ma

It's a positional parameter on the second line ("2021-10-10").
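To illustrate how a positional parameter is picked up without a flag name, here is a small argparse sketch. The parser below is hypothetical: the argument names echo those mentioned in this thread, but their order and the `--seed` option are assumptions, not the actual interface of train_openfold.py.

```python
import argparse

# Hypothetical parser: argument names and order are illustrative only.
parser = argparse.ArgumentParser()
parser.add_argument("mmcif_dir")
parser.add_argument("alignment_dir")
parser.add_argument("template_mmcif_dir")
parser.add_argument("output_dir")
parser.add_argument("max_template_date")  # positional: no "--" prefix
parser.add_argument("--seed", type=int, default=None)  # optional flag

# Positional values are matched purely by their order on the command line.
args = parser.parse_args([
    "mmcif_dir/", "alignment_dir/", "template_mmcif_dir/", "output_dir/",
    "2021-10-10",
    "--seed", "42",
])
print(args.max_template_date)  # prints 2021-10-10
```

Because the date is matched by position, it appears in the command as a bare value with no "--max_template_date" in front of it, which makes it easy to overlook.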

gahdritz avatar Feb 22 '22 03:02 gahdritz

Yep, sorry, I was careless. I have a new question: I computed 370 alignments, covering 151 proteins, but when I run "generate_chain_data_cache.py", the output file "chain_data_cache.json" contains only "{}". As a result, "train_openfold.py" fails with the error "chain_data_cache_entry = chain_data_cache[chain_id] KeyError: '5g2e_A'". How can I fix this? Do you have any suggestions? image
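One way to spot this kind of mismatch before training is to compare the cache against the alignment directory. The following is a rough sketch, not part of OpenFold; it assumes the alignment directory contains one sub-directory per chain, named by chain ID (for example "5g2e_A"), and the function name and paths are hypothetical.

```python
import json
from pathlib import Path


def missing_chains(cache_path, alignment_dir):
    """Return alignment sub-directory names that have no entry in the cache."""
    with open(cache_path) as fh:
        cache = json.load(fh)
    # Assumption: each sub-directory name is a chain ID such as "5g2e_A".
    chain_ids = sorted(p.name for p in Path(alignment_dir).iterdir() if p.is_dir())
    return [cid for cid in chain_ids if cid not in cache]


# Example (hypothetical paths):
# print(missing_chains("chain_data_cache.json", "alignment_dir"))
```

With a cache file containing only "{}", every chain ID would be reported as missing.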

Wenjian-Ma avatar Feb 22 '22 04:02 Wenjian-Ma

I just pushed a commit that changed the default behavior of the chain data cache file. Previously, chains that didn't appear in the chain cluster file were excluded entirely. Now, they're set to -1. Try the same experiment again.
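The change described here can be pictured with a short sketch. This is not the code from the commit. It assumes the cluster file lists one cluster per line as whitespace-separated chain IDs, and that the value stored per chain is its cluster size; the thread does not say which field receives -1, so both points are assumptions, as is the function name.

```python
def cluster_sizes(cluster_lines, chain_ids, default=-1):
    """Map each chain ID to the size of its cluster, or `default` if absent."""
    size_of = {}
    for line in cluster_lines:
        members = line.split()
        for member in members:
            size_of[member] = len(members)
    # Chains missing from the cluster file are kept and given the default,
    # rather than being dropped from the result.
    return {cid: size_of.get(cid, default) for cid in chain_ids}


sizes = cluster_sizes(["1abc_A 2xyz_B", "3def_A"], ["1abc_A", "5g2e_A"])
print(sizes)  # prints {'1abc_A': 2, '5g2e_A': -1}
```

Under the earlier behavior, a chain such as "5g2e_A" would have been left out of the mapping altogether, which is consistent with the empty cache and the KeyError reported above.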

gahdritz avatar Feb 22 '22 22:02 gahdritz

image

image The base class of "OpenFoldWrapper" is "LightningModule", which also has no attribute '_compute_validation_metrics'. Is this a bug?

Wenjian-Ma avatar Mar 01 '22 06:03 Wenjian-Ma

This has since been fixed. Try pulling one more time.

gahdritz avatar Mar 01 '22 19:03 gahdritz

Thanks! But there is a new error: image Should I change the "num_workers" parameter in config.py, as shown in the following figure? image

Wenjian-Ma avatar Mar 02 '22 12:03 Wenjian-Ma

That's odd---I've never seen this one before. Could you share more information about what you might be doing to trigger it? Possibly share minimal reproduction code?

gahdritz avatar Mar 02 '22 19:03 gahdritz

image I didn't change your code; I just ran it following your readme.txt.

Wenjian-Ma avatar Mar 03 '22 05:03 Wenjian-Ma

The training data is 370 alignments, covering 151 proteins. The error information is attached: error information.txt

Wenjian-Ma avatar Mar 03 '22 05:03 Wenjian-Ma

I tried reproducing this earlier but it worked normally. What versions of each package are you running?

gahdritz avatar Mar 03 '22 05:03 gahdritz

Package versions: package version.txt

Wenjian-Ma avatar Mar 03 '22 05:03 Wenjian-Ma

I resolved this problem by inserting the line "torch.multiprocessing.set_sharing_strategy('file_system')".
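The thread does not say where this line was inserted. A minimal sketch, assuming it goes near the top of the training script so that it runs before any DataLoader worker processes are created:

```python
import torch.multiprocessing as mp

# Run once in the main process, before any DataLoader workers are started.
# The check guards against platforms where this strategy is unavailable.
if "file_system" in mp.get_all_sharing_strategies():
    mp.set_sharing_strategy("file_system")
```

With this strategy, PyTorch shares tensors between processes through named shared-memory files instead of passing file descriptors, so it tends to help when many workers exchange many tensors. Whether that is the cause of the error above cannot be confirmed from the thread, since the traceback was posted only as an image.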

Wenjian-Ma avatar Mar 13 '22 19:03 Wenjian-Ma

But now I get the same problem as #50, i.e., the loss is NaN.

Wenjian-Ma avatar Mar 13 '22 19:03 Wenjian-Ma

I can't see the download link for PDB40 in the current main branch. I found one here, though: https://www.rcsb.org/docs/programmatic-access/file-download-services

mooninrain avatar Jun 07 '23 05:06 mooninrain