openfold
A question about the parameter cluster_file in generate_chain_data_cache.py.
"clusters-by-entity-40.txt" as the parameter cluster_file is required by generate_chain_data_cache.py. But I don't find this file in my path, so I want to know how the file "clusters-by-entity-40.txt" generates.
Thanks so much!
The cluster file isn't downloaded by default by the installation script, as the cluster file you'll want to use depends on your choice of training databases. For the default AlphaFold databases, you'll want to download the cluster file linked in the screenshot you sent. Hope this clears things up.
Thanks so much! There is another question: I followed the same procedure DeepMind used by running "python3 scripts/precompute_alignments.py" to generate MSAs, but this computes MSAs for ~180,000 protein chains, which takes a lot of time and computing resources. Can I compute just ~400 MSAs so that the code runs successfully, regardless of model quality?
Just move 400 mmCIF files to a new directory and run the script on those.
Thanks for your answer! There is still a question I need your help with: is the value of the first path parameter, "mmcif_dir/", the same as the value of the third path parameter, "template_mmcif_dir/"?
mmcif_dir contains training data in the form of .cif mmCIF files or PDB files with all of the columns output by OpenFold (the latter format is intended to be used for self-distillation chains). template_mmcif_dir contains mmCIF files for template chains. Sometimes these can refer to the same directory (during the first phase of AlphaFold 2 training, for example).
Thanks! Besides, can OpenFold be run by following readme.txt? "train_openfold.py" requires a parameter "max_template_date", but it isn't set in readme.txt, as shown in the figure.
It's a positional parameter on the second line ("2021-10-10").
Yep... sorry... I was careless. There is a new question: I have computed 370 alignments, covering 151 proteins, but when I run "generate_chain_data_cache.py", I get an output file "chain_data_cache.json" whose content is just "{}". So when I run "train_openfold.py", I get an error: "chain_data_cache_entry = chain_data_cache[chain_id] KeyError: '5g2e_A'". How can I solve this error? Do you have any suggestions?
I just pushed a commit that changed the default behavior of the chain data cache file. Previously, chains that didn't appear in the chain cluster file were excluded entirely. Now, they're set to -1. Try the same experiment again.
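To illustrate the new behavior (a sketch with made-up field values; the real cache stores additional per-chain metadata): a chain that doesn't appear in the cluster file is now kept in the cache with a cluster_size of -1 instead of being dropped, so the KeyError above should no longer occur.

```python
import json

# Hypothetical chain_data_cache.json contents after the fix: the chain
# absent from the cluster file is kept, with cluster_size set to -1.
cache = json.loads("""
{
  "5g2e_A": {"resolution": 1.9, "cluster_size": -1}
}
""")

entry = cache["5g2e_A"]  # no KeyError anymore
print(entry["cluster_size"])
```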
The base class of "OpenFoldWrapper" is "LightningModule", which also has no attribute '_compute_validation_metrics'. Is this a bug?
This has since been fixed. Try pulling one more time.
Thanks! But there is a new error: should I change the parameter "num_workers" in config.py, as shown in the following figure?
That's odd---I've never seen this one before. Could you share more information about what you might be doing to trigger it? Possibly share minimal reproduction code?
I didn't change your code; I just ran it following your readme.txt. The training data is 370 alignments, covering 151 proteins. The error information is as follows: error information.txt
I tried reproducing this earlier but it worked normally. What versions of each package are you running?
Package versions: package version.txt
I have resolved this problem by inserting a line of code: "torch.multiprocessing.set_sharing_strategy('file_system')".
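For anyone hitting the same thing, a sketch of the workaround (assumes PyTorch is installed; the call should run before any DataLoader workers are spawned, e.g. near the top of the training entry point):

```python
import torch.multiprocessing

# Share tensors between workers via the file system instead of file
# descriptors, avoiding "too many open files" / ancdata errors when
# many DataLoader workers are active.
torch.multiprocessing.set_sharing_strategy("file_system")
```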
But I get the same problem as #50, i.e., the loss is NaN.
I can't see the download link for PDB40 in the current main branch. I found one here, though: https://www.rcsb.org/docs/programmatic-access/file-download-services