PAE data in a common format for report generation
Description of feature
Currently the PAE data, which shows confidence in the relative placement of residues (very useful for multimers), is only generated for ColabFold, using that codebase's own functions.
It would be good to expand extract_output.py to pull the PAE data from the various formats in which the different folding program modules save their advanced data, and to write a simple PAE data file that generate_report.py can reliably read.
The complication is that each program stores its PAE data differently:
- [ ] AlphaFold2: `pickle.load(result_model_*.pkl)` - see here for script by Cam Hyde @ Galaxy Aus - key: 'predicted_aligned_error'
- [ ] AlphaFold3: `[protein]_seed-*_sample-*_confidences.json` as described here - key: "pae"
- [ ] ColabFold: `[protein]_0_scores_rank_*_alphafold2_ptm_model_*_seed_*.json` - key: "pae"
- [ ] ESMFold: seems it could be possible if you go directly to the model but it's not provided using `esm-fold -i`
- [ ] HelixFold3: `/run/[protein]-pred-*-*/all_results.json` - key: "pae"
- [ ] RosettaFold-All-Atom: from the pytorch model returned - `torch.load([prot_aux.pt], map_location="cpu")` - key: 'pae'
- [ ] Boltz-1: `numpy.load("pae_[protein]_model_0.npz")` - key: 'pae'
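As a rough illustration of what a shared loader might look like, here is a minimal sketch covering only the JSON and pickle cases from the checklist. The function name `load_pae` and the `PAE_KEYS` mapping are hypothetical and not part of extract_output.py. It assumes the key sits at the top level of each file, and that the AlphaFold2 pickle can be unpickled in the current environment (which typically needs NumPy installed). The RosettaFold-All-Atom and Boltz-1 cases are left out because they need `torch.load` and `numpy.load` respectively.

```python
import json
import pickle
from pathlib import Path

# Hypothetical mapping of program name -> key holding the PAE matrix,
# copied from the checklist in this issue.
PAE_KEYS = {
    "alphafold2": "predicted_aligned_error",
    "alphafold3": "pae",
    "colabfold": "pae",
    "helixfold3": "pae",
}


def load_pae(path, program):
    """Return the PAE matrix as a list of lists of floats.

    Assumes the key is at the top level of the file.
    """
    path = Path(path)
    key = PAE_KEYS[program]
    if path.suffix == ".json":
        with path.open() as handle:
            data = json.load(handle)
    elif path.suffix == ".pkl":
        # Only unpickle files you trust: pickle can execute arbitrary code.
        with path.open("rb") as handle:
            data = pickle.load(handle)
    else:
        raise ValueError(f"Unsupported PAE file type: {path.suffix}")
    # Normalise rows (plain lists or array rows) to plain Python floats.
    return [[float(value) for value in row] for row in data[key]]
```

Keeping the per-program differences in one lookup table means a new module only needs a new entry (and possibly a new file-type branch) rather than another conditional in the report code.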
Other modules being added:
- [ ] RosettaFold2NA
- [ ] OmegaFold
- [ ] Chai
Just adding my +1. Ideally there would be a small library to handle all interoperation of prediction metadata, but most presentation tools simply wrap a bunch of conditionals to handle the different JSON structures and key names.
ModelCIF is possibly the way to go, but support is rather bare-bones at the moment: pretty much only Molstar and ChimeraX.
I'm not familiar with ModelCIF, but from what I can gather from this discussion, one could extract the output and then populate the ma_qa_metric_local_pairwise section?
It is appealing that you'd generate a 'standard' structure file with metadata for deposition, regardless of which folding program you used. But perhaps for a later release?
[WIP] currently expanding extract_output.py in my local fork to output a ${id}_pae.tsv for the above programs. No consistency yet.
I've rounded to 4 d.p. for brevity, but that might not be desired.
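For discussion, a minimal sketch of how such a TSV could be written. The layout is an assumption, since the format is not settled here: one row per residue, tab-separated, no header. The function name `write_pae_tsv` and the `decimals` parameter are hypothetical; making the precision a parameter leaves the 4 d.p. choice easy to change.

```python
import csv


def write_pae_tsv(pae, out_path, decimals=4):
    """Write a square PAE matrix as a headerless TSV, one row per residue."""
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for row in pae:
            # Round each value to the requested precision before writing.
            writer.writerow([round(float(value), decimals) for value in row])


def read_pae_tsv(path):
    """Read the matrix back as a list of lists of floats."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        return [[float(value) for value in row] for row in reader]
```

A report script could then call `read_pae_tsv("example_pae.tsv")` (file name illustrative) without needing to know which folding program produced the values.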
Workflows will need to be updated to point to the correct files, and bin/generate_report.py will need to be updated to create PAE plots.
Implemented for the current programs in #306; this can be reopened for the checklist when new programs are merged in.