`uniq` usage causes pLDDT row misalignment
Description of the bug
I might be missing something, but in the following lines:
awk '{print \$6"\\t"\$11}' ranked_0.pdb | uniq > ranked_0_plddt.tsv
for i in 1 2 3 4
do awk '{print \$6"\\t"\$11}' ranked_\$i.pdb | uniq | awk '{print \$2}' > ranked_"\$i"_plddt.tsv
done
uniq, which removes consecutive duplicate lines, is applied only to the pLDDT scores. The number of rows it emits therefore depends on how many consecutive identical pLDDT scores each structure happens to contain. When that differs between ranked structures, the outputs have different lengths, leading to row misalignment and incorrect downstream visualisation (as shown below):
Command used and terminal output
Relevant files
No response
System information
No response
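A toy illustration of the failure mode described in the bug report, in case it helps reviewers. This is only a sketch: the score values are made up, `uniq_lines` is a hypothetical stand-in for the shell `uniq`, and it assumes (as the description states) that deduplication effectively acts on the score column alone.

```python
from itertools import groupby


def uniq_lines(lines):
    """Mimic shell `uniq`: collapse each run of identical consecutive lines."""
    return [key for key, _ in groupby(lines)]


# Hypothetical per-atom pLDDT values for three residues, two atoms each.
ranked_0 = ["90.1", "90.1", "85.0", "85.0", "70.2", "70.2"]
# In this model, residues 1 and 2 happen to share the same score.
ranked_1 = ["88.0", "88.0", "88.0", "88.0", "71.5", "71.5"]

col_0 = uniq_lines(ranked_0)  # one row per residue: 3 rows
col_1 = uniq_lines(ranked_1)  # residues 1 and 2 are merged: 2 rows

print(len(col_0), len(col_1))  # 3 2
```

Pasting `col_0` and `col_1` side by side would put residue 3 of `ranked_1` on the same row as residue 2 of `ranked_0`, which is the kind of shift reported here.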
I've started on a MultiQC implementation that uses Biopython to parse the b-factors instead.
from Bio import PDB
parser = PDB.PDBParser(QUIET=True)
It also supports the .cif output of AlphaFold3
elif samplename.endswith(".cif"):
    parser = PDB.MMCIFParser(QUIET=True)
I'll upload when more feature complete, but can provide code snippets if a more robust way to parse pLDDT from structures is desired.
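For anyone wanting to see why a structure-aware parser is more robust than whitespace splitting, here is a minimal stdlib-only sketch. It is not the Biopython code mentioned above. It reads the fixed-width ATOM columns from the PDB format directly (chain in column 22, residue number in columns 23-26, B-factor in columns 61-66) and keeps one value per residue. The helper `make_atom_line` and the example coordinates are hypothetical and exist only to build test input.

```python
def make_atom_line(serial, name, resname, chain, resseq, x, y, z, occ, bfac):
    """Build a fixed-width PDB ATOM record (hypothetical helper for the demo)."""
    return (
        f"ATOM  {serial:5d} {name:<4s} {resname:>3s} {chain:1s}{resseq:4d} "
        f"   {x:8.3f}{y:8.3f}{z:8.3f}{occ:6.2f}{bfac:6.2f}"
    )


def residue_plddt(pdb_text):
    """Return [(chain, residue_number, plddt)], one entry per residue."""
    seen = {}
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue
        chain = line[21]              # column 22
        resseq = int(line[22:26])     # columns 23-26
        icode = line[26]              # column 27 (insertion code)
        bfactor = float(line[60:66])  # columns 61-66, where pLDDT is stored
        # Keep the first atom's value; dicts preserve insertion order.
        seen.setdefault((chain, resseq, icode), bfactor)
    return [(c, r, b) for (c, r, _), b in seen.items()]


line1 = make_atom_line(1, "N", "MET", "A", 1, 10.0, 20.0, 30.0, 1.0, 91.5)
line2 = make_atom_line(2, "CA", "MET", "A", 1, 11.0, 21.0, 31.0, 1.0, 91.5)
# y = -100.000 fills its 8-character field, so x and y touch with no space.
line3 = make_atom_line(3, "N", "ALA", "A", 2, 11.0, -100.0, 5.0, 1.0, 91.5)
pdb_text = "\n".join([line1, line2, line3])

print(len(line1.split()), len(line3.split()))
print(residue_plddt(pdb_text))
```

In this example the third line splits into one fewer whitespace-separated field than the first, so a field index such as `$11` would no longer point at the B-factor, while the column-based parser still returns both residues even though they share the same score.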
The ESMFold pathway that generates the _plddt_mqc.tsv includes extra fields to get atom-wise confidences.
awk '{print \$2"\\t"\$3"\\t"\$4"\\t"\$6"\\t"\$11}' ${meta.id}_esmfold.pdb | grep -v 'N/A' | uniq > plddt.tsv
echo -e Atom_serial_number"\\t"Atom_name"\\t"Residue_name"\\t"Residue_sequence_number"\\t"pLDDT > header.tsv
cat header.tsv plddt.tsv > ${meta.id}_plddt_mqc.tsv
In the name of standardisation, can we have all the different modules generate only residue-wise pLDDT, with exactly the same header formatting, so that generate_report.py can rely on this standard?
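One possible shape for such a standard, sketched under assumptions: rows are keyed by residue number so models can never drift out of alignment, and a model lacking a residue gets an empty cell. The function name `write_plddt_table` and the header text are hypothetical, not what generate_report.py currently expects.

```python
import csv
import io


def write_plddt_table(models, out):
    """Write a residue-wise pLDDT TSV.

    models: dict mapping model name -> {residue_number: plddt}.
    out: any text file-like object.
    """
    names = list(models)
    # Union of residue numbers across all models, in ascending order.
    residues = sorted(set().union(*(m.keys() for m in models.values())))
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    # Header name is an assumption for illustration.
    writer.writerow(["Residue_sequence_number"] + names)
    for res in residues:
        writer.writerow([res] + [models[n].get(res, "") for n in names])


# Hypothetical scores for two ranked models.
models = {
    "ranked_0": {1: 90.1, 2: 85.0, 3: 70.2},
    "ranked_1": {1: 88.0, 2: 88.0, 3: 71.5},
}
buf = io.StringIO()
write_plddt_table(models, buf)
table = buf.getvalue()
print(table)
```

Because each row carries its residue number, two adjacent residues with identical scores (residues 1 and 2 of `ranked_1` here) stay on separate rows.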
Fixed in local UNSW deployment, in this commit. Will PR to nf-core when fix-multiqc-intermediates branch more complete.
Approach also used by @tlitfin-unsw for generate_report.py in #264
plddt.tsv misalignment fixed in #306