proteinfold `uniq` usage causes pLDDT row misalignment

Description of the bug

I might be missing something, but in the following lines:

    awk '{print \$6"\\t"\$11}' ranked_0.pdb | uniq > ranked_0_plddt.tsv
    for i in 1 2 3 4
        do awk '{print \$6"\\t"\$11}' ranked_\$i.pdb | uniq | awk '{print \$2}' > ranked_"\$i"_plddt.tsv
    done

`uniq` removes consecutive duplicate lines, and here it appears to act effectively on the pLDDT scores alone. When the ranked structures contain different runs of identical consecutive pLDDT values, the per-model files end up with different numbers of rows, so the columns become misaligned and the downstream visualisation is incorrect (as shown below):
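To make the suspected mechanism concrete, here is a toy Python sketch (not the pipeline code). It mimics `uniq` with `itertools.groupby` and assumes, as described above, that the deduplication effectively sees only the score column. The per-atom scores and the names `model_a` and `model_b` are invented for illustration.

```python
from itertools import groupby


def uniq(lines):
    """Mimic Unix `uniq`: collapse runs of identical consecutive lines."""
    return [key for key, _run in groupby(lines)]


# Hypothetical per-atom pLDDT values for two models of the same 3-residue
# chain (3, 2 and 2 atoms per residue). The numbers are made up.
model_a = ["90.10"] * 3 + ["90.10"] * 2 + ["85.00"] * 2  # residues 1 and 2 share a score
model_b = ["91.00"] * 3 + ["88.00"] * 2 + ["85.00"] * 2  # all three residues differ

rows_a = uniq(model_a)
rows_b = uniq(model_b)

print(len(rows_a), rows_a)  # 2 rows: residues 1 and 2 were merged
print(len(rows_b), rows_b)  # 3 rows: one per residue
```

Pasting such columns side by side would pair the second row of `model_a` (residue 3) with the second row of `model_b` (residue 2), which is the kind of shift reported here.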

Command used and terminal output

Relevant files

No response

System information

No response

Nov 29 '24 04:11 jscgh

I've started on a MultiQC implementation that uses Biopython to parse the B-factors instead.

    from Bio import PDB
    parser = PDB.PDBParser(QUIET=True)

It also supports the .cif output of AlphaFold3:

    elif samplename.endswith(".cif"):
        parser = PDB.MMCIFParser(QUIET=True)

I'll upload it once it is more feature-complete, but I can provide code snippets in the meantime if a more robust way to parse pLDDT from structures is wanted.
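For reference, a dependency-free sketch of the same idea follows. It is not the Biopython-based MultiQC code mentioned above. It reads the fixed PDB column positions instead of splitting on whitespace, since adjacent fixed-width fields (for example two large negative coordinates) can run together and shift awk's field numbers. The function `residue_plddt` and the helper `_atom` are hypothetical names, and the sketch assumes the B-factor column holds pLDDT with the same value for every atom in a residue.

```python
def residue_plddt(pdb_lines):
    """Return one (chain, residue_number, residue_name, pLDDT) tuple per residue.

    Uses fixed PDB column positions rather than whitespace splitting.
    Assumption: the B-factor column holds pLDDT and is identical for every
    atom of a residue, so the first atom seen for each residue is used.
    """
    first_seen = {}  # dicts keep insertion order, so residues stay in file order
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue  # skips HETATM, TER, END, REMARK, ...
        key = (line[21], int(line[22:26]), line[26])  # chain ID, resSeq, insertion code
        if key not in first_seen:
            first_seen[key] = (line[17:20].strip(), float(line[60:66]))
    return [(chain, num, name, plddt)
            for (chain, num, _icode), (name, plddt) in first_seen.items()]


def _atom(serial, atom, resname, chain, resseq, x, y, z, plddt):
    """Hypothetical helper: format one fixed-width ATOM record for the demo."""
    return (f"ATOM  {serial:>5} {atom:<4} {resname:>3} {chain}{resseq:>4}    "
            f"{x:>8.3f}{y:>8.3f}{z:>8.3f}{1.0:>6.2f}{plddt:>6.2f}")


# Invented records: the first line has x and y fused as "-100.123-200.456",
# and both residues share the same pLDDT.
demo = [
    _atom(1, "N", "MET", "A", 1, -100.123, -200.456, 3.0, 91.5),
    _atom(2, "CA", "MET", "A", 1, -101.0, -201.0, 3.5, 91.5),
    _atom(3, "N", "GLY", "A", 2, 1.0, 2.0, 3.0, 91.5),
    "TER",
]
print(residue_plddt(demo))
```

Because residues are keyed on chain, residue number and insertion code, two neighbouring residues with the same score are both kept, and the row count always equals the residue count.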

Dec 17 '24 23:12 keiran-rowell-unsw

The ESMFold path that generates _plddt_mqc.tsv includes extra fields to report atom-wise confidences:

    awk '{print \$2"\\t"\$3"\\t"\$4"\\t"\$6"\\t"\$11}' ${meta.id}_esmfold.pdb | grep -v 'N/A' | uniq > plddt.tsv
    echo -e Atom_serial_number"\\t"Atom_name"\\t"Residue_name"\\t"Residue_sequence_number"\\t"pLDDT > header.tsv
    cat header.tsv plddt.tsv > ${meta.id}_plddt_mqc.tsv

In the name of standardisation, can we have all the different modules generate only residue-wise pLDDT, with exactly the same header formatting, so that generate_report.py can rely on this standard?
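One possible shape for such a standard is sketched below, purely as an assumption about what the shared format could look like. It joins the models on residue number rather than pasting columns row by row, so a gap in one model leaves an empty cell instead of shifting later rows. The function name `merge_plddt` and the model labels `rank_0` and `rank_1` are hypothetical; the first header name is borrowed from the ESMFold header above.

```python
def merge_plddt(models):
    """Build an aligned TSV string from {model_name: {residue_number: pLDDT}}.

    Rows are joined on residue number, so a model that lacks a residue gets
    an empty cell instead of shifting every later row.
    """
    names = list(models)
    positions = sorted({pos for scores in models.values() for pos in scores})
    # Header names are an assumption, not an agreed nf-core convention.
    lines = ["\t".join(["Residue_sequence_number"] + names)]
    for pos in positions:
        cells = [format(models[n][pos], ".2f") if pos in models[n] else ""
                 for n in names]
        lines.append("\t".join([str(pos)] + cells))
    return "\n".join(lines) + "\n"


# Invented scores: rank_1 has no value for residue 2.
tsv = merge_plddt({
    "rank_0": {1: 91.5, 2: 88.0, 3: 70.25},
    "rank_1": {1: 90.0, 3: 65.0},
})
print(tsv, end="")
```

With a single writer like this, a downstream script such as generate_report.py would only need to handle one header and one row-per-residue layout.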

Mar 25 '25 00:03 keiran-rowell-unsw

Fixed in the local UNSW deployment, in this commit. I will open a PR to nf-core once the fix-multiqc-intermediates branch is more complete.

Approach also used by @tlitfin-unsw for generate_report.py in #264

Apr 01 '25 06:04 keiran-rowell-unsw

plddt.tsv misalignment fixed in #306

May 27 '25 08:05 keiran-rowell-unsw