`uniq` usage causes pLDDT row misalignment

Open jscgh opened this issue 1 year ago • 3 comments

Description of the bug

I might be missing something, but in the following lines:

    awk '{print \$6"\\t"\$11}' ranked_0.pdb | uniq > ranked_0_plddt.tsv
    for i in 1 2 3 4
        do awk '{print \$6"\\t"\$11}' ranked_\$i.pdb | uniq | awk '{print \$2}' > ranked_"\$i"_plddt.tsv
    done

`uniq` is applied only to the pLDDT scores and removes consecutive duplicate lines. This can produce outputs of different lengths when the ranked structures have different runs of consecutive identical pLDDT scores, leading to row misalignment and incorrect downstream visualisation (as shown below):

Image
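
The effect described above can be sketched in a few lines of Python. This is an illustration only: the score values are made up, and the helper `uniq` is a hypothetical stand-in that emulates the shell tool by collapsing runs of adjacent identical lines. It assumes duplicates are collapsed on the score column, as the description states.

```python
from itertools import groupby


def uniq(lines):
    """Emulate shell `uniq`: collapse runs of adjacent identical lines."""
    return [key for key, _ in groupby(lines)]


# Made-up per-atom pLDDT values for two models of the same 3-residue chain,
# two atoms per residue. These are not real pipeline outputs.
model_a = ["90.10", "90.10", "85.00", "85.00", "70.25", "70.25"]
model_b = ["88.00", "88.00", "88.00", "88.00", "70.25", "70.25"]

rows_a = uniq(model_a)  # one row per residue, as intended
rows_b = uniq(model_b)  # residues 1 and 2 share a score, so they merge

print(len(rows_a), len(rows_b))  # 3 2
```

Because the second model has two neighbouring residues with the same score, it yields one row fewer, and every later row is shifted relative to the first model.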

Command used and terminal output


Relevant files

No response

System information

No response

jscgh avatar Nov 29 '24 04:11 jscgh

I've started on a MultiQC implementation that uses Biopython to parse the B-factors instead.

from Bio import PDB
parser = PDB.PDBParser(QUIET=True)

It also supports the .cif output of AlphaFold3

elif samplename.endswith(".cif"):
   parser = PDB.MMCIFParser(QUIET=True)

I'll upload when more feature complete, but can provide code snippets if a more robust way to parse pLDDT from structures is desired.
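
For illustration, here is a standard-library sketch of the same idea: take one pLDDT value per residue, keyed on the residue identity rather than on `uniq`. It is not the Biopython implementation mentioned above. The function names `residue_plddt` and `atom_line` and the sample records are hypothetical, and it assumes every atom in a residue carries the same pLDDT in the B-factor column (PDB columns 61-66).

```python
def atom_line(serial, name, resname, chain, resseq, xyz, bfactor):
    """Build a fixed-width PDB ATOM record (hypothetical helper for sample data)."""
    x, y, z = xyz
    return "ATOM  %5d %-4s %3s %s%4d    %8.3f%8.3f%8.3f%6.2f%6.2f" % (
        serial, name, resname, chain, resseq, x, y, z, 1.00, bfactor)


def residue_plddt(pdb_lines):
    """Return [(chain, residue_number, plddt)], one entry per residue.

    Fields are read by column position, because adjacent fixed-width PDB
    fields can touch and then cannot be separated by whitespace splitting.
    """
    seen = {}
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue  # skip TER, END, REMARK, HETATM, ...
        key = (line[21], int(line[22:26]), line[26])  # chain, resSeq, iCode
        if key not in seen:  # first atom of each residue supplies the value
            seen[key] = float(line[60:66])
    return [(chain, resseq, plddt) for (chain, resseq, _), plddt in seen.items()]


# Made-up records: residues 1 and 2 share a score but stay separate rows.
sample_pdb = [
    atom_line(1, "N", "MET", "A", 1, (-100.123, -200.456, 3.0), 100.00),
    atom_line(2, "CA", "MET", "A", 1, (1.0, 2.0, 3.0), 100.00),
    atom_line(3, "N", "GLY", "A", 2, (4.0, 5.0, 6.0), 100.00),
    atom_line(4, "N", "ALA", "A", 3, (7.0, 8.0, 9.0), 71.25),
    "TER",
]

print(residue_plddt(sample_pdb))
```

The output has exactly one row per residue regardless of whether neighbouring residues happen to share a score.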

keiran-rowell-unsw avatar Dec 17 '24 23:12 keiran-rowell-unsw

The ESMFold pathway to generate the _plddt_mqc.tsv includes extra fields to get atom-wise confidences.

    awk '{print \$2"\\t"\$3"\\t"\$4"\\t"\$6"\\t"\$11}' ${meta.id}_esmfold.pdb | grep -v 'N/A' | uniq > plddt.tsv
    echo -e Atom_serial_number"\\t"Atom_name"\\t"Residue_name"\\t"Residue_sequence_number"\\t"pLDDT > header.tsv
    cat header.tsv plddt.tsv > ${meta.id}_plddt_mqc.tsv

In the name of standardisation, can we have all the different modules generate only residue-wise pLDDT, with exactly the same header formatting, so that generate_report.py can rely on this standard?
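
As a rough sketch of what such a standard could look like, the snippet below writes residue-wise pLDDT for several models into one table joined on residue number, so rows cannot drift out of alignment. The function name `write_plddt_table`, the column names and the `NA` placeholder are assumptions for illustration, not the format generate_report.py currently expects.

```python
import csv
import io


def write_plddt_table(models, out):
    """Write one row per residue and one column per model.

    `models` maps a model name to {residue_number: plddt}. Rows are joined on
    residue number, so a residue missing from one model gets "NA" there
    instead of shifting the rows below it.
    """
    residues = sorted(set().union(*(m.keys() for m in models.values())))
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["Residue_sequence_number", *models])  # hypothetical header
    for res in residues:
        writer.writerow([res, *(models[name].get(res, "NA") for name in models)])


# Made-up values; ranked_1 is missing residue 2 on purpose.
models = {
    "ranked_0": {1: 90.1, 2: 85.0},
    "ranked_1": {1: 88.0},
}
buffer = io.StringIO()
write_plddt_table(models, buffer)
print(buffer.getvalue(), end="")
```

Joining on an explicit residue key rather than on row order means a downstream reader can check the first column instead of trusting that every module emitted the same number of lines.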

keiran-rowell-unsw avatar Mar 25 '25 00:03 keiran-rowell-unsw

Fixed in the local UNSW deployment, in this commit. Will PR to nf-core when the fix-multiqc-intermediates branch is more complete.

Approach also used by @tlitfin-unsw for generate_report.py in #264

keiran-rowell-unsw avatar Apr 01 '25 06:04 keiran-rowell-unsw

plddt.tsv misalignment fixed in #306

keiran-rowell-unsw avatar May 27 '25 08:05 keiran-rowell-unsw