AlphaFold2 MSA coverage plot includes "padding" rows as if they were sequences of poly-Alanine

Open tlitfin-unsw opened this issue 8 months ago • 3 comments

Description of the bug

AlphaFold2 multimer pads MSAs to a minimum depth by filling with 0s. During MSA parsing, 0 is interpreted as Alanine, so plots for shallow MSAs are dominated by padding rows that look like poly-Alanine sequences. The padding can be removed using msa_mask or num_alignments from the AlphaFold2 features.pkl.
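A minimal sketch of the idea, not the fix from my local branch. The helper name `strip_msa_padding` is hypothetical, and the assumed layout is that `msa` is a 2D integer array, `msa_mask` has the same shape with 0 marking padded positions, and `num_alignments` is a scalar (or an array repeating the count) giving the number of real rows. These assumptions should be checked against an actual features.pkl.

```python
import numpy as np


def strip_msa_padding(features):
    """Return only the real MSA rows, dropping zero-filled padding rows.

    `features` is a dict-like object such as the one unpickled from
    features.pkl. The key layout used here is an assumption.
    """
    msa = np.asarray(features["msa"])

    if "msa_mask" in features:
        # Assumed: msa_mask matches msa in shape, non-zero for real positions.
        mask = np.asarray(features["msa_mask"]).astype(bool)
        keep_rows = mask.any(axis=1)  # a row is real if any position is unmasked
        return msa[keep_rows]

    if "num_alignments" in features:
        # Assumed: a scalar, or an array that repeats the same count.
        n_real = int(np.asarray(features["num_alignments"]).ravel()[0])
        return msa[:n_real]

    # Neither key is present, so there is nothing to filter on.
    return msa
```

The features dict would typically come from unpickling features.pkl, and the filtered array can then be passed to the coverage plot in place of the raw `msa`.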

I have a small fix in a local branch. Happy to submit a PR, but @keiran-rowell-unsw is also working on a larger refactor to extend msa-coverage plots to all new modules, so it might be easier to wait for that.

tlitfin-unsw avatar May 07 '25 04:05 tlitfin-unsw

Ah thanks! I wasn't aware of this. There should only be a single generate_sequence_coverage_plot() function, so we should be able to handle it there.

keiran-rowell-unsw avatar May 07 '25 04:05 keiran-rowell-unsw

@tlitfin-unsw I have dramatically simplified MSA processing using numpy arrays: https://github.com/Australian-Structural-Biology-Computing/proteinfold/blob/3391d8417b643b5c9c277bc908e78530d5bb3b28/bin/utils.py#L216-L232

Hopefully we can have something like:

def process_msas(msa_path, prog):
    if prog == "program_name":
        remove_program_msa_quirk(msa)
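One way the per-program dispatch could look once the MSA is a numpy array, as a rough sketch only. The names `MSA_QUIRK_FIXES`, `apply_msa_quirk_fix`, `drop_all_zero_rows` and the `"alphafold2"` key are all hypothetical and not taken from bin/utils.py. The all-zero-row check is a heuristic that would also drop a genuine poly-Alanine sequence, so filtering on msa_mask or num_alignments is likely safer where those are available.

```python
import numpy as np


def drop_all_zero_rows(msa):
    """Heuristic: treat rows made entirely of 0s as padding and drop them."""
    return msa[msa.any(axis=1)]


# Hypothetical registry mapping a program name to its MSA clean-up function.
MSA_QUIRK_FIXES = {
    "alphafold2": drop_all_zero_rows,
}


def apply_msa_quirk_fix(msa, prog):
    """Apply the program-specific clean-up if one is registered for `prog`."""
    msa = np.asarray(msa)
    fix = MSA_QUIRK_FIXES.get(prog)
    return fix(msa) if fix is not None else msa
```

A registry keeps the program-specific handling out of the plotting function, so adding a new module only means adding one entry.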

keiran-rowell-unsw avatar May 07 '25 11:05 keiran-rowell-unsw

I thought this might have been fixed by the extract_metrics refactor and some file parsing work @Mitchob did. I'll look into it before the v2 release.

keiran-rowell-unsw avatar Jul 29 '25 23:07 keiran-rowell-unsw