AlphaFold2 MSA coverage plot includes "padding" rows as if they were sequences of poly-Alanine
Description of the bug
AlphaFold2 multimer pads MSAs to a minimum depth by filling with 0s. During MSA parsing, 0 is interpreted as Alanine, so plots for shallow MSAs are dominated by rows that look like poly-Alanine sequences. The padding can be removed using msa_mask or num_alignments from the AlphaFold2 features.pkl.
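For illustration, here is a rough sketch of how the padding rows could be stripped before plotting. It is not the fix from my branch. `strip_msa_padding` is a hypothetical helper, and the toy `features` dict only stands in for a loaded features.pkl. It assumes `msa_mask` has the same shape as `msa` with all-zero rows for padding, and that `num_alignments` holds the unpadded depth, either as a scalar or as a repeated array.

```python
import numpy as np

def strip_msa_padding(features):
    """Return the MSA without zero-padded rows (hypothetical helper)."""
    msa = np.asarray(features["msa"])
    if "msa_mask" in features:
        # Assumption: padded rows are all 0 in msa_mask, real rows are not.
        keep = np.asarray(features["msa_mask"]).any(axis=1)
        return msa[keep]
    if "num_alignments" in features:
        # Assumption: num_alignments stores the real depth (scalar or repeated array).
        depth = int(np.ravel(features["num_alignments"])[0])
        return msa[:depth]
    return msa

# Toy stand-in for features.pkl: 2 real rows followed by 2 padding rows.
features = {
    "msa": np.array([[5, 3, 7], [5, 21, 7], [0, 0, 0], [0, 0, 0]]),
    "msa_mask": np.array([[1, 1, 1], [1, 1, 1], [0, 0, 0], [0, 0, 0]]),
    "num_alignments": np.array(2),
}
trimmed = strip_msa_padding(features)
```

Using the mask where it is available avoids guessing from the residue indices alone, since a genuine all-Alanine row would also be encoded as 0s.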
I have a small fix in a local branch. I'm happy to submit a PR, but @keiran-rowell-unsw is also working on a larger refactor to extend the MSA coverage plots to all new modules, so it might be easier to wait for that.
Ah, thanks! I wasn't aware of this. There should only be a single generate_sequence_coverage_plot() function, and we should be able to handle this there.
@tlitfin-unsw I have dramatically simplified MSA processing using numpy arrays: https://github.com/Australian-Structural-Biology-Computing/proteinfold/blob/3391d8417b643b5c9c277bc908e78530d5bb3b28/bin/utils.py#L216-L232
Hopefully we can have something like:
def process_msas(msa_path, prog):
    # msa is the alignment loaded from msa_path (loading step omitted here)
    if prog == "program_name":
        remove_program_msa_quirk(msa)
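One possible shape for this is a small lookup from program name to clean-up function, sketched below. All names here (`MSA_QUIRKS`, `clean_msa`, `drop_all_zero_rows`, the `"alphafold2"` key) are hypothetical and not taken from the current utils.py. The all-zero-row check is only a stand-in for whatever the real quirk handling ends up being.

```python
import numpy as np

def drop_all_zero_rows(msa):
    # Hypothetical quirk handler: drop rows made up entirely of 0s (padding).
    return msa[msa.any(axis=1)]

# Hypothetical registry: program name -> function that removes its MSA quirk.
MSA_QUIRKS = {"alphafold2": drop_all_zero_rows}

def clean_msa(msa, prog):
    """Apply the program-specific clean-up if one is registered."""
    msa = np.asarray(msa)
    handler = MSA_QUIRKS.get(prog)
    return handler(msa) if handler is not None else msa

cleaned = clean_msa([[1, 2], [0, 0]], "alphafold2")
untouched = clean_msa([[1, 2], [0, 0]], "other_program")
```

Programs without an entry pass through unchanged, so a new module only needs a handler if it actually has a quirk.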
I thought this might have been fixed by the extract_metrics refactor and some file parsing work @Mitchob did. I'll look into it before the v2 release.