
augur align: inserted Ns are reported incorrectly

Open tomalf2 opened this issue 5 months ago • 3 comments

Hi, I have a question about how the letter "N" is treated in augur align, which probably points to a bug.

Current Behavior

Consider this example.fasta file:

>B
MFHLVDFQVTIAEILLIIMRT
>BS
MFHLVDFQVTIAEILLIIMRTN

Between the reference (B) and the target sequence (BS) there is just one difference: an additional "N" at the end of the target. Running augur align --reference-name B --sequences example.fasta produces the file alignment.fasta.insertions.csv, which contains only the header row:

strain,insertion: 1bp @ ref pos 21

In other words, the row reporting the inserted "N" is missing. Why is that?

Expected behavior

The file alignment.fasta.insertions.csv should instead be:

strain,insertion: 1bp @ ref pos 21
BS,N

How to reproduce

  1. Create a file named example.fasta with the following content:

    >B
    MFHLVDFQVTIAEILLIIMRT
    >BS
    MFHLVDFQVTIAEILLIIMRTN
    
  2. Open a terminal and type augur align --reference-name B --sequences example.fasta

  3. Look for a file named alignment.fasta.insertions.csv

  4. Check that the trailing inserted N is missing from that file.
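
The steps above can be scripted. This is a minimal sketch, not part of augur: the helper names `write_example_fasta` and `read_insertions` are hypothetical, and the CSV layout (one header row, then one row per strain) is assumed from the output shown in this issue.

```python
import csv
from pathlib import Path


def write_example_fasta(path="example.fasta"):
    """Write the two-record FASTA file from this issue and return its path."""
    records = {
        "B": "MFHLVDFQVTIAEILLIIMRT",    # reference
        "BS": "MFHLVDFQVTIAEILLIIMRTN",  # reference plus a trailing N
    }
    text = "".join(f">{name}\n{seq}\n" for name, seq in records.items())
    Path(path).write_text(text)
    return path


def read_insertions(path="alignment.fasta.insertions.csv"):
    """Return (header, {strain: [inserted strings]}) from the insertions CSV.

    Assumes the layout seen in this issue: a header row followed by
    one row per strain. An empty file yields ([], {}).
    """
    with open(path, newline="") as handle:
        rows = list(csv.reader(handle))
    if not rows:
        return [], {}
    header, body = rows[0], rows[1:]
    return header, {row[0]: row[1:] for row in body if row}
```

After calling `write_example_fasta()` and running the augur align command from step 2, `read_insertions()` returns an empty dictionary in the reported case, whereas the expected behavior would give an entry for BS.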

Environment (local)

  • Operating system: MacOS 14.7.4 (23H420)
  • augur version 31.2.1, installed via PIP

tomalf2 avatar Jul 23 '25 11:07 tomalf2

Hi @tomalf2,

Augur usually treats "N" as missing or ambiguous sites, as described in our Missing sequence data guide. Specifically for augur align, the Ns are treated as gaps and excluded from the insertions.csv. They are reported in a warning instead:

$ augur align --reference-name B --sequences example.fasta

using mafft to align via:
	mafft --reorder --anysymbol --nomemsave --adjustdirection --thread 1 alignment.fasta.to_align.fasta 1> alignment.fasta 2> alignment.fasta.log 

	Katoh et al, Nucleic Acid Research, vol 30, issue 14
	https://doi.org/10.1093%2Fnar%2Fgkf436

WARNING: 1bp insertion at ref position 21 was due to 'N's or '?'s in provided sequences
Trimmed gaps in B from the alignment
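
To illustrate the behavior described above, here is a minimal sketch of how insertions made up only of "N" or "?" could be separated from the ones that get reported. It is not augur's actual implementation: the function names and the exact rule (an insertion is set aside when every inserted character is ambiguous) are assumptions made for illustration.

```python
AMBIGUOUS = frozenset("N?")  # characters assumed to mean "missing data"


def find_insertions(ref_aligned, seq_aligned, gap="-"):
    """Return [(ref_pos, inserted)] for runs of columns where the reference is a gap.

    ref_pos is the number of reference residues preceding the insertion.
    """
    insertions = []
    ref_pos = 0
    run = ""
    for r, s in zip(ref_aligned, seq_aligned):
        if r == gap:
            if s != gap:
                run += s  # sequence has a residue where the reference has none
        else:
            if run:
                insertions.append((ref_pos, run))
                run = ""
            ref_pos += 1
    if run:  # insertion at the very end of the alignment
        insertions.append((ref_pos, run))
    return insertions


def split_insertions(insertions, ambiguous=AMBIGUOUS):
    """Split insertions into (reported, warned); all-ambiguous ones are only warned about."""
    reported, warned = [], []
    for pos, inserted in insertions:
        if set(inserted.upper()) <= ambiguous:
            warned.append((pos, inserted))
        else:
            reported.append((pos, inserted))
    return reported, warned


if __name__ == "__main__":
    ins = find_insertions("MFHLVDFQVTIAEILLIIMRT-", "MFHLVDFQVTIAEILLIIMRTN")
    print(split_insertions(ins))
```

With the sequences from this issue, the trailing N ends up in the warned list rather than the reported one, which matches the warning in the log above.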

joverlee521 avatar Jul 25 '25 17:07 joverlee521

Hi @joverlee521,

Thank you for your answer. At least now I know that this is the intended behaviour of augur. However, there are still two aspects I would like to discuss:

I searched through the documentation of augur, and of augur align specifically, and found no advice against using augur align with protein sequences. I now learn that this does not work as expected because of how some characters are treated. For example, "N" is asparagine, a perfectly valid character in protein sequences. Fortunately, the first link you shared also says "Different aligners may modify such characters" (i.e., invalid characters like "E" for glutamic acid), "however MAFFT (the default for augur align) will leave them unchanged". So I guess using augur align for protein sequences is fine except for insertions of "N" characters. If you can confirm that, I believe that enabling the correct alignment of protein sequences by augur align would only require the following:

  1. treat "N" as a normal character
  2. use a similarity score matrix like BLOSUM62 instead of what's currently used by augur align
  3. enable these behaviors when the optional argument "--protein" is used.
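
The three points above could look roughly like the following sketch. Everything here is hypothetical: `--protein` does not exist in augur align, `alignment_settings` is an invented helper, and treating "X" as the ambiguous amino acid is an assumption. The MAFFT options `--amino`, `--nuc`, and `--bl 62` are taken from MAFFT's own manual and should be checked against the installed version.

```python
import argparse


def build_parser():
    """Parser for a stand-in CLI; only --protein is new relative to the issue's command."""
    parser = argparse.ArgumentParser(prog="align-sketch")
    parser.add_argument("--sequences", required=True, help="input FASTA file")
    parser.add_argument("--reference-name", help="name of the reference record")
    parser.add_argument(
        "--protein",
        action="store_true",
        help="(hypothetical) treat the input as amino acid sequences",
    )
    return parser


def alignment_settings(args):
    """Choose the ambiguous-character set and extra MAFFT arguments for the alphabet."""
    if args.protein:
        return {
            # Assumption: 'X' marks an unknown amino acid; 'N' is asparagine.
            "ambiguous": frozenset("X?"),
            "mafft_args": ["--amino", "--bl", "62"],  # BLOSUM62 scoring
        }
    return {
        "ambiguous": frozenset("N?"),  # nucleotide behavior described in this thread
        "mafft_args": ["--nuc"],
    }
```

Keeping the alphabet-dependent choices in one place means the insertion-reporting step only needs the `ambiguous` set, so an inserted "N" would be reported whenever `--protein` is given.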

Such a small and simple change would provide an extremely valuable addition to the augur align module almost for free. I kindly invite you to consider this important feature.

If this is not possible, it would be best to at least explain this behaviour in the augur align documentation.

tomalf2 avatar Jul 28 '25 10:07 tomalf2

@tomalf2 Thank you for your patient persistence with this issue! You're correct that augur align has always been implicitly designed for nucleotide sequence input. At the very least, we should state this limitation in the docs and help text for the align command. I also like your idea of adding a --protein flag to indicate that the input sequences use a different alphabet.

huddlej avatar Jul 28 '25 16:07 huddlej