Update sequence coordinates after trimming gappy ends
When trimming multiple sequence alignments (MSAs) at the ends, the output FASTA headers should reflect the new coordinate ranges relative to the original (untrimmed) sequences. This can prevent confusion in alignment-trimming workflows.This functionality cannot be applied when trimming gappy columns in the middle of the alignment. The in-between gappy characters should be ignored when re-numbering the sequence slices.
Example: SEQ_1 original length (without gaps): 49 aa SEQ_2 original length (without gaps): 46 aa SEQ_3 original length (without gaps): 48 aa SEQ_4 original length (without gaps): 46 aa
>SEQ_1
ACGTACGTACGTACGTA-GTACGTACGTACGTACGTACGTACGTACGTAC
>SEQ_2/50-95
-CGTACGTACGTACGTA-GTACGTACGTACGTACGTACGTACGTACGT--
>SEQ_3/50-97
GCGTACGTACGTACGTA-GTACGTACGTACGTACGTACGTACGTACGTC-
>SEQ_4
-CGTACGTACGTACGTA-CTACGTACGTACGTACGTACGTACGTACGT--
After trimming 1 column from the start and 2 columns from the end, the expected output should be:
>SEQ_1/2-47
CGTACGTACGTACGTACGTA-GTACGTACGTACGTACGTACGTACGT
>SEQ_2/50-95
CGTACGTACGTACGTACGTA-GTACGTACGTACGTACGTACGTACGT
>SEQ_3/51-96
CGTACGTACGTACGTA-GTACGTACGTACGTACGTACGTACGTACGT
>SEQ_4
CGTACGTACGTACGTA-CTACGTACGTACGTACGTACGTACGTACGT
This naming scheme makes clear which portion of the original sequence remains after trimming.
Would you consider adding this functionality in a future release? Perhaps with an --update-coordinates flag that registers only when clipping the ends?
Thanks!
Thanks, Evangelos, for using trimAl -
If I understood your request well, you would like to see the coordinates updated when trimming the beginning/end of an MSA, right? Otherwise, it might be problematic to reflect which columns were kept in the output. I don't see a significant problem with indicating that after the sequence name.
However, it worries me that trimAl should interpret the header to update it -
For instance, in your example, I'd expect all sequences to be ...
SEQ_1/2-47 (or 1-46, considering that the index starts at 0). SEQ_2/2-47 SEQ_3/2-47 SEQ_4/2-47
Cheers,
S
Thanks! These could be slices from metagenomics sequences (not necessarily full length) describing a particular domain, coming from different species. So the initial pre-assigned slice coords (e.g., SEQ_3/51-96) can make sense if the first 50 amino acids have been previously sliced off already (probably because it was a multi-domain protein, and the first domain was non-relevant to the alignment). Does this make sense?