trimal icon indicating copy to clipboard operation
trimal copied to clipboard

Update sequence coordinates after trimming gappy ends

Open vagkaratzas opened this issue 3 months ago • 2 comments

When trimming multiple sequence alignments (MSAs) at the ends, the output FASTA headers should reflect the new coordinate ranges relative to the original (untrimmed) sequences. This can prevent confusion in alignment-trimming workflows.This functionality cannot be applied when trimming gappy columns in the middle of the alignment. The in-between gappy characters should be ignored when re-numbering the sequence slices.

Example: SEQ_1 original length (without gaps): 49 aa SEQ_2 original length (without gaps): 46 aa SEQ_3 original length (without gaps): 48 aa SEQ_4 original length (without gaps): 46 aa

>SEQ_1
ACGTACGTACGTACGTA-GTACGTACGTACGTACGTACGTACGTACGTAC
>SEQ_2/50-95
-CGTACGTACGTACGTA-GTACGTACGTACGTACGTACGTACGTACGT--
>SEQ_3/50-97
GCGTACGTACGTACGTA-GTACGTACGTACGTACGTACGTACGTACGTC-
>SEQ_4
-CGTACGTACGTACGTA-CTACGTACGTACGTACGTACGTACGTACGT--

After trimming 1 column from the start and 2 columns from the end, the expected output should be:

>SEQ_1/2-47
CGTACGTACGTACGTACGTA-GTACGTACGTACGTACGTACGTACGT
>SEQ_2/50-95
CGTACGTACGTACGTACGTA-GTACGTACGTACGTACGTACGTACGT
>SEQ_3/51-96
CGTACGTACGTACGTA-GTACGTACGTACGTACGTACGTACGTACGT
>SEQ_4
CGTACGTACGTACGTA-CTACGTACGTACGTACGTACGTACGTACGT

This naming scheme makes clear which portion of the original sequence remains after trimming.

Would you consider adding this functionality in a future release? Perhaps with an --update-coordinates flag that registers only when clipping the ends?

Thanks!

vagkaratzas avatar Oct 13 '25 08:10 vagkaratzas

Thanks, Evangelos, for using trimAl -

If I understood your request well, you would like to see the coordinates updated when trimming the beginning/end of an MSA, right? Otherwise, it might be problematic to reflect which columns were kept in the output. I don't see a significant problem with indicating that after the sequence name.

However, it worries me that trimAl should interpret the header to update it -

For instance, in your example, I'd expect all sequences to be ...

SEQ_1/2-47 (or 1-46, considering that the index starts at 0). SEQ_2/2-47 SEQ_3/2-47 SEQ_4/2-47

Cheers,

S

scapella avatar Oct 13 '25 08:10 scapella

Thanks! These could be slices from metagenomics sequences (not necessarily full length) describing a particular domain, coming from different species. So the initial pre-assigned slice coords (e.g., SEQ_3/51-96) can make sense if the first 50 amino acids have been previously sliced off already (probably because it was a multi-domain protein, and the first domain was non-relevant to the alignment). Does this make sense?

vagkaratzas avatar Oct 13 '25 08:10 vagkaratzas