Experimental ancient DNA modifications
I decided to use Claude Code to automate testing strobealign's options with ancient DNA and gave it bwa and SHRiMP's code alongside with other references and it started modifying code and came up with this:
https://github.com/teepean/strobealign/tree/aDNA
Here's some results with one test case:
https://github.com/teepean/strobealign/blob/aDNA/FINAL_COMPARISON_ANALYSIS.md
This should explain the idea that it got to feed unaligned reads from K=16 and map those with k=12. Read length of 30 was my idea as the data is preprocessed with adaterremoval and reads below 30 are discarded along with removing adapters and merging reads.
echo "Multi-k fallback strategy for ancient DNA:"
echo " 1. Align with k=16 (--ancient-dna --mcs=always -S 0.95)"
echo " 2. Extract unmapped reads"
echo " 3. Realign unmapped with k=12"
echo " 4. Merge and output final BAM"
echo ""
echo "Parameters:"
echo " threads: number of threads (default: 16)"
echo " read_length: -r parameter for strobealign (default: 30, recommended for aDNA)"
echo ""
I would be thankful if anyone would like to verify the results as they seem to be too good to be true even after using several methods to check the results.
Very interesting. I will have a discussion with @marcelm and @Itolstoganov about it on the next meeting.
Thanks! Honestly if this is not me and the LLM hallucinating together, this would be one step forward replacing bwa aln as de facto ancient DNA aligner.
Hi @teepean, impressive results!
We will run your branch on simulated aDNA data to see if seeding with relatively low -k values negatively affects accuracy. I will share the results this week.
Thanks!
Hi @teepean,
I made a small sanity check benchmark on simulated data to see how strobealign deals with the aDNA sample contamination. Specifically, I simulated 100,000 paired-end 40 bp reads using gargammel, from the following references
- Endogenous aDNA (chromosome 21, from gargammel test data)
- Human contamination reference (also chr21 from gargammel test data)
- Bacterial contamination database k14
The simulation included 35,000 endogenous, 5,000 human contamination, and 60,000 bacterial contamination reads. I then benchmarked the following tools by aligning all simulated reads to the endogenous reference.
- BWA aln v0.7.18
- Current strobealign main
- Your aDNA-modified version
The evaluation metrics were:
- "% Mapped endogenous reads" – % of mapped reads simulated from endogenous reference
- "% Mapped bacterial/contaminated reads" – % of mapped reads simulated from bacterial/contaminated reference.
The reads with alignment score less than 20 were filtered out. Here are the results:
| Tool | % Mapped endogenous reads | % Mapped bacterial reads | % Mapped contaminated reads |
|---|---|---|---|
| strobealign aDNA | 99.60 | 95.75 | 99.62 |
| strobealign main (15e4318) | 99.98 | 95.70 | 99.97 |
| BWA aln | 99.91 | 0.02 | 99.94 |
Unfortunately, it seems that both the default and the aDNA strobealign versions align nearly all of the bacterial reads (although with lower mean alignment score than endogenous reads). This is probably caused by spurious 12-mer matches reported by strobealign. At this stage, strobealign may not yet be suitable for aDNA data without further adjustments.
@teepean Despite the current issues with the bacterial contamination alignment, we believe that developing aDNA mode for strobealign could be worthwhile. Please contact me via the email in my profile or @ksahlin if you are interested in this!