eager icon indicating copy to clipboard operation
eager copied to clipboard

Using modified strobealign to align ancient DNA

Open teepean opened this issue 2 months ago • 6 comments

Is your feature request related to a problem? Please describe

This is a request to make sure I or LLM I used are not hallucinating.

Describe the solution you'd like

I started investigating if strobealign could be used to align ancient DNA and after a while decide to use LLM (Claude Code) to automate my testing process and I decided to feed different papers + bwa's source code to it. The results are here:

https://github.com/teepean/strobealign/tree/aDNA

strobealign-ancient-multik.sh is used to perform the alignment. The sample I used was processed with adapterremoval before aligning. I decided to post the current results now.

https://www.ebi.ac.uk/ena/browser/view/SAMEA117657086

teepean avatar Oct 04 '25 18:10 teepean

Hi @teepean,

The runtime of strobalign is certainly impressive, but I am not sure this writeup is comparing the right things for the idea that strobalign is well suited for use with aDNA to be convincing.

Going from 10M to 18M mapped reads raises potential alarm bells in my mind, because the usual issue with aDNA is not to find enough DNA to map, but to authenticate those reads as real aDNA, and real (e.g.) human reads. Moreover, the fact that with almost 2x the mapped reads the SNP coverage only increases by +0.75% does not fill me with confidence about the authenticity of the additional reads. In fact, the very reason we use bwa-aln with -l 1024 is to disable seeding, and only use true alignment matches, thus minimising cross-mapping reads.

Also, the fact that bwa is configured to use true matching makes the runtime comparison a bit uneven. The strobalign workflow you are using is much more comparable to bwa-aln with a -l 16, from what I can gleam from the documentation.

One thing that is a bit unclear from your writeup is how the genotyping was done. Are you using a random-draw/haploid approach? I think validating the results coming from those genotypes would be more useful overall than looking at the numbers of resulting reads/duplicates directly. In my experience, if the analysed reads contain more contaminant/microbial reads, the resulting genotypes will affect the positioning of the individual on PCA. As an initial check, it would be good to try projecting the created genotypes on the same PCA as they did in the paper, and compare the positioning between the original bwa-aln dataset, and the strobalign dataset.

TCLamnidis avatar Oct 06 '25 11:10 TCLamnidis

Just to also follow up with some technical issues:

While very interested in seeing if strobealign can be adapted to work with aDNA [^1], even if the approach is working - we would need any changes/adaptations merged into strobealign and the aligner 'officially released' with the changes before we could include it in nf-core/eager.

We need the official bioconda recipe updated and an associated biocontainer before we could include it in the pipeline, we very strongly discourage custom scripts and forks etc as this as adds extra burden on us.

So I would be in the long-run interested in seeing this @teepean , I would suggest you instead port the changes as a PR into the main strobealign repo first, and once that's in we can consider inclusion into eager :)

[^1]: this would be useful for me for nf-core/mag because a Tool I want in there [coverm] uses strobealign internally!

jfy133 avatar Oct 06 '25 12:10 jfy133

These are the commands used to determine and compare the quality of the reads.

samtools flagstat: Basic mapping statistics (total reads, mapped reads, percentages)

samtools stats: Comprehensive quality metrics (error rate, average length, base quality, total bases mapped)

MAPQ distribution: samtools view | cut -f5 | sort -n | uniq -c for mapping quality assessment

Coverage analysis: samtools depth | awk for mean depth and total bases covered

sambamba markdup: -t 16 for PCR duplicate removal

samtools mpileup: -B -q 30 -Q 30 -l sites.pos -f reference.fa for pileup generation

pileupCaller: --randomHaploid --sampleNames --samplePopName -f snp_panel.snp -e output_prefix for genotype calling on 1240K panel

@jfy133 the folks over at strobealign will discuss this subject when they have their next meeting

teepean avatar Oct 07 '25 05:10 teepean

Nice! Out of curiosity, where did you get shrimp from?

I've honestly never heard of it, let alone people using it for aDNA.

To my knowledge 95% of people use bwa aln or bowtie2

jfy133 avatar Oct 07 '25 17:10 jfy133

I was reading Joshua Rubin's preprint SAFARI: Pangenome Alignment of Ancient DNA Using Purine/Pyrimidine Encodings and it mentioned SHRiMP "SHRiMP is quite a sensitive alignment tool for aDNA samples but is unfortunately not maintained since 2014"

https://www.biorxiv.org/content/10.1101/2024.08.12.607489v2.full

teepean avatar Oct 07 '25 17:10 teepean

I was reading Joshua Rubin's preprint SAFARI: Pangenome Alignment of Ancient DNA Using Purine/Pyrimidine Encodings and it mentioned SHRiMP "SHRiMP is quite a sensitive alignment tool for aDNA samples but is unfortunately not maintained since 2014"

https://www.biorxiv.org/content/10.1101/2024.08.12.607489v2.full

Huh interesting... I'll have to read that preprint again!

jfy133 avatar Oct 08 '25 10:10 jfy133