ska.rust icon indicating copy to clipboard operation
ska.rust copied to clipboard

How to best call variants with ska lo?

Open rrwick opened this issue 9 months ago • 2 comments

Hello, and thank you for developing SKA and the new ska lo subcommand!

I'm interested in using SKA to call variants against a reference using a single assembly. This is what I've done previously (in this preprint) using ska map:

ska build -o ska.skf -k 31 assembly.fasta
ska weed --filter no-ambig ska.skf
ska map reference.fasta ska.skf -f vcf | bcftools view -e'ALT="."' > variants.vcf

This works well for most SNPs but not for closely-spaced SNPs or indels. But then @johnlees let me know about ska lo, which sounds like it could solve these shortcomings! But I can't quite figure out how to get the VCF I need using ska lo. These commands almost work:

ska build -o ska.skf -k 31 assembly.fasta reference.fasta
ska lo ska.skf test -r reference.fasta

Except I have these two issues:

  1. The indels appear in a separate VCF file that don't have positions.
  2. ska lo only seems to allow a single sequence in the reference, so it errors out when reference.fasta contains multiple sequences (e.g. a chromomsome and plasmids).

Based on the documentation, the first issue seems to be an inherent limitation of ska lo, is that right?

For the second issue, the best solution I've found is to run ska lo separately on each reference sequence and then merge the results together. This is what I've come up with:

samtools faidx reference.fasta
ska build -o ska.skf -k 31 assembly.fasta reference.fasta

# Get reference seq names
seqs=($(grep ">" "$ref" | sed 's/>//; s/ .*//'))

# Run ska lo using each reference sequence
for seq in "${seqs[@]}"; do
    ska lo ska.skf "$seq" -r <(samtools faidx reference.fasta "$seq")
done

# Reheader and concatenate VCFs
vcf_files=()
for seq in "${seqs[@]}"; do
    bcftools reheader -f "${ref}.fai" "${seq}_snps.vcf" > "${seq}_snps_fixed.vcf"
    vcf_files+=("${seq}_snps_fixed.vcf")
done

# Merge all VCFs
bcftools concat -o variants.vcf "${vcf_files[@]}"

It's clunky, but it seems to work. Is there a better way?

Thanks! Ryan

rrwick avatar Mar 05 '25 22:03 rrwick

Hi Ryan,

Thanks for your message. The command 'ska lo' calls variants in reference-free mode and only subsequently attempts to position SNPs on a reference genome, mostly designed for recombination analyses. So, as opposed to other methods that call variants directly from the reference genome, its SNP positioning is prone to errors (which does not necessarily imply that the SNPs are incorrect). More details here: https://www.biorxiv.org/content/10.1101/2024.10.02.616334v2

The current variant positioning is clearly limited, mostly due to a lack of time, and I/someone will have at some point to improve it. You are correct regarding some of its limitations:

  • no indel positioning: it can be done in the future but will probably require to profoundly modify the positioning function as indel positioning is tricky.
  • only 1 reference sequence allowed: this limitation shouldn't exist ... it will be corrected in future versions.

Regarding quick fixes:

  • indels: I could provide you with a Python script that combines the insert with left and right sequences ('before' and 'after' in indel vcf) and then Blast these sequences on a reference genome for positioning. I used this approach in the 1st version of the skalo paper so must have such script somewhere and could modify/test it. Please let me know if you need it.
  • multiple reference sequences: yes, your approach should work. I wouldn't use it in the case of a fragmented genome assembly, with the risk of a given SNP being positioned multiple times independently. But I trust this approach will work here with a full assembly and two small plasmids. Or alternatively you could remove the two small plasmids from the reference and use the standard ska lo command line -> a few more false negatives for ska lo, which would be fair.

All that said, ska map might be sufficient to address the question sequencing reads vs assemblies (I really enjoyed reading your preprint!).

best, Romain

rderelle avatar Mar 06 '25 08:03 rderelle

We should fix the multi-contig reference, the code in ska map does this and we should be able to port it over to ska lo

johnlees avatar Mar 06 '25 17:03 johnlees