shovill
shovill copied to clipboard
Runs with identical parameters and data don't yield identical assemblies
Hi,
I am in the process of validating a pipeline I wrote, and which includes Shovill for bacterial genome assembly.
My assumption was that I could md5sum result files between runs of identical inputs to demonstrate the reproducibility of the overall workflow. Which is when I noticed that the Shovill assemblies (same data, same settings, same everything - Shovill via container etc) are in fact not identical. The differences are not vast in terms of the informational content, but the smaller contigs (<500bp) differ slightly in things like name, coverage, etc.
I am trying to understand why that is and how to deal with it. I noticed that using a cutoff for minimum contig size of 600bp largely eliminates the issue. And I suppose one could make an argument for why that is probably "OK" (even the smallest plasmids are bigger than 600bp), but then again... not an ideal solution.
Any way to nudge Shovill to be perfectly reproducible or is there really some element of randomness in the assembly process?
Cheers Marc
Test data used as input: ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR580/ERR580964/ERR580964_1.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR580/ERR580964/ERR580964_2.fastq.gz
Settings:
--assembler spades --minlen 300
--cpus 6
--ram 25
--outdir ./ \
These settings are for a github test workflow so the resources are constrained by the available runners there.
To add, I ran Spades standalone (3.15), with no resource limits as well as 6cpus/25Gb ram - and the scaffolds.fasta was identical between the runs. So it does seem to be a Shovill thing (related to Pilon maybe?).
After some md5sum'ing, the divergence occurs during the cleaning up of the contig names. Something about the hash sorting randomly returning sequences of equal length, I suppose. Just looking at the sequence headers in the order they appear in the fasta file:
==> 1.txt <==
>contig00232 len=322 cov=69.2 corr=0 origname=NODE_237_length_322_cov_69.164103_pilon sw=shovill-spades/1.1.0 date=20250618
>contig00233 len=318 cov=41.3 corr=0 origname=NODE_238_length_318_cov_41.287958_pilon sw=shovill-spades/1.1.0 date=20250618
>contig00234 len=317 cov=73.4 corr=0 origname=NODE_239_length_317_cov_73.405263_pilon sw=shovill-spades/1.1.0 date=20250618
>contig00235 len=316 cov=65.1 corr=0 origname=NODE_240_length_316_cov_65.111111_pilon sw=shovill-spades/1.1.0 date=20250618
>contig00236 len=312 cov=40.2 corr=0 origname=NODE_241_length_312_cov_40.216216_pilon sw=shovill-spades/1.1.0 date=20250618
>contig00237 len=310 cov=101.3 corr=0 origname=NODE_242_length_310_cov_101.311475_pilon sw=shovill-spades/1.1.0 date=20250618
>contig00238 len=309 cov=60.7 corr=0 origname=NODE_243_length_309_cov_60.714286_pilon sw=shovill-spades/1.1.0 date=20250618
>contig00239 len=309 cov=53.6 corr=0 origname=NODE_244_length_309_cov_53.620879_pilon sw=shovill-spades/1.1.0 date=20250618
>contig00240 len=306 cov=579.5 corr=0 origname=NODE_245_length_306_cov_579.508380_pilon sw=shovill-spades/1.1.0 date=20250618
>contig00241 len=306 cov=55.8 corr=0 origname=NODE_246_length_306_cov_55.787709_pilon sw=shovill-spades/1.1.0 date=20250618
==> 2.txt <==
>contig00232 len=322 cov=69.2 corr=0 origname=NODE_237_length_322_cov_69.164103_pilon sw=shovill-spades/1.1.0 date=20250618
>contig00233 len=318 cov=41.3 corr=0 origname=NODE_238_length_318_cov_41.287958_pilon sw=shovill-spades/1.1.0 date=20250618
>contig00234 len=317 cov=73.4 corr=0 origname=NODE_239_length_317_cov_73.405263_pilon sw=shovill-spades/1.1.0 date=20250618
>contig00235 len=316 cov=65.1 corr=0 origname=NODE_240_length_316_cov_65.111111_pilon sw=shovill-spades/1.1.0 date=20250618
>contig00236 len=312 cov=40.2 corr=0 origname=NODE_241_length_312_cov_40.216216_pilon sw=shovill-spades/1.1.0 date=20250618
>contig00237 len=310 cov=101.3 corr=0 origname=NODE_242_length_310_cov_101.311475_pilon sw=shovill-spades/1.1.0 date=20250618
>contig00238 len=309 cov=53.6 corr=0 origname=NODE_244_length_309_cov_53.620879_pilon sw=shovill-spades/1.1.0 date=20250618
>contig00239 len=309 cov=60.7 corr=0 origname=NODE_243_length_309_cov_60.714286_pilon sw=shovill-spades/1.1.0 date=20250618
>contig00240 len=306 cov=55.8 corr=0 origname=NODE_246_length_306_cov_55.787709_pilon sw=shovill-spades/1.1.0 date=20250618
>contig00241 len=306 cov=579.5 corr=0 origname=NODE_245_length_306_cov_579.508380_pilon sw=shovill-spades/1.1.0 date=20250618
Sadly, this cannot be easily fixed "after the fact" (short of renaming the contigs again). My perl is a little rusty, but constructing a hash of hashes with both the sequence length and coverage seems like a possible solution.