NanoSim
NanoSim copied to clipboard
Reads come mostly from one chromosome only
Hello - I'm simulating reads from Leishmania major genome using training reads sequenced from this genome:
read_analysis.py genome -o training -i fastq_runid_ae10a0b01321b03612390f5e1e5fe2b0d834516a_0_0.fastq.gz \
-rg TriTrypDB-45_LmajorFriedlin_Genome.fasta -t 16
simulator.py genome -c training -n 20000 -t 32 -b guppy -rg TriTrypDB-45_LmajorFriedlin_Genome.fasta
After mapping the simulated reads back to the reference genome, I find that most reads come from one chromosome only:
minimap2 -a -x map-ont -t 12 TriTrypDB-45_LmajorFriedlin_Genome.fasta simulated_aligned_reads.fasta simulated_unaligned_reads.fasta \
| samtools sort > simulated.bam
samtools index simulated.bam
samtools idxstats simulated.bam | column -t
LmjF.01 268988 0 0
LmjF.02 355712 15 0
LmjF.03 384502 0 0
LmjF.04 472852 0 0
LmjF.05 465823 1 0
LmjF.06 516869 0 0
LmjF.07 596352 14 0
LmjF.08 574960 0 0
LmjF.09 573434 0 0
LmjF.10 570865 0 0
LmjF.11 582573 1 0
LmjF.12 675346 0 0
LmjF.13 654595 0 0
LmjF.14 622644 2 0
LmjF.15 629517 0 0
LmjF.16 714651 0 0
LmjF.17 684829 0 0
LmjF.18 739748 0 0
LmjF.19 702208 0 0
LmjF.20 742537 0 0
LmjF.21 772972 22 0
LmjF.22 716602 0 0
LmjF.23 772565 0 0
LmjF.24 840950 0 0
LmjF.25 912845 23 0
LmjF.26 1091540 0 0
LmjF.27 1130424 2 0
LmjF.28 1160104 0 0
LmjF.29 1212663 0 0
LmjF.30 1403434 0 0
LmjF.31 1484328 5 0
LmjF.32 1604637 0 0
LmjF.33 1583653 1 0
LmjF.34 1866748 86 0
LmjF.35 2090474 4 0
LmjF.36 2682151 12218 0
* 0 0 7856
The training reads look reasonably mapped:
minimap2 -a -x map-ont -t 12 TriTrypDB-45_LmajorFriedlin_Genome.fasta fastq_runid_ae10a0b01321b03612390f5e1e5fe2b0d834516a_0_0.fastq.gz \
| samtools sort > out.bam
samtools index out.bam
samtools idxstats out.bam | column -t
LmjF.01 268988 24 0
LmjF.02 355712 24 0
LmjF.03 384502 30 0
LmjF.04 472852 33 0
LmjF.05 465823 30 0
LmjF.06 516869 34 0
LmjF.07 596352 41 0
LmjF.08 574960 105 0
LmjF.09 573434 45 0
LmjF.10 570865 42 0
LmjF.11 582573 43 0
LmjF.12 675346 173 0
LmjF.13 654595 44 0
LmjF.14 622644 67 0
LmjF.15 629517 92 0
LmjF.16 714651 63 0
LmjF.17 684829 48 0
LmjF.18 739748 61 0
LmjF.19 702208 67 0
LmjF.20 742537 55 0
LmjF.21 772972 56 0
LmjF.22 716602 57 0
LmjF.23 772565 42 0
LmjF.24 840950 79 0
LmjF.25 912845 72 0
LmjF.26 1091540 88 0
LmjF.27 1130424 213 0
LmjF.28 1160104 89 0
LmjF.29 1212663 63 0
LmjF.30 1403434 292 0
LmjF.31 1484328 254 0
LmjF.32 1604637 128 0
LmjF.33 1583653 155 0
LmjF.34 1866748 224 0
LmjF.35 2090474 198 0
LmjF.36 2682151 237 0
* 0 0 1570
This is with NanoSim 3.1.0.
Any idea why I'm getting reads from virtually one chromosome only?
Sorry for this late reply @dariober
It is indeed very weird behaviour by NanoSim which I have never seen before. Have you tried using pre-trained models as well? Is the issue remains in that case?
It would be also insightful to investigate the output files from the training stage. This will help us know whether the issue is related to the training process or the simulation stage. For example, you may check the length distributions of aligned reads or the histogram of match, mismatch, and indels.