NanoSim Reads come mostly from one chromosome only

Reads come mostly from one chromosome only

Open dariober opened this issue 2 years ago • 1 comments

trafficstars

Hello - I'm simulating reads from Leishmania major genome using training reads sequenced from this genome:

read_analysis.py genome -o training -i fastq_runid_ae10a0b01321b03612390f5e1e5fe2b0d834516a_0_0.fastq.gz \
    -rg TriTrypDB-45_LmajorFriedlin_Genome.fasta -t 16

simulator.py genome -c training -n 20000 -t 32 -b guppy -rg TriTrypDB-45_LmajorFriedlin_Genome.fasta

After mapping the simulated reads back to the reference genome, I find that most reads come from one chromosome only:

minimap2 -a -x map-ont -t 12 TriTrypDB-45_LmajorFriedlin_Genome.fasta simulated_aligned_reads.fasta simulated_unaligned_reads.fasta \
| samtools sort > simulated.bam
samtools index simulated.bam

samtools idxstats simulated.bam | column -t
LmjF.01  268988   0      0
LmjF.02  355712   15     0
LmjF.03  384502   0      0
LmjF.04  472852   0      0
LmjF.05  465823   1      0
LmjF.06  516869   0      0
LmjF.07  596352   14     0
LmjF.08  574960   0      0
LmjF.09  573434   0      0
LmjF.10  570865   0      0
LmjF.11  582573   1      0
LmjF.12  675346   0      0
LmjF.13  654595   0      0
LmjF.14  622644   2      0
LmjF.15  629517   0      0
LmjF.16  714651   0      0
LmjF.17  684829   0      0
LmjF.18  739748   0      0
LmjF.19  702208   0      0
LmjF.20  742537   0      0
LmjF.21  772972   22     0
LmjF.22  716602   0      0
LmjF.23  772565   0      0
LmjF.24  840950   0      0
LmjF.25  912845   23     0
LmjF.26  1091540  0      0
LmjF.27  1130424  2      0
LmjF.28  1160104  0      0
LmjF.29  1212663  0      0
LmjF.30  1403434  0      0
LmjF.31  1484328  5      0
LmjF.32  1604637  0      0
LmjF.33  1583653  1      0
LmjF.34  1866748  86     0
LmjF.35  2090474  4      0
LmjF.36  2682151  12218  0
*        0        0      7856

The training reads look reasonably mapped:

minimap2 -a -x map-ont -t 12 TriTrypDB-45_LmajorFriedlin_Genome.fasta fastq_runid_ae10a0b01321b03612390f5e1e5fe2b0d834516a_0_0.fastq.gz \
| samtools sort > out.bam
samtools index out.bam

samtools idxstats out.bam | column -t
LmjF.01  268988   24   0
LmjF.02  355712   24   0
LmjF.03  384502   30   0
LmjF.04  472852   33   0
LmjF.05  465823   30   0
LmjF.06  516869   34   0
LmjF.07  596352   41   0
LmjF.08  574960   105  0
LmjF.09  573434   45   0
LmjF.10  570865   42   0
LmjF.11  582573   43   0
LmjF.12  675346   173  0
LmjF.13  654595   44   0
LmjF.14  622644   67   0
LmjF.15  629517   92   0
LmjF.16  714651   63   0
LmjF.17  684829   48   0
LmjF.18  739748   61   0
LmjF.19  702208   67   0
LmjF.20  742537   55   0
LmjF.21  772972   56   0
LmjF.22  716602   57   0
LmjF.23  772565   42   0
LmjF.24  840950   79   0
LmjF.25  912845   72   0
LmjF.26  1091540  88   0
LmjF.27  1130424  213  0
LmjF.28  1160104  89   0
LmjF.29  1212663  63   0
LmjF.30  1403434  292  0
LmjF.31  1484328  254  0
LmjF.32  1604637  128  0
LmjF.33  1583653  155  0
LmjF.34  1866748  224  0
LmjF.35  2090474  198  0
LmjF.36  2682151  237  0
*        0        0    1570

This is with NanoSim 3.1.0.

Any idea why I'm getting reads from virtually one chromosome only?

Dec 19 '22 18:12 dariober

Sorry for this late reply @dariober

It is indeed very weird behaviour by NanoSim which I have never seen before. Have you tried using pre-trained models as well? Is the issue remains in that case?

It would be also insightful to investigate the output files from the training stage. This will help us know whether the issue is related to the training process or the simulation stage. For example, you may check the length distributions of aligned reads or the histogram of match, mismatch, and indels.

Feb 25 '23 02:02 saberhq

NanoSim NanoSim copied to clipboard

Reads come mostly from one chromosome only

NanoSim
NanoSim copied to clipboard