NanoSim
NanoSim copied to clipboard
Stuck at simulation stage
Hi, I am trying to use NanoSim 3.0.0. (downloaded source code from (https://github.com/bcgsc/NanoSim/releases/tag/v3.0.0)) for simulating ONT reads. I created my own profile using: ./read_analysis.py transcriptome -i ont_run.fq -rg ucsc_hg38.fasta -rt hg38_transcriptome_gff.fa -annot hg38_transcriptome.gff3 -t 38
. It seems to run fine with the following output:
2023-05-30 15:21:23: Read pre-process and unaligned reads analysis 2023-05-30 15:21:39: Alignment with minimap2 to reference transcriptome 2023-05-30 15:24:33: Alignment with minimap2 to reference genome 2023-05-30 16:00:19: Processing transcriptome alignment file: sam 2023-05-30 16:00:44: Processing genome alignment file: sam 2023-05-30 16:01:14: Parse the annotation file (GTF/GFF3) 2023-05-30 16:01:47: Modeling Intron Retention 2023-05-30 16:01:47: Reading intron coordinates from GFF file 2023-05-30 16:02:30: Read primary genome alignment for each read 2023-05-30 16:07:03: Read primary transcriptome alignment for each read 2023-05-30 16:09:49: Calculating probabilities for each intron retention event 2023-05-30 16:16:23: Aligned reads analysis 2023-05-30 16:16:40: Computing KDEds processed >> 1215001 2023-05-30 16:16:40: Computing 2D KDE for transcriptome ref length 2023-05-30 16:16:41: Computing KDE for aligned region 2023-05-30 16:16:41: Computing KDE for aligned reads 2023-05-30 16:16:42: Computing KDE for unaligned length 2023-05-30 16:16:42: Computing KDE for ht ratio 2023-05-30 16:16:43: Unaligned reads analysis 2023-05-30 16:16:43: match and error models 2023-05-30 16:19:13: Model fitting 2023-05-30 16:19:18: Mismatch fitting start Mismatch parameters: 0.10299333509559308 0.787357964367962 0.2182370542984321 Residual 6.509615946814762e-05 2023-05-30 16:19:48: Mismatch fitting done 2023-05-30 16:19:48: Insertion fitting start WARNING! Insertion parameters may not be optimal! 0.8374719781176181 0.9192989587536269 0.008942175282069826 0.9880211696155594 Residual 0.0030656834500610852 2023-05-30 16:21:14: Insertion fitting done 2023-05-30 16:21:14: Deletion fitting start WARNING! Deletion parameters may not be optimal! 0.9569706479182295 1.032499061802694 0.2866297773531245 0.8752291450180834 Residual 0.0005721754355394459 2023-05-30 16:22:00: Deletion fitting done 2023-05-30 16:22:00: Finished!
But when I try to run simulator using ./simulator.py transcriptome -rt hg38_transcriptome_gff.fa -rg ucsc_hg38.fasta -c training -e expression_abundance.tsv -n 20000 -max 10000 -min 200 -b guppy -r cDNA_1D --fastq -o simulated_1D
it gets stuck at:
2023-05-31 09:48:19: Read in reference 2023-05-31 09:48:21: Read in expression profile 2023-05-31 09:48:21: Read in reference genome and create .fai index file 2023-05-31 09:48:21: Read in IR markov model 2023-05-31 09:48:21: Read in GFF3 annotation file 2023-05-31 09:48:41: Read error profile 2023-05-31 09:48:41: Read KDF of unaligned reads 2023-05-31 09:48:41: Read KDF of aligned reads 2023-05-31 09:48:41: Read chimeric simulation information 2023-05-31 09:48:41: Start simulation of aligned reads
I have ensured that my expression_abundance.tsv file have transcript_id with "ENS". I would really appreciate if you can help me with this. Thanks.
I think it has to do with these settings: -n 20000 -max 10000 -min 200
It sounds like a similar issue in metagenome mode: https://github.com/bcgsc/NanoSim/issues/184
@SaberHQ Possibly a bug in transcriptome mode?
Thanks for reporting this @kainth-amoldeep
@kmnip I do not think it is related to the infinite loop bug you mentioned (#184) because, in the transcriptome mode, NanoSim first selects a reference transcriptome based on the expression profiles and then based on the 2 dimensional KDE distributions, it picks an aligned length for the simulation. Therefore, max and min lengths are not used. They are only used in generating unaligned reads I believe.
That being said, dear @kainth-amoldeep would you mind trying again not using the min and max parameters and see if it works or not? In addition, you said you ensured that the transcript ids in the expression file start with "ENS". Can you also double-check if they are the same in your reference transcriptome file?
Did you calculate the expression profiles using the module provided by NanoSim? I mean did you run the NanoSim -quantify module?
@SaberHQ Good points!
@kainth-amoldeep I suggest you can also try our development code on the master branch. We have had a number of bugfixes since v3.0.0.
Hi, Thanks for your comments and suggestions. I tried without min max and it was still stuck. @SaberHQ I made an expression abundance file on the basis of short read sequencing; may something has gone wrong in that in terms of nomenclature. However, I was able to simulate reads on the basis of a pretrained model. So, for now, it serves my purpose. Thanks a lot again.