Whether NanoSim could generate Nanopore reads by simulation

Open Jingquan-Li opened this issue 4 years ago • 12 comments

Hi @cheny19, I noticed that NanoSim is described as a Nanopore sequence read simulator. I am looking for software that can generate Nanopore reads from a given genome or FASTA file. When I looked through the NanoSim scripts, I could not find one that does this. I would really appreciate your help!

Thanks.

Jingquan-Li avatar Jun 18 '20 07:06 Jingquan-Li

Yes, you just need to run simulator.py to simulate ONT reads. You can find the help info in the README.md file.

cheny19 avatar Jun 18 '20 17:06 cheny19

If I just run simulator.py to simulate ONT reads without running step one, I get this error:

simulator.py genome -dna_type linear -rg 1M_12501.fa -c ssc_1M -max 90000 -min 20000 -n 1000 -t 6

Traceback (most recent call last):
  File "/home/huangtao/LJQ/conda/envs/metawrap-env/bin/simulator.py", line 1513, in <module>
    main()
  File "/home/huangtao/LJQ/conda/envs/metawrap-env/bin/simulator.py", line 1422, in main
    read_profile(ref_g, None, number, model_prefix, perfect, args.mode, strandness, None, False, dna_type)
  File "/home/huangtao/LJQ/conda/envs/metawrap-env/bin/simulator.py", line 270, in read_profile
    with open(model_prefix + "_strandness_rate", 'r') as strand_profile:
IOError: [Errno 2] No such file or directory: 'ssc_1M_strandness_rate'

I also noticed that the strandness_rate file is only produced by running read_analysis.py, so I am confused.

Jingquan-Li avatar Jun 19 '20 02:06 Jingquan-Li

Right, if you want to use your own model, you have to run step1 first. However, if you don't want to train your own model, you can direct -c to our pre-trained model (provided in the package), and run simulator.py. You just need to untar the pre-trained model, and specify the directory and prefix to -c option.

cheny19 avatar Jun 19 '20 03:06 cheny19

I downloaded the human_NA12878_DNA_FAB49712_albacore.tar.gz you provided and ran tar -xvzf human_NA12878_DNA_FAB49712_albacore.tar.gz, which produced this error:

gzip: stdin: not in gzip format
tar: Child returned status 1
tar: Error is not recoverable: exiting now

Jingquan-Li avatar Jun 19 '20 03:06 Jingquan-Li

It happened to me yesterday as well. You'll need to clone the whole repo, or open the pre-trained model folder on GitHub, click the model you want to use, and download it from there. GitHub seems to have an issue where the file is broken if you right-click to download it directly.

cheny19 avatar Jun 19 '20 18:06 cheny19
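
A quick way to tell a broken download from a real archive before running tar is to look at the first two bytes of the file: every gzip stream starts with 0x1f 0x8b. The sketch below is only an illustration, not part of NanoSim; the helper name looks_like_gzip and the demo file names are made up, and the "bad" file merely assumes that a broken download might be a web page saved under a .tar.gz name.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def looks_like_gzip(path):
    """Return True if the file at `path` starts with the gzip magic bytes."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


if __name__ == "__main__":
    # Demo on two throwaway files (hypothetical names, not real NanoSim models).
    workdir = tempfile.mkdtemp()
    good = os.path.join(workdir, "good.tar.gz")
    bad = os.path.join(workdir, "bad.tar.gz")
    with gzip.open(good, "wb") as handle:
        handle.write(b"payload")          # genuinely gzip-compressed
    with open(bad, "wb") as handle:
        handle.write(b"<!DOCTYPE html>")  # assumed shape of a broken download
    print(good, looks_like_gzip(good))
    print(bad, looks_like_gzip(bad))
```

If the check fails on a downloaded model, re-downloading it as described above (cloning the repo or using the download button on the file's own page) is the thing to try.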

Thanks for your patient guidance! I have downloaded your trained model, but I get errors when I run:

./NanoSim2.6.0/simulator.py genome -dna_type linear -rg 1M_12501.fa -c human_NA12878_DNA_FAB49712_albacore/training -max 90000 -min 20000 -n 1000

Traceback (most recent call last):
  File "./NanoSim2.6.0/simulator.py", line 1702, in <module>
    main()
  File "./NanoSim2.6.0/simulator.py", line 1599, in main
    read_profile(ref_g, None, number, model_prefix, perfect, args.mode, strandness, None, False, dna_type, None)
  File "./NanoSim2.6.0/simulator.py", line 411, in read_profile
    kde_unaligned = joblib.load(model_prefix + "_unaligned_length.pkl")
  File "/home/huangtao/LJQ/conda/envs/metawrap-env/lib/python2.7/site-packages/joblib/numpy_pickle.py", line 605, in load
    obj = _unpickle(fobj, filename, mmap_mode)
  File "/home/huangtao/LJQ/conda/envs/metawrap-env/lib/python2.7/site-packages/joblib/numpy_pickle.py", line 529, in _unpickle
    obj = unpickler.load()
  File "/home/huangtao/LJQ/conda/envs/metawrap-env/lib/python2.7/pickle.py", line 864, in load
    dispatch[key](self)
  File "/home/huangtao/LJQ/conda/envs/metawrap-env/lib/python2.7/pickle.py", line 892, in load_proto
    raise ValueError, "unsupported pickle protocol: %d" % proto
ValueError: unsupported pickle protocol: 3

It seems that the model_prefix + "_unaligned_length.pkl" file was generated with Python 3, but I am loading it with Python 2.7.

Jingquan-Li avatar Jun 20 '20 02:06 Jingquan-Li
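
The diagnosis above can be checked without loading the model at all. A pickle written with protocol 2 or higher begins with the byte 0x80 followed by the protocol number, and Python 2.7 can only read protocols up to 2. The sketch below is a hedged illustration on plain in-memory pickles; the helper name pickle_protocol is made up, and it assumes an uncompressed pickle (a joblib file saved with compression starts with a compressor header instead, so this check would not apply to it).

```python
import pickle


def pickle_protocol(first_bytes):
    """Return the protocol declared in a pickle header.

    Protocols 0 and 1 have no header, so None is returned for them.
    """
    if len(first_bytes) >= 2 and first_bytes[0:1] == b"\x80":
        return ord(first_bytes[1:2])
    return None


if __name__ == "__main__":
    # The running interpreter can read every protocol up to this number.
    print("highest readable protocol:", pickle.HIGHEST_PROTOCOL)

    # Compare headers of the same object written with different protocols.
    for proto in range(0, pickle.HIGHEST_PROTOCOL + 1):
        header = pickle.dumps({"kde": [1, 2, 3]}, protocol=proto)[:2]
        print(proto, pickle_protocol(header))
```

To inspect a file on disk, read its first two bytes in binary mode and pass them to the helper; a result larger than pickle.HIGHEST_PROTOCOL in the interpreter you plan to use means the file cannot be loaded there.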

This issue has been reported by other users in #81. Could you try from sklearn.externals import joblib instead of import joblib and see whether the error still occurs?

cheny19 avatar Jun 20 '20 21:06 cheny19

Hi @Leejquan, we have an update on this issue. We have finally finished all the coding and testing needed to change the way model files are imported. We have also re-trained all the models, so we hope this problem is resolved in the NanoSim v3.0.0 pre-release. Please give it a shot and let me know how it works for you. Thanks for waiting so long.

cheny19 avatar Nov 19 '20 14:11 cheny19

You just need to untar the pre-trained model, and specify the directory and prefix to -c option.

Could you update the readme example to include this information? Just trying human as written currently doesn't seem to work.

(Ideally, a path to somewhere inside the conda installation would be best, assuming they are already part of this.)

RagnarGrootKoerkamp avatar Apr 21 '22 09:04 RagnarGrootKoerkamp

Please note that the -c option in the simulation stage specifies the location and prefix of the error profiles generated in the characterization step (default = training). The name human that you mentioned from the README file is a symbolic name referring to the models trained on human data.

-c MODEL_PREFIX, --model_prefix MODEL_PREFIX

For more information on the parameters for each mode in the training and simulation stages, you may run read_analysis.py -h or simulator.py -h. There are five modes in read_analysis.py and three modes in simulator.py.

I will make a note to update the README file to make this clear.

SaberHQ avatar Apr 21 '22 18:04 SaberHQ
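
Because -c is a prefix rather than a directory, the simulator builds file names by appending suffixes to it, as the two tracebacks earlier in this thread show (model_prefix + "_strandness_rate" and model_prefix + "_unaligned_length.pkl"). The sketch below is a rough pre-flight check built only on those two suffixes; the helper name missing_model_files is made up, a real model contains more files than these, and the exact set may differ between NanoSim versions.

```python
import os
import tempfile

# Suffixes seen in the tracebacks in this thread. This is NOT the full list
# of model files, and which ones are needed may depend on the NanoSim version.
KNOWN_SUFFIXES = ("_strandness_rate", "_unaligned_length.pkl")


def missing_model_files(model_prefix, suffixes=KNOWN_SUFFIXES):
    """Return the paths model_prefix + suffix that do not exist on disk."""
    candidates = [model_prefix + suffix for suffix in suffixes]
    return [path for path in candidates if not os.path.isfile(path)]


if __name__ == "__main__":
    # Demo with a throwaway directory standing in for an untarred model.
    model_dir = tempfile.mkdtemp()
    prefix = os.path.join(model_dir, "training")  # directory + prefix, as passed to -c
    open(prefix + "_strandness_rate", "w").close()
    for path in missing_model_files(prefix):
        print("missing:", path)
```

With the pre-trained human model from this thread, the value to check would be the same string passed to -c, for example human_NA12878_DNA_FAB49712_albacore/training.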

Hello, I want to simulate some ONT reads from bacterial and viral genomes. However, I notice that your latest pre-trained models are trained on human datasets, which may have different sequence patterns from bacterial ones. Which pre-trained model should I use to get acceptable simulation results on my dataset?

zhanghaoyu9931 avatar May 06 '22 04:05 zhanghaoyu9931

Hey @zhanghaoyu9931, I would highly recommend training your own model and using the trained profiles to simulate reads.

The README file is very informative and will guide you through running the training pipeline. It's fast and does not require much computing power. Please refer to the following code for more information:

https://github.com/bcgsc/NanoSim/blob/master/src/read_analysis.py

SaberHQ avatar May 06 '22 19:05 SaberHQ