
Reading the simulated_aligned_reads

Open Evandio-Martin opened this issue 1 year ago • 3 comments

I want to analyze the NanoSim output in simulated_aligned_reads and compare it with the input, the GRCh37 human reference genome. I am using the pre-trained human Guppy model provided with NanoSim.

  1. I have a question about how to read this read name:

NC-000011_21773883_aligned_2_F_2_2258_40

  • Based on the README, 21773883 is the start position. Is it the character index counted from the top left of the input file? In other words, should we start counting from the NNNNN?
  • 2 is the sequence index. I don't understand this part. How many lines are in each sequence?
  2. Last question: is it possible to compare the input and output to check how the NanoSim output differs from the input?

Thank you very much

Evandio-Martin avatar Jun 04 '24 06:06 Evandio-Martin

Thanks for your interest in using NanoSim @Evandio-Martin

  • Start position is the start index on the reference. For a genomic read simulation, it is a random position on the reference genome from which NanoSim extracts the read. In your example, 21773883 is the 0-based (Pythonic) start position on that chromosome.
  • Sequence index is a unique identifier for the sequence generated.
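As a rough illustration, the read name can be split into its fields like this. This is only a sketch: the thread confirms the start position and the sequence index, while the remaining field names (`kind`, `strand`, `head`, `middle`, `tail`) are assumptions based on the naming scheme the README is understood to describe, and `parse_read_name` is a hypothetical helper, not part of NanoSim.

```python
from typing import NamedTuple


class ReadName(NamedTuple):
    chrom: str    # reference sequence name (may itself contain "_")
    start: int    # 0-based start position on the reference
    kind: str     # "aligned" or "unaligned" (assumed meaning)
    index: int    # sequence index: just a unique identifier
    strand: str   # "F" or "R" (assumed meaning)
    head: int     # assumed: length of the unaligned head
    middle: int   # assumed: length of the aligned middle region
    tail: int     # assumed: length of the unaligned tail


def parse_read_name(name: str) -> ReadName:
    """Split a read name from the right, so underscores in the
    chromosome name do not break the parsing."""
    chrom, start, kind, index, strand, head, middle, tail = (
        name.lstrip(">").rsplit("_", 7)
    )
    return ReadName(chrom, int(start), kind, int(index), strand,
                    int(head), int(middle), int(tail))


fields = parse_read_name("NC-000011_21773883_aligned_2_F_2_2258_40")
print(fields.chrom, fields.start, fields.index)
```

Splitting from the right is a deliberate choice here, since only the last seven fields have a fixed layout.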

I did not get your second question. What do you mean by input and output? Did you train a NanoSim model yourself or did you use a pre-trained model? If you used a pre-trained model, by "input" do you mean the reference genome used for the simulation? In what aspects do you want to compare the reference genome and simulated reads?

saberhq avatar Jun 06 '24 02:06 saberhq

Thank you very much for your answer,

For the second question, that's right, I'm using the pre-trained model. By input I mean the GRCh37 reference genome, and by output I mean the simulated_aligned_reads. I don't know whether my approach is right, but I take the start position of a simulated aligned read on a given chromosome and look up that position in the reference sequence of the same chromosome. From there, I compare the two to check how much the output of ./simulator.py differs from the input.

I tried this, but the two sequences turned out to be totally different, so I thought I was reading the start position wrongly: I'm not sure whether the start position is counted from the Ns or from after the Ns at the beginning of the chromosome.

And I cannot analyze by sequence index because I don't know where each sequence starts or stops, since there is no separation between sequences from start to end.

Thank you very much.

Evandio-Martin avatar Jun 06 '24 05:06 Evandio-Martin

Ah, sorry. I just realized how the sequence index works. So if I want to analyze one chromosome, the sequence index doesn't matter because it's just an identifier for each sequence, right? So I should focus only on the start position.

Evandio-Martin avatar Jun 06 '24 05:06 Evandio-Martin
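
For anyone attempting the same comparison, here is a minimal sketch of looking up a read's window on the reference. It assumes that the start position is a 0-based index into the chromosome sequence with FASTA line breaks removed and leading Ns counted, that reads on the "R" strand are the reverse complement of the reference window, and that the head and tail lengths in the read name describe unaligned flanks of the read. None of these details are confirmed in this thread beyond the 0-based start position, and all function names are hypothetical.

```python
def parse_fasta(lines):
    """Return {record_id: sequence} from an iterable of FASTA lines
    (for example an open file handle), joining wrapped lines."""
    records, name, chunks = {}, None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                records[name] = "".join(chunks)
            name, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records


_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reference_window(chrom_seq, start, length, strand="F"):
    """Slice the reference at a 0-based start; Ns count as positions."""
    window = chrom_seq[start:start + length]
    if strand == "R":  # assumption: R reads are reverse-complemented
        window = window.translate(_COMPLEMENT)[::-1]
    return window


def aligned_part(read_seq, head, tail):
    """Drop the (assumed) unaligned head and tail from a simulated read."""
    return read_seq[head:len(read_seq) - tail]


ref = parse_fasta([">chr11 demo", "NNNNACGT", "ACGTAC"])["chr11"]
print(reference_window(ref, 4, 4))       # forward-strand window
print(aligned_part("ttACGTggg", 2, 3))   # read without head and tail
```

Because the simulated reads carry substitutions and indels, the two strings are not expected to match character by character; aligning them (rather than comparing by position) gives a more meaningful picture of the differences.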