nanopolish
Running nanopolish with reproducible results
Hi, I ran nanopolish eventalign three times with the same input. Two of the three runs used the same conda env, to ensure that all packages were identical. Their results are very similar, but not exactly the same.
Run_1:
[readdb] num reads: 329833, num reads with path to fast5: 329833
Eventalign
[post-run summary] total reads: 167541, unparseable: 0, qc fail: 775, could not calibrate: 53819, no alignment: 2971, bad fast5: 0
Eventalign file has been created
And,
Run_2:
[readdb] num reads: 329833, num reads with path to fast5: 329833
Eventalign
[post-run summary] total reads: 167540, unparseable: 0, qc fail: 775, could not calibrate: 53819, no alignment: 2971, bad fast5: 0
Eventalign file has been created
The eventalign outputs have the same line count (both files are 13 GB). However, their content differs: the order of the kmers is not the same. I wonder why the total read counts are not exactly the same, and why I don't get the exact same eventalign output file.
When I compared these runs to another run in a different conda env, with the same raw input that had already been run through nanopolish before, the nanopolish eventalign output was 41 GB.
With the same input, how is this possible? Did the results accumulate?
Can I make sure that my results are replicable, e.g. with a seed?
My command line is:
./nanopolish eventalign --reads $path_to_main/fastq_pass/T_convert_Merged_$library_name.fa --bam $minimap_output_path/sorted_Primary-Alignments_T_convert_$library_name.bam -t 20 --genome $minimap_reference --summary=$eventalign_output/summary_$library_name"_eventalign".txt --scale-events > $eventalign_output/eventalign_results_of_$library_name.txt
I would be glad to hear back from you. Thanks,
For 1) the variables used in the final status line ([post-run summary] total reads: ...) are not protected by a mutex, so the numbers may vary a little when running with multiple threads. The output .tsv file is protected with a mutex, so you should always get the same number of records, but the order in which each batch of reads is written to the output file depends on the order in which the threads complete. If the ordering difference is a problem for your workflow, you could (carefully) sort the results into a consistent order.
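For example, a minimal sketch of such a sort in bash, assuming the default eventalign TSV layout (a single header line, contig in column 1, position in column 2, read_index in column 4; the file names are placeholders):

# Keep the header, then sort the records by contig, reference position and
# read index so outputs from repeated runs can be compared directly.
{ head -n 1 eventalign_results.txt
  tail -n +2 eventalign_results.txt | sort -t$'\t' -k1,1 -k2,2n -k4,4n
} > eventalign_results.sorted.txt

Note that sorting a 13 GB file may need a large temporary directory (see sort -T).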
> When I compared these runs to another run in a different conda env, with the same raw input that had already been run through nanopolish before, the nanopolish eventalign output was 41 GB.
This one is harder for me to answer; I wouldn't expect that drastic a difference. I'll need more information to help with this, I think (program versions, sizes of the input bam files, etc.).
Thank you very much for your reply, I see. In this case, can you please explain why I get different line counts in the .tsv file when I run nanopolish on the same library once versus multiple times? On the same raw (fast5 and fastq) files, if I run nanopolish multiple times, the output .tsv file is 41 GB; if it is run once, the .tsv file is 13 GB. For the 41 GB file, I just ran the same raw files again, and now the output is as follows:
[readdb] num reads: 329833, num reads with path to fast5: 329833
Eventalign
[post-run summary] total reads: 502581, unparseable: 0, qc fail: 2325, could not calibrate: 161457, no alignment: 8913, bad fast5: 0
Eventalign file has been created
It is ~3 times more than the eventalign file generated by the first run...
I'd be glad to hear your insights, thank you.
I can't really say without seeing the data and having an example to debug. Is it possible the .bam file was incomplete when nanopolish eventalign was started in the 13GB case?
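One quick way to rule that out (a sketch using standard samtools subcommands; the file name is a placeholder for your sorted bam):

# quickcheck is silent and exits 0 if the bam is intact; a truncated file
# (e.g. one missing its BGZF EOF block) makes it exit non-zero.
samtools quickcheck sorted_Primary-Alignments.bam && echo "bam looks intact"
# flagstat reports record counts that should match between the two runs.
samtools flagstat sorted_Primary-Alignments.bam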
I don't think so, because in my pipeline nanopolish starts only after minimap2 has written its results. I ran nanopolish on this raw data 3 separate times to confirm that 13 GB is the correct size. Also, the number of reads going into nanopolish eventalign (329833) is the same whether I run nanopolish once or multiple times on the same library, so I believe an incomplete .bam is not the reason.
Based on my understanding, you are saying nanopolish should produce the same .tsv file from the same raw data in every run, no matter whether it is the first run or the fifth? I can try to share the data. Would you need a piece of the bam file plus the fast5 and fastq of those reads?
Could you check whether the 41GB file contains any duplicated data? I suggest grepping for a single read name in each file to see whether the line counts match in each case.
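Something along these lines (a sketch; <read_name> is a placeholder for any read id present in both outputs, and the file names are hypothetical; note that eventalign prints numeric read indexes unless run with --print-read-names, in which case you can grep the index instead):

# If the 41GB file holds duplicated records, the same read will have roughly
# 3x as many lines there as in the 13GB file.
grep -c "<read_name>" eventalign_results_13G.txt
grep -c "<read_name>" eventalign_results_41G.txt
# Or look for fully duplicated lines directly (slow on a 41GB file):
sort eventalign_results_41G.txt | uniq -d | head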