nanopolish icon indicating copy to clipboard operation
nanopolish copied to clipboard

Running nanopolish with reproducible results

Open kaltinel opened this issue 2 years ago • 5 comments

Hi, I run nanopolish event align three times with the same input. 2/3 of them had been run in the same conda env to ensure that all packages are the same. Their results are very similar, but not exactly the same.

Run_1:
[readdb] num reads: 329833, num reads with path to fast5: 329833
Eventalign
[post-run summary] total reads: 167541, unparseable: 0, qc fail: 775, could not calibrate: 53819, no alignment: 2971, bad fast5: 0
Eventalign file has been created

And,

Run_2_
[readdb] num reads: 329833, num reads with path to fast5: 329833
Eventalign
[post-run summary] total reads: 167540, unparseable: 0, qc fail: 775, could not calibrate: 53819, no alignment: 2971, bad fast5: 0
Eventalign file has been created
  1. The eventalign outputs have the same line count (13G). However their content (the order of the kmers) is different.. I wonder why total reads are not exactly the same and why not I get the exact same file as eventalign output.

When I compared these runs to another run where I use a different conda env, with the same raw input which was run with a nanopolish before, then nanopolish eventalign output is 41G..

With the same input, how is this possible? Did results accumulate? Can I make sure that I can replicate my results? With a seed? My command line is : ./nanopolish eventalign --reads $path_to_main/fastq_pass/T_convert_Merged_$library_name.fa --bam $minimap_output_path/sorted_Primary-Alignments_T_convert_$library_name.bam -t 20 --genome $minimap_reference --summary=$eventalign_output/summary_$library_name"_eventalign".txt --scale-events > $eventalign_output/eventalign_results_of_$library_name.txt

I would be glad to hear back from you. Thanks,

kaltinel avatar Aug 17 '22 15:08 kaltinel

For 1) the variables used in the final status line ([post-run summary] num reads:) are not protected by a mutex, so the numbers may vary a little when running with multiple threads. The output .tsv file is protected with a mutex, so you should always have the same number of records, but the order each batch of reads is written to the output file depends on the order in which the threads complete. If the ordering difference is a problem for your workflow you could (carefully) sort the results to have a consistent order.

When I compared these runs to another run where I use a different conda env, with the same raw input which was run with a nanopolish before, then nanopolish eventalign output is 41G..

This one is harder for me to answer. I wouldn't expect that drastic of a difference. I'll need more information to help with this I think (program versions, sizes of the input bam files, etc)

jts avatar Aug 17 '22 20:08 jts

Thank you very much for your reply. I see..In this case,can you please explain why I have different line counts in the .tsv file, when I ran nanopolish on the same library one time, or multiple times? On the same raw (fast5 and fastq) files, if I ran nanopolish multiple times, the output .tsv file is 41GB. If it is ran one time, then tsv file is 13GB.. For the 41GB file, I just ran the same raw files again, and now the output is as follows:

[readdb] num reads: 329833, num reads with path to fast5: 329833
Eventalign
[post-run summary] total reads: 502581, unparseable: 0, qc fail: 2325, could not calibrate: 161457, no alignment: 8913, bad fast5: 0
Eventalign file has been created

It is ~3 times more than the eventalign file generated by the first run... I'd be glad to hear your insights, thank you

kaltinel avatar Aug 18 '22 08:08 kaltinel

I can't really say without seeing the data and having an example to debug. Is it possible the .bam file was incomplete when nanopolish eventalign was started in the 13GB case?

jts avatar Aug 18 '22 19:08 jts

I don't think so because in my pipeline, nanopolish starts after minimap2 outputs its results. I ran the nanopolish on this raw data 3 separate times to ensure that 13GB is the correct size. Also, the number of reads going into the nanopolish eventalign is the same in between single run or multiple run of nanopolish on the same library (329833), therefore I believe incomplete .bam is not the reason.

Based on my understanding, you say nanopolish should give the same .tsv file on the raw data in every run? Doesn't matter the first run or the fifth? I can try to share the data. Would you need the piece of bam file and fast5 and fastq of the reads?

kaltinel avatar Aug 19 '22 08:08 kaltinel

Could you check whether the 41GB file contains any duplicated data? I suggest grepping for a single read name in each file to see whether the number of lines match in each case.

jts avatar Aug 19 '22 13:08 jts