nanopolish

Question about eventalign parallelization at file level

Open mmiladi opened this issue 4 years ago • 5 comments

Hi,

Is it possible to speed up eventalign computations by splitting the input files and/or windowing over regions?

For example, to speed up nanopolish eventalign --reads all.fastq --bam all.bam --genome genome.fa > all.tsv, split the fastq file and then run:

nanopolish eventalign --reads half1.fastq --bam all.bam --genome genome.fa > half1.tsv
nanopolish eventalign --reads half2.fastq --bam all.bam --genome genome.fa > half2.tsv
cat half1.tsv half2.tsv > all.tsv

Best,

mmiladi avatar Apr 25 '20 08:04 mmiladi

Yes, that is the recommended way to speed it up.

Jared

jts avatar Apr 25 '20 11:04 jts

Great, thanks. Would this also work with the window option '-w'? For the data I am using, -w seems to have no effect: the .tsv table contains positions outside the requested range.

mmiladi avatar Apr 25 '20 12:04 mmiladi

Sorry, I misread your issue initially (I shouldn't try to answer emails first thing in the morning...).

Splitting the fastq would work, but it isn't the recommended way: each run still iterates over every read in the bam and skips the ones whose signal data it can't find. Instead, provide a coordinate range as the last argument (without -w, though):

nanopolish eventalign --reads all.fastq --bam all.bam --genome genome.fa chrA:0-1,000,000
nanopolish eventalign --reads all.fastq --bam all.bam --genome genome.fa chrA:1,000,000-2,000,000
[...]

jts avatar Apr 25 '20 13:04 jts
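
For anyone scripting the coordinate-range approach above, here is a minimal sketch in bash. The helper names make_windows and merge_tsv are hypothetical, not part of nanopolish, and the contig length and window size are placeholders you would take from your own reference. The merge step assumes that every per-window output begins with the same single header line. This thread does not establish whether a read spanning a window boundary is reported in more than one window, so check the merged table for duplicates.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of nanopolish.

# Print fixed-size windows "contig:start-end" for one contig.
# Arguments: contig name, contig length, window size.
# Adjacent windows share an endpoint, as in the chrA example above.
make_windows() {
    local contig=$1 length=$2 size=$3 start=0 end
    while [ "$start" -lt "$length" ]; do
        end=$((start + size))
        # Clip the last window to the contig length.
        if [ "$end" -gt "$length" ]; then
            end=$length
        fi
        printf '%s:%d-%d\n' "$contig" "$start" "$end"
        start=$end
    done
}

# Concatenate per-window tables, keeping only the first header line.
# Assumption: each input file starts with the same one-line header.
merge_tsv() {
    awk 'FNR == 1 && NR != 1 { next } { print }' "$@"
}

# Example usage (commented out so that sourcing this file runs nothing):
#
# i=0
# while read -r region; do
#     nanopolish eventalign --reads all.fastq --bam all.bam \
#         --genome genome.fa "$region" > "window_$i.tsv" &
#     i=$((i + 1))
# done < <(make_windows chrA 2000000 1000000)
# wait
# merge_tsv window_*.tsv > all.tsv
```

The usage loop starts every window at once; for many windows you would probably want to cap the number of concurrent jobs.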

Thanks a lot for your prompt support. The coordinate option hint will be a real life (time) saver :-)

mmiladi avatar Apr 25 '20 19:04 mmiladi

Hi @jts ,

I have stumbled over the expected input of the eventalign range option. In some cases the output tsv is empty and no error is reported:

nanopolish eventalign --reads seq.fastq.gz --bam align.bam --genome ref.fa --samples --print-read-names --scale-events chr:21000-22000

[bam process] iterating over region:chr:21000-22000

[post-run summary] total reads: 17556, unparseable: 0, qc fail: 2, could not calibrate: 0, no alignment: 1, bad fast5: 0

Here, I have spliced reads whose 5' end lies upstream of position 21000, but all the reads fully cover the range 21000-22000. It seems (though I am not sure) that I only get the aligned events if the start of the range covers the 5' end of the read. Is this the expected behavior? Is there a way to parallelize over a region for all the reads that have (partial or complete) bases aligned to it?

Best,
-M

mmiladi avatar May 07 '20 11:05 mmiladi