Preprocessing reads is painful.
The preprocessing reads step is taking a very long time for large runs.
preprocessing reads: 1%|1 | 96/14384 [04:35<13:09:22, 3.31s/ fast5s]
In part this might be our HPC file system, but is there anything that can be accelerated here?
Just bumping this again - after another crash where it's taken bonito nearly a day to just preprocess the reads...
The preprocessing step is required to construct the read group header in the SAM/BAM (making this always true would be a quick hack to skip it). The work is already parallelized across multiple fast5 files with multiprocessing. You could try increasing the number of processes, but I suspect you might be limited by your file system. Can you benchmark a local disk vs your NFS? For reference, preprocessing ~10,000 fast5s on our NFS takes a couple of hours.
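A rough way to run that comparison without involving bonito at all is to time plain file reads on each file system. The sketch below is only an illustration: the function name `time_raw_reads`, the `*.fast5` glob, and the 50-file sample size are assumptions, and it measures raw read throughput rather than the HDF5 parsing bonito does during preprocessing.

```python
import sys
import time
from pathlib import Path


def time_raw_reads(directory, pattern="*.fast5", limit=50):
    """Read up to `limit` files matching `pattern` under `directory`.

    Returns (number of files, total bytes read, elapsed seconds).
    Hypothetical helper: it only times raw byte reads, not fast5 parsing.
    """
    files = sorted(Path(directory).rglob(pattern))[:limit]
    total_bytes = 0
    start = time.perf_counter()
    for path in files:
        total_bytes += len(path.read_bytes())
    elapsed = time.perf_counter() - start
    return len(files), total_bytes, elapsed


if __name__ == "__main__":
    # Pass one or more directories, e.g. a local scratch copy and the NFS path.
    for directory in sys.argv[1:]:
        n, size, secs = time_raw_reads(directory)
        per_file = secs / n if n else float("nan")
        print(f"{directory}: {n} files, {size / 1e6:.1f} MB, "
              f"{secs:.2f}s ({per_file:.3f}s/file)")
```

Run it on a sample of files that has not been read recently, since the OS page cache can make a second pass over the same files look much faster than the file system really is.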
Do you have a traceback for the crash?
So it's the same error as before:
Traceback (most recent call last):
File "/usr/lib/python3.8/", line 932, in _bootstrap_inner
File "/usr/local/lib/python3.8/dist-packages/ont_bonito_cuda11.3.0-0.5.1-py3.8.egg/bonito/", line 110, in run
for item in self.iterator:
File "/usr/local/lib/python3.8/dist-packages/ont_bonito_cuda11.3.0-0.5.1-py3.8.egg/bonito/crf/", line 67, in
Unfortunately I can't compare the speed on a different system, as this is the only one available to us on which we can run bonito at present.
Just to update - I tried calling the same read set without using remora models and it crashes in the same way.
Preprocessing/read group construction speed is solved by the Pod5 design.