
Hard-coded length limits on `createCUDABatchAligner` causing poor performance

Open · SamStudio8 opened this issue 4 years ago · 1 comment

I've been trying to polish one of our mock community datasets with racon-gpu, but am seeing slow performance during the overlap alignment phase.

[Screenshot from 2019-10-23 09-06-45]

I can see many alignments are not being run on the GPU, but on the CPU instead. Admittedly, the slow performance was exacerbated by the use of only four CPU cores. I've had a little look around the code and, as I understand it, an alignment can be prevented from running on the GPU under two conditions.

I see there is also an error mode for `exceeded_max_alignment_difference`, but I can't seem to find a case where it is actually raised by `CUDAAligner`.

I've checked the stats on the reads I am assembling and polishing with, and the N50 is 28.3 Kbp (nice one @joshquick), so I'm thinking perhaps our longest reads are getting thrown off the GPU and left to run on the CPU afterwards.

I've found where the `CUDABatchAligner` is initialised and see it has hard-coded limits of 15000 for both the maximum query and maximum target lengths. Is this limit there for performance reasons, or would it be possible to allow users to set these limits themselves? Does the choice here affect the memory allocation on the GPU later? Ideally we'd want to raise it to at least 25 Kbp, if not 50 Kbp.
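For concreteness, the kind of knob I have in mind would look something like the sketch below. None of the names here (the `--cuda-max-aln-length` option, the `CudaAlignerLimits` struct, `parse_limits`) exist in racon-gpu; they are just placeholders for threading a user-supplied value through to wherever the 15000 constants currently live.

```cpp
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <iostream>

// Hypothetical container for the limits that are currently hard-coded.
struct CudaAlignerLimits {
    std::uint32_t max_query_length  = 15000;  // current hard-coded value
    std::uint32_t max_target_length = 15000;  // current hard-coded value
};

// Parse a hypothetical --cuda-max-aln-length option; everything else would
// be left to racon's existing argument handling.
CudaAlignerLimits parse_limits(int argc, char** argv) {
    CudaAlignerLimits limits;
    for (int i = 1; i + 1 < argc; ++i) {
        if (std::strcmp(argv[i], "--cuda-max-aln-length") == 0) {
            const auto value =
                static_cast<std::uint32_t>(std::strtoul(argv[i + 1], nullptr, 10));
            limits.max_query_length  = value;
            limits.max_target_length = value;
        }
    }
    return limits;
}

int main(int argc, char** argv) {
    const CudaAlignerLimits limits = parse_limits(argc, argv);
    // The real change would forward these values to createCUDABatchAligner()
    // in place of the 15000 constants; here we just print them.
    std::cout << "max query length:  " << limits.max_query_length << '\n'
              << "max target length: " << limits.max_target_length << '\n';
    return 0;
}
```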

Just to check I was on the right track, I filtered reads longer than 15 Kbp out of this data set and ran the polishing again; there is now very little time spent aligning overlaps on the CPU. Though I'm not entirely sure whether this is just because all reads are <= 15 Kbp, or because there are simply fewer reads.
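The filtering step was conceptually just dropping any record whose sequence exceeds 15 Kbp; a minimal sketch of that, assuming plain 4-line FASTQ records on stdin (gzipped or multi-line input would need a proper parser):

```cpp
#include <cstddef>
#include <iostream>
#include <string>

// Drop any FASTQ record whose sequence is longer than 15 kbp.
int main() {
    const std::size_t max_len = 15000;
    std::string header, seq, plus, qual;
    while (std::getline(std::cin, header) && std::getline(std::cin, seq) &&
           std::getline(std::cin, plus) && std::getline(std::cin, qual)) {
        if (seq.size() <= max_len) {
            std::cout << header << '\n' << seq << '\n'
                      << plus << '\n' << qual << '\n';
        }
    }
    return 0;
}
```

Run as something like `./length_filter < reads.fastq > reads.max15k.fastq`, with the file names obviously just illustrative.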

[Screenshot from 2019-10-23 10-27-32]

SamStudio8 · Oct 23 '19 09:10

I thought I would try raising the limit myself, but the memory requirement appears to grow linearly with it, meaning fewer batches can be run. This ends up taking much more GPU time overall, and presumably wastes a lot of memory in cases where the overlaps assigned to a batch are much shorter than the maximum allowed length. I wonder if there would be any point in having batches of different sizes and binning the overlaps into them, or ordering the overlaps by size and creating/destroying increasingly larger batches.
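To make the binning idea a bit more concrete, a rough sketch (none of this is racon code; `OverlapLengths`, `bin_overlaps_by_length` and the 5 Kbp bin width are made up): group each overlap by the longer of its query/target lengths, so that each batch could be allocated for a much tighter maximum than the longest read in the whole run.

```cpp
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <map>
#include <vector>

// Made-up stand-in for whatever per-overlap length information racon has.
struct OverlapLengths {
    std::uint32_t query_length;
    std::uint32_t target_length;
};

// Bucket overlaps into bins whose upper bounds step by bin_width bp; each bin
// could then back a CUDA batch allocated for that bin's maximum length only.
std::map<std::uint32_t, std::vector<OverlapLengths>>
bin_overlaps_by_length(const std::vector<OverlapLengths>& overlaps,
                       std::uint32_t bin_width = 5000) {
    std::map<std::uint32_t, std::vector<OverlapLengths>> bins;
    for (const auto& o : overlaps) {
        const std::uint32_t longest = std::max(o.query_length, o.target_length);
        // Round up to the next bin boundary, e.g. a 12 kbp overlap lands in the 15000 bin.
        const std::uint32_t bin = ((longest + bin_width - 1) / bin_width) * bin_width;
        bins[bin].push_back(o);
    }
    return bins;
}

int main() {
    const std::vector<OverlapLengths> overlaps = {
        {8000, 9000}, {14000, 12000}, {27000, 30000}};
    for (const auto& [bin, group] : bin_overlaps_by_length(overlaps)) {
        std::cout << "bin <= " << bin << " bp: " << group.size() << " overlap(s)\n";
    }
    return 0;
}
```

Each bin would then back its own batch, sized for that bin's upper bound, at the cost of a bit more scheduling complexity.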

SamStudio8 · Oct 23 '19 11:10