
extend splitHaplotype with fastq output option

Open ASLeonard opened this issue 1 year ago • 1 comments

Fairly straightforward changes that let splitHaplotype take a -fastq flag and print out triobinned fastq reads. Canu may not use the quality values in assembly, but since it is the primary triobinning program, this lets users bin fastq reads directly instead of going through a slower "triobin fasta -> get read IDs -> extract fastq" process.

I tested this on my data for binning both fasta (normal) and fastq (with -fastq), and both appear to work correctly.

I'm not sure about the memory implications of storing the quality values. One option is to uncomment the condition on the first of these lines:

//if (g->_fastqOutput)
s->_quals[rr].set((const char*)seq.quals(), seq.length());

so that the quals are loaded only when the output is fastq. But if the memory is already initialised at _quals = new simpleString [_maxReads]; then this may not save much.
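To make the trade-off concrete, here is a minimal standalone sketch of "copy quals only when fastq output is requested". It does not use Canu's types: ReadBatch, addRead and formatRead are hypothetical names, and std::string stands in for simpleString. Treat it as an illustration of the idea rather than the code in this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a batch of loaded reads (not Canu's actual structures).
struct ReadBatch {
  std::vector<std::string> names;
  std::vector<std::string> bases;
  std::vector<std::string> quals;  // one slot per read, left empty unless fastq output is on
};

// Store one read; the quality string is copied only when it will be written out.
void addRead(ReadBatch &b, bool fastqOutput,
             const std::string &name,
             const std::string &bases,
             const std::string &quals) {
  b.names.push_back(name);
  b.bases.push_back(bases);
  b.quals.push_back(fastqOutput ? quals : std::string());
}

// Format read i as a four-line fastq record, or as a two-line fasta record.
std::string formatRead(const ReadBatch &b, std::size_t i, bool fastqOutput) {
  if (fastqOutput) {
    // A fastq record needs exactly one quality character per base.
    assert(b.bases[i].size() == b.quals[i].size());
    return "@" + b.names[i] + "\n" + b.bases[i] + "\n+\n" + b.quals[i] + "\n";
  }
  return ">" + b.names[i] + "\n" + b.bases[i] + "\n";
}
```

In this sketch every read still gets a (possibly empty) string object in the quals vector, so skipping the copy saves the per-read quality bytes but not the per-slot overhead. That is analogous to the concern above about the array being allocated up front regardless.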

Also, I reused the simpleString structure, which required casting between unsigned and signed char, but this shouldn't be problematic.
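As a quick sanity check on why the cast should be harmless, the sketch below round-trips unsigned bytes through a char-based container. The helpers storeQuals and loadQuals are hypothetical, and std::string is used in place of simpleString; the only assumption is that the quality values arrive as a pointer to unsigned char plus a length, as the cast in the snippet above suggests.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Copy unsigned quality bytes into a char-based string (hypothetical helper).
std::string storeQuals(const unsigned char *quals, std::size_t length) {
  // Reinterpreting unsigned char as char changes only how each byte is read, not its bits.
  return std::string(reinterpret_cast<const char *>(quals), length);
}

// View the stored bytes as unsigned char again (hypothetical helper).
const unsigned char *loadQuals(const std::string &stored) {
  return reinterpret_cast<const unsigned char *>(stored.data());
}
```

Because the copy is length-based rather than NUL-terminated, values above 127 and zero bytes survive the round trip unchanged, so it does not matter here whether the qualities are raw scores or ASCII-encoded characters.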

ASLeonard, Jul 20 '22 11:07

I also extended this to handle seq.flags() (which is already so beautifully accessible). This makes it possible to extract fastq from uBAMs with special SAM tags, triobin the reads, and then re-align them with the special SAM tags carried over. However, this is a less common use case, so I won't include it here without discussion.

ASLeonard, Jul 25 '22 08:07