Support for Illumina ORA format

Open marchoeppner opened this issue 2 years ago • 3 comments

Description of feature

Hi,

Illumina has introduced a new read compression format, ORA: https://www.illumina.com/science/genomics-research/articles/design-ora-lossless-genomic-compression.html

ORA compresses human read data by 80% compared to traditional fastq.gz. I suspect it will become a commonly used option for data coming off the upcoming NovaSeq X and NextSeq 1500 instruments, which support ORA compression on board.

ORA is lossless and can be converted, or better yet streamed, into fastq.gz. This requires a reference and a small command-line utility; see: https://emea.support.illumina.com/sequencing/sequencing_software/DRAGENORA.html

For example, to stream ORA-compressed paired-end read data to bwa, you could do:

bwa mem humanref.fasta <(orad file.fastq.ora -c --raw --ora-reference /path/to/ora-reference ) > resu.sam
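
For tools that cannot read from a stream, the same orad call could presumably be piped through gzip to produce a conventional fastq.gz. This is only a minimal sketch: it reuses the orad flags from the command above and nothing else, and the function name and output file name are made up for illustration.

```shell
# Sketch: decompress one ORA file into a conventional fastq.gz.
# Uses only the orad flags shown in the example above.
# Assumes orad and gzip are on PATH; all file names are placeholders.
ora_to_fastq_gz() {
    ora_file="$1"   # e.g. file.fastq.ora
    ora_ref="$2"    # e.g. /path/to/ora-reference
    out_file="$3"   # e.g. file.fastq.gz (hypothetical output name)
    # orad writes the decompressed reads to stdout; gzip recompresses them
    orad "$ora_file" -c --raw --ora-reference "$ora_ref" | gzip -c > "$out_file"
}
```

Usage would look like `ora_to_fastq_gz file.fastq.ora /path/to/ora-reference file.fastq.gz`. Note that in a plain pipe a failing orad goes unnoticed behind gzip unless something like bash's `set -o pipefail` is enabled.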

Would be nice to see support for this make it into Sarek.

marchoeppner, Jan 11 '23 12:01

Looks interesting. Sounds easy enough to do. I'd like to see that in nf-core/modules first before adding it to Sarek. But we will definitely have a look.

maxulysse, Jan 11 '23 12:01

I see a problem with the ORA reference genome: you'll have to know exactly which one was used and have access to it. (This is basically the same issue as with CRAMs, except that with those we supply the reference genome, so in the context of the pipeline that's not an issue.) Presumably Illumina uses specific versions for the common model organisms and provides a source from which to download them. Either we need code in the pipeline that handles the download, or we use a parameter and make the user do it, or we even add it to iGenomes.
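
The "use a parameter and make the user do it" option could start with a simple early check, roughly as sketched below. The function name and the idea of passing the reference path as a second argument are assumptions for illustration, not an existing Sarek option.

```shell
# Sketch: fail early when an ORA input has no usable reference path.
# check_ora_input is a hypothetical helper, not part of Sarek or orad.
check_ora_input() {
    reads_file="$1"        # input reads, e.g. file.fastq.ora or file.fastq.gz
    ora_ref_path="${2:-}"  # user-supplied ORA reference path (may be empty)
    case "$reads_file" in
        *.ora)
            # ORA files cannot be decompressed without the matching reference
            if [ -z "$ora_ref_path" ] || [ ! -e "$ora_ref_path" ]; then
                echo "ORA input $reads_file needs an existing ORA reference path" >&2
                return 1
            fi
            ;;
    esac
    return 0
}
```

This only checks that a path was given and exists; it cannot tell whether it is the same reference that was used for compression, which is the harder part of the problem.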

tdanhorn, May 07 '24 21:05