sarek
Support for Illumina ORA format
Description of feature
Hi,
Illumina has introduced a new read compression format, ORA: https://www.illumina.com/science/genomics-research/articles/design-ora-lossless-genomic-compression.html
ORA compresses human read data by 80% compared to traditional fastq.gz. I suspect it will become a commonly used option for data coming off the upcoming NovaSeq X and NextSeq 1500 instruments, which have on-board support for ORA compression.
ORA is lossless and can be converted, or better yet streamed, into fastq.gz. This requires a reference and a small command-line utility; see: https://emea.support.illumina.com/sequencing/sequencing_software/DRAGENORA.html
For example, to stream ORA-compressed paired-end read data to bwa, you could do:
bwa mem humanref.fasta <(orad file.fastq.ora -c --raw --ora-reference /path/to/ora-reference) > resu.sam
Would be nice to see support for this make it into Sarek.
Looks interesting, and it sounds easy enough to do. I'd like to see that in nf-core/modules first before adding it to Sarek, but we will definitely have a look.
I see the problem with the ORA reference genome: you have to know exactly which one was used and have access to it. (This is basically the same issue as with CRAMs, except that with those we supply the reference genome ourselves, so in the context of the pipeline that's not an issue.) Presumably Illumina uses specific versions for the common model organisms and provides a source to download them from. Either we need code in the pipeline that handles the download, or we add a parameter and make the user supply the reference, or we even add it to iGenomes.
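As a rough sketch of the "parameter" option, something like the shell helper below could pick the right decompression command per input file. Everything named here is hypothetical: `fastq_stream_cmd` and the `ORA_REFERENCE` variable are made up for illustration and are not existing Sarek parameters or nf-core modules. The `orad` flags are just the ones from the example command earlier in this issue, so treat them as assumptions to check against the tool's own help.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the command that would stream a read file as plain FASTQ.
# ORA_REFERENCE is an assumed, user-supplied variable pointing at the ORA reference.
fastq_stream_cmd() {
    local reads="$1"
    case "$reads" in
        *.ora)
            # ORA input can only be decoded with the reference it was compressed against,
            # so fail early if the user did not provide one.
            if [ -z "${ORA_REFERENCE:-}" ]; then
                echo "error: ORA_REFERENCE must be set to read $reads" >&2
                return 1
            fi
            # Same orad flags as in the bwa example in this issue.
            printf 'orad %s -c --raw --ora-reference %s\n' "$reads" "$ORA_REFERENCE"
            ;;
        *.gz)
            # Ordinary gzip-compressed FASTQ.
            printf 'gzip -dc %s\n' "$reads"
            ;;
        *)
            # Assume anything else is already uncompressed FASTQ.
            printf 'cat %s\n' "$reads"
            ;;
    esac
}
```

The printed command could then go inside a process substitution, as in the bwa example above. Note that this sketch does not quote paths, so it would need hardening for file names containing spaces.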