cipher-workflow-platform
resample
Hi,
I found that the subsampling step is quite slow. I tried using reformat.sh to subsample 155M paired-end (PE) reads down to just 100 PE reads, and it took more than 15 minutes on my local machine. I guess the function scans the entire file in order to keep representative sequences. But is that really necessary? Couldn't we just take the first N reads of the file?
I have replaced this function with the very simple script below, which just takes the first N reads and runs in seconds.
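// Each FASTQ record spans 4 lines, so N reads = 4 * N lines
lineNb = Math.round(params.subsampled_reads * 4)
# Keep the first lineNb lines of each mate file, then recompress
gunzip -c ${read1} | head -n ${lineNb} > ${id}_R1.subsampled.fq
gzip ${id}_R1.subsampled.fq
gunzip -c ${read2} | head -n ${lineNb} > ${id}_R2.subsampled.fq
gzip ${id}_R2.subsampled.fq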
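For anyone who wants to try the same idea outside the workflow, here is a minimal standalone sketch. The function name `first_n_reads` and the file names in the example are hypothetical, and it assumes standard 4-line FASTQ records (no wrapped sequence lines). It pipes straight into gzip, so no uncompressed intermediate file is written.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the pipeline): keep the first N reads
# of a gzipped FASTQ file. Assumes each record is exactly 4 lines.
first_n_reads() {
    local input="$1" n_reads="$2" output="$3"
    local n_lines=$(( n_reads * 4 ))
    # head exits after n_lines, so the rest of the input is never decompressed.
    gunzip -c "$input" | head -n "$n_lines" | gzip > "$output"
}

# Example with hypothetical file names:
# first_n_reads sample_R1.fq.gz 100 sample_R1.subsampled.fq.gz
# first_n_reads sample_R2.fq.gz 100 sample_R2.subsampled.fq.gz
```

Because both mate files are cut at the same record count, the pairs stay in sync as long as R1 and R2 list their reads in the same order. Note that the first N reads are not a random sample of the file, and under `set -o pipefail` the early exit of `head` can make `gunzip` report a broken pipe.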
Best, Jerome