
resample

Open jsalignon opened this issue 7 years ago • 0 comments

Hi, I found that the subsample process is quite slow. I tried using reformat.sh to subsample 155M PE reads down to just 100 PE reads, and it took more than 15 minutes on my local machine. I guess the function scans the entire file to keep representative sequences. But is that really necessary? Couldn't we just take the first N reads of the file? I have replaced this function with the very simple script below, which runs in seconds and just takes the first N reads.

lineNb = Math.round(params.subsampled_reads * 4)

gunzip -c ${read1} | head -${lineNb} > ${id}_R1.subsampled.fq
gzip ${id}_R1.subsampled.fq
gunzip -c ${read2} | head -${lineNb} > ${id}_R2.subsampled.fq
gzip ${id}_R2.subsampled.fq
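
For reference, the same idea can be sketched as a standalone bash function that streams straight into gzip, so no uncompressed intermediate file is written. This is only a rough sketch: the function name `subsample_head` and the file names in the usage line are made up for illustration, and it assumes standard 4-line FASTQ records with R1 and R2 listing reads in the same order.

```shell
#!/usr/bin/env bash
# Sketch: keep the first N reads of a gzipped FASTQ file.
# Assumption: every record is exactly 4 lines (header, sequence, '+', qualities).
# subsample_head is a hypothetical helper name, not part of the pipeline.

subsample_head() {
    local n_reads=$1 in_fq=$2 out_fq=$3
    local n_lines=$(( n_reads * 4 ))   # 4 lines per FASTQ record

    # Decompress, keep the first n_lines lines, recompress on the fly.
    # head exits as soon as it has enough lines, so on large inputs gunzip
    # may be stopped by SIGPIPE; with `set -o pipefail` that would show up
    # as a non-zero exit status for the pipeline.
    gunzip -c "$in_fq" | head -n "$n_lines" | gzip > "$out_fq"
}
```

It would be called once per mate with the same N, e.g. `subsample_head 100 sample_R1.fq.gz sample_R1.subsampled.fq.gz` and then the same for R2, so the pairs stay in sync.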

Best, Jerome

jsalignon, Jan 22 '18 21:01