
Request: Add seqtk shuffle command to randomise order of reads

Open: peterjc opened this issue 5 years ago • 1 comment

I have been creating mock community samples by running seqtk sample on several single-isolate inputs, something like this:

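# Start from empty output files, since the loop below appends to them
rm -f tempR1.fastq tempR2.fastq
for sample in A B C; do
    # Using the same seed for R1 and R2 keeps the sampled reads paired
    seqtk sample -s 123 "input${sample}_R1.fastq.gz" 10000 >> tempR1.fastq
    seqtk sample -s 123 "input${sample}_R2.fastq.gz" 10000 >> tempR2.fastq
done
gzip tempR1.fastq
gzip tempR2.fastq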

In this example, my combined FASTQ files will contain all the reads from sample A, then sample B, and finally sample C, and this ordering may introduce biases in the downstream analysis.

What I would like to do is finish with something like this:

seqtk shuffle -s 123 tempR1.fastq | gzip > mixed_R1.fastq.gz
seqtk shuffle -s 123 tempR2.fastq | gzip > mixed_R2.fastq.gz

Here I am assuming -s would set the random number seed, as it does in seqtk sample, so that both R1 and R2 are randomised in the same way and the output remains correctly paired.
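
To make the requested behaviour concrete, here is a minimal sketch of a seeded shuffle in plain shell. It is not part of seqtk: the function name shuffle_fastq is hypothetical, and it assumes strict 4-line FASTQ records with no tab characters inside a record. Each record is flattened onto one line, tagged with a random key from awk seeded with the given value, and stably sorted on that key. Because two files with the same number of records receive the same key sequence, they are permuted identically and stay paired. The exact order depends on the awk implementation, so the same awk should be used for R1 and R2.

```shell
# Hypothetical helper (not a seqtk command): seeded shuffle of a FASTQ file.
# Assumes 4 lines per record and no tab characters inside any record.
shuffle_fastq() {
    seed=$1
    fastq=$2
    tab=$(printf '\t')
    # 1. Join each 4-line record into a single tab-separated line.
    # 2. Prefix every record with a pseudo-random key from a seeded generator.
    # 3. Stable sort on the key only, so ties keep their input order.
    # 4. Drop the key and restore the 4-line layout.
    paste - - - - < "$fastq" |
        awk -v seed="$seed" 'BEGIN { srand(seed) } { printf "%.9f\t%s\n", rand(), $0 }' |
        LC_ALL=C sort -s -t "$tab" -k1,1n |
        cut -f2- |
        tr '\t' '\n'
}
```

With this defined, shuffle_fastq 123 tempR1.fastq | gzip > mixed_R1.fastq.gz and the matching call on tempR2.fastq would mirror the commands proposed above, run on the uncompressed temporary files.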

peterjc, Mar 22 '19 10:03

@peterjc Until this is implemented, you can use seqkit shuffle

tseemann, Oct 18 '19 05:10