
seqtk sample not working as expected

Open antoine4ucsd opened this issue 2 years ago • 0 comments

Hello, I am trying to subsample a fastq.gz file, but I am not sure it works as expected above a certain sample size.

My source file contains about 150k reads according to this count:

awk '{s++}END{print s/4}' ./BA922J_barcode16_run5_merged.fastq.gz
150626
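
One thing worth checking: the awk command above reads the compressed bytes of the .gz file directly, so the number it prints may not be the number of FASTQ records. Below is a minimal sketch of counting reads after decompression. The file demo.fastq.gz and its three reads are made up for illustration and are not the file from this issue.

```shell
# Build a tiny gzipped FASTQ (3 made-up reads) so the example is self-contained.
tmp=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nGGCA\n+\nIIII\n@r3\nTTAG\n+\nIIII\n' | gzip > "$tmp/demo.fastq.gz"

# Decompress to stdout first, then count lines and divide by 4.
reads=$(gzip -dc "$tmp/demo.fastq.gz" | awk 'END{print NR/4}')
echo "reads: $reads"
```

Running the same pipeline on the real input would show how many reads are actually available to sample from.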

But when I try to subsample:

seqtk sample -s100 BA922J_barcode16_run5_merged.fastq.gz 10000 > BA922J_10000.gz
seqtk sample -s100 BA922J_barcode16_run5_merged.fastq.gz 40000 > BA922J_40000.gz
seqtk sample -s100 BA922J_barcode16_run5_merged.fastq.gz 60000 > BA922J_60000.gz
seqtk sample -s100 BA922J_barcode16_run5_merged.fastq.gz 80000 > BA922J_80000.gz
seqtk sample -s100 BA922J_barcode16_run5_merged.fastq.gz 100000 > BA922J_100000.gz

the output file size plateaus from 40000 reads onward:

-rwxrwxrwx  1  staff    82M Jun  9 08:36 BA922J_10000.gz
-rwxrwxrwx  1  staff   314M Jun  9 08:36 BA922J_40000.gz
-rwxrwxrwx  1  staff   314M Jun  9 08:36 BA922J_60000.gz
-rwxrwxrwx  1  staff   314M Jun  9 08:36 BA922J_80000.gz
-rwxrwxrwx  1  staff   314M Jun  9 08:37 BA922J_100000.gz
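
A possible reading of these sizes, offered as a guess rather than a diagnosis: seqtk writes plain, uncompressed FASTQ to stdout, so the `>` redirects above produce text files that only carry a .gz name. A size that stops growing would also be consistent with the input holding fewer reads than requested, in which case every read is returned. The sketch below shows one way to check what an output file really contains. It uses printf as a stand-in for seqtk output, and plain.gz and real.fastq.gz are hypothetical names.

```shell
tmp=$(mktemp -d)
# Stand-in for `seqtk sample ... > out.gz`: plain FASTQ text saved under a .gz name.
printf '@r1\nACGT\n+\nIIII\n@r2\nGGCA\n+\nIIII\n' > "$tmp/plain.gz"
# The same text piped through gzip, as in `seqtk sample ... | gzip > out.fastq.gz`.
gzip < "$tmp/plain.gz" > "$tmp/real.fastq.gz"

# gzip -t exits non-zero when a file is not valid gzip data.
report=$(
  for f in "$tmp/plain.gz" "$tmp/real.fastq.gz"; do
    if gzip -t "$f" 2>/dev/null; then
      echo "$(basename "$f"): gzip, $(gzip -dc "$f" | awk 'END{print NR/4}') reads"
    else
      echo "$(basename "$f"): not gzip, $(awk 'END{print NR/4}' "$f") reads"
    fi
  done
)
echo "$report"
```

If the sampled files turn out to be plain text, comparing their read counts against the requested sample sizes would show directly where the plateau starts.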

I also need to make sure seqtk is not resampling the same reads. Can you confirm this, for example if I set the sample size to 200k?

I am not sure what I am doing wrong. Thank you!
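
On the resampling question: as far as I know, seqtk sample with an integer count draws reads without replacement, so a read should not appear twice, and asking for more reads than the file holds should return each read once. That is an assumption worth verifying on the actual output. Below is a quick check for repeated read IDs, shown on made-up data with one ID duplicated on purpose (sample.fastq is a hypothetical file name).

```shell
tmp=$(mktemp -d)
# Made-up sampled output with the read ID @r1 repeated on purpose.
printf '@r1\nACGT\n+\nIIII\n@r2\nGGCA\n+\nIIII\n@r1\nACGT\n+\nIIII\n' > "$tmp/sample.fastq"

# Header lines are every 4th line starting at line 1; keep only the ID (first field).
# uniq -d prints one line per ID that occurs more than once.
dups=$(awk 'NR % 4 == 1 {print $1}' "$tmp/sample.fastq" | sort | uniq -d | wc -l | tr -d ' ')
echo "duplicated read IDs: $dups"
```

A result of 0 on a real sampled file would indicate that no read ID was drawn more than once.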

antoine4ucsd, Jun 09 '22 16:06