seqtk
seqtk sample not working as expected
Hello, I am trying to subsample a fastq.gz file, but I am not sure it works as expected above a certain number of reads.
My source file appears to contain about 150k reads:
awk '{s++}END{print s/4}' ./BA922J_barcode16_run5_merged.fastq.gz
150626
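One thing worth double-checking here: the awk command above is given the .gz file directly, so it counts newline bytes in the compressed data, and the result may not be the true read count. A minimal sketch of counting after decompression, assuming `gzip` is available and every record spans exactly 4 lines (the function name `count_fastq_reads` is made up for illustration):

```shell
# Count reads in a gzipped FASTQ by decompressing it first.
# Assumption: 4 lines per record (no wrapped sequence or quality lines).
count_fastq_reads() {
    gzip -cd "$1" | awk 'END { print NR / 4 }'
}

# Example call on the source file from this post:
# count_fastq_reads BA922J_barcode16_run5_merged.fastq.gz
```

If this number turns out to be lower than 150626, that would be the read count to compare the sample sizes against.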
But when I try to subsample it:
seqtk sample -s100 BA922J_barcode16_run5_merged.fastq.gz 10000 > BA922J_10000.gz
seqtk sample -s100 BA922J_barcode16_run5_merged.fastq.gz 40000 > BA922J_40000.gz
seqtk sample -s100 BA922J_barcode16_run5_merged.fastq.gz 60000 > BA922J_60000.gz
seqtk sample -s100 BA922J_barcode16_run5_merged.fastq.gz 80000 > BA922J_80000.gz
seqtk sample -s100 BA922J_barcode16_run5_merged.fastq.gz 100000 > BA922J_100000.gz
then the output file size plateaus from 40000 reads onward:
-rwxrwxrwx 1 staff 82M Jun 9 08:36 BA922J_10000.gz
-rwxrwxrwx 1 staff 314M Jun 9 08:36 BA922J_40000.gz
-rwxrwxrwx 1 staff 314M Jun 9 08:36 BA922J_60000.gz
-rwxrwxrwx 1 staff 314M Jun 9 08:36 BA922J_80000.gz
-rwxrwxrwx 1 staff 314M Jun 9 08:37 BA922J_100000.gz
I also need to make sure this is not resampling the same reads. Can you confirm what happens, for example, if I set the sample size to 200k?
I am not sure what I am doing wrong... thank you!
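To check the resampling concern directly, the read names in an output file can be tested for duplicates. A rough sketch, assuming 4-line FASTQ records and that the read name is the first whitespace-separated field of each header line (the function name `duplicated_read_names` is hypothetical). As far as I know, seqtk writes uncompressed FASTQ to stdout, so the output files above would be plain text despite the .gz suffix. If they really are gzipped, decompress them first.

```shell
# Print every read name that occurs more than once in an uncompressed FASTQ.
# Assumptions: 4 lines per record; the name is the first field of the header line.
duplicated_read_names() {
    awk 'NR % 4 == 1 { print $1 }' "$1" | sort | uniq -d
}

# Example call on one of the outputs from this post:
# duplicated_read_names BA922J_100000.gz
```

No output means no read name was written twice. Any name that is printed appears more than once in that file.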