[Feature suggestion] Downsample sequences to a certain number of total bases based on sequence length or sequence quality

Open thallinger opened this issue 3 years ago • 1 comments

Hi @shenwei356, thanks a lot for providing this awesome Swiss Army knife for sequence file manipulation.

I would like to suggest a feature (actually two) to downsample a sequence file to a certain number of total bases in the file based on either the sequence length or the average base quality of the read:

 seqkit seq --qual-bases 100000000
 seqkit seq --length-bases 100000000

Basically, the sequences should be sorted in decreasing order by either sequence length or average sequence quality, and the top sequences whose lengths add up to the given number of bases should be extracted. This would allow retaining the longest reads, or the reads with the best quality, that together yield the given number of bases.
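To make the request concrete, here is a minimal Python sketch of the selection logic described above. It is not SeqKit code: `Read`, `mean_quality`, and `top_reads_by_bases` are hypothetical names used only for illustration. It assumes Phred+33 quality strings, takes "average quality" to mean the arithmetic mean of the Phred scores, and assumes that the read crossing the target is still kept, so the total may slightly exceed the requested number of bases.

```python
from typing import Callable, List, NamedTuple


class Read(NamedTuple):
    """Hypothetical in-memory FASTQ record (not a SeqKit type)."""
    name: str
    seq: str
    qual: str  # quality string, assumed Phred+33, same length as seq


def mean_quality(read: Read) -> float:
    """Arithmetic mean of the Phred scores (assumption: Phred+33 encoding)."""
    if not read.qual:
        return 0.0
    return sum(ord(c) - 33 for c in read.qual) / len(read.qual)


def top_reads_by_bases(
    reads: List[Read],
    target_bases: int,
    key: Callable[[Read], float],
) -> List[Read]:
    """Rank reads by `key` in decreasing order and keep the top ones
    until their summed lengths reach `target_bases`."""
    ranked = sorted(reads, key=key, reverse=True)
    kept: List[Read] = []
    total = 0
    for read in ranked:
        if total >= target_bases:
            break  # target reached, drop the remaining lower-ranked reads
        kept.append(read)
        total += len(read.seq)
    return kept


reads = [
    Read("a", "ACGT", "IIII"),          # length 4, mean quality 40
    Read("b", "ACGTACGT", "!!!!!!!!"),  # length 8, mean quality 0
    Read("c", "AC", "55"),              # length 2, mean quality 20
]

# Equivalent of the proposed --length-bases 10
longest = top_reads_by_bases(reads, 10, key=lambda r: len(r.seq))
# Equivalent of the proposed --qual-bases 5
best = top_reads_by_bases(reads, 5, key=mean_quality)
```

With these three toy reads, `longest` holds reads `b` and `a` (12 bases) and `best` holds reads `a` and `c` (6 bases). Note that sorting requires all reads to be held in memory, which may matter for large files.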

Jun 30 '22 15:06 thallinger

> This would allow retaining the longest reads/the reads with the best quality yielding the given number of bases.

I'm not sure if this is reasonable.

> downsample a sequence file to a certain number of total bases in the file based on either the sequence length

It could be added to seqkit sample.

> or the average base quality of the read:

Let other QC tools do this.

Jul 11 '22 11:07 shenwei356