[Feature suggestion] Downsample sequences to a certain number of total bases based on sequence length or sequence quality
Hi @shenwei356, thanks a lot for providing this awesome Swiss Army knife for sequence file manipulation.
I would like to suggest a feature (actually two) to downsample a sequence file to a certain number of total bases in the file based on either the sequence length or the average base quality of the read:
seqkit seq --qual-bases 100000000
seqkit seq --length-bases 100000000
Basically, the sequences should be sorted by either sequence length or average sequence quality in decreasing order, and the top sequences whose lengths add up to the given number of bases should be extracted. This would make it possible to retain the longest reads/the reads with the best quality yielding the given number of bases.
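To make the request concrete, here is a minimal Python sketch of the selection logic described above. It is not seqkit code, and the function names (`mean_phred`, `top_reads_by_bases`) are made up for illustration. It assumes reads are already parsed into `(name, sequence, quality)` tuples, that qualities are Phred+33 encoded, that "average quality" means the arithmetic mean of the Phred scores, and that the last selected read may push the total slightly over the target.

```python
from typing import Callable, Iterable, List, Tuple

# A read as (name, sequence, quality string); a hypothetical in-memory layout.
Read = Tuple[str, str, str]


def mean_phred(qual: str, offset: int = 33) -> float:
    """Arithmetic mean of Phred scores, assuming Phred+33 encoding."""
    if not qual:
        return 0.0
    return sum(ord(c) - offset for c in qual) / len(qual)


def top_reads_by_bases(
    reads: Iterable[Read],
    target_bases: int,
    key: Callable[[Read], float],
) -> List[Read]:
    """Sort reads by `key` in decreasing order and keep the top reads
    until their summed length reaches `target_bases`.

    Assumption: the final read is kept whole, so the total can exceed
    the target by less than one read length.
    """
    selected: List[Read] = []
    total = 0
    for read in sorted(reads, key=key, reverse=True):
        if total >= target_bases:
            break
        selected.append(read)
        total += len(read[1])
    return selected


def by_length(read: Read) -> float:
    """Ranking key for the length-based variant."""
    return len(read[1])


def by_quality(read: Read) -> float:
    """Ranking key for the quality-based variant."""
    return mean_phred(read[2])
```

Calling `top_reads_by_bases(reads, 100000000, by_length)` would correspond to the proposed `--length-bases` option, and passing `by_quality` to the proposed `--qual-bases` option. Note that this needs all reads in memory (or at least their lengths and qualities) before anything can be written out, since the ranking is global.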
> This would make it possible to retain the longest reads/the reads with the best quality yielding the given number of bases.
I'm not sure if this is reasonable.
> downsample a sequence file to a certain number of total bases in the file based on either the sequence length
It could be added to seqkit sample.
> or the average base quality of the read:
Let other QC tools do this.