Option for minimum cluster size to output

Open A-N-Other opened this issue 6 years ago • 8 comments

I've been using vsearch a lot recently to group sequences with --cluster_fast. In my specific case, I end up with many singletons (tens of thousands at a time) that are output in the same way as actual clusters. Apart from drastically increasing downstream processing time and effort (for example, globbing regularly exceeds the maximum command-line length), I've also exceeded various inode limits imposed on our cluster.

May I suggest that a useful additional feature could be a --min_cluster_size flag that prevents small clusters being output?

Aug 09 '17 13:08 A-N-Other

Yes, the command --cluster_fast with the option --clusters can create a lot of files, and there is presently no way to filter out singletons. Adding a filtering option would make sense, indeed.

Aug 09 '17 14:08 frederic-mahe

I guess you can use dereplication with option --minuniquesize after your clustering, no?

Aug 21 '17 11:08 fgvieira

At the moment I just loop over the FASTA files produced with --clusters, count the records in each with grep -c "^>", and discard the files that fall under a certain threshold.
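A minimal sketch of that loop, for readers who want to reproduce it. The function name prune_small_clusters, its arguments, and the "clusters" prefix are hypothetical; the prefix should match whatever string was passed to --clusters. It uses find with -exec ... + so that the file names are handed over in batches and never hit the command-line length limit.

```shell
# Hypothetical helper: delete cluster FASTA files with fewer than MIN records.
# Usage: prune_small_clusters DIR PREFIX MIN
prune_small_clusters() {
    pc_dir=$1; pc_prefix=$2; pc_min=$3
    # find batches the file names, so tens of thousands of files are fine.
    find "$pc_dir" -maxdepth 1 -type f -name "${pc_prefix}*" -exec sh -c '
        min=$1; shift
        for file in "$@"; do
            # Each FASTA record starts with ">"; grep -c prints 0 if none match.
            count=$(grep -c "^>" "$file" || true)
            if [ "$count" -lt "$min" ]; then
                rm -- "$file"
            fi
        done
    ' sh "$pc_min" {} +
}
```

For example, prune_small_clusters . clusters 2 would remove only the singleton files in the current directory. Note that this still requires vsearch to write every file first, so it does not help with the inode limits.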

Unless I'm missing something (quite possible!), I can't see how I'd integrate either of the --derep_ options into this workflow - the sequences I'm clustering aren't identical, so they'd all end up being discarded, no?

Assuming you'd be doing the --derep_ on the output of the initial run, then both this and my method still require writing ~30-60k files to disk unnecessarily. If I'm running multiple vsearch instances at a time with different options, then this really adds up!

Aug 21 '17 12:08 A-N-Other

I was thinking of running it after the initial run and piping between the two vsearch commands. Something like:

vsearch --cluster_fast input.fas --sizeout --consout - | vsearch --derep_fulllength - --minuniquesize 10 --output test.out

After that, you can split test.out into single FASTA files.
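The splitting step could look roughly like the sketch below. The function name split_fasta and the recordN.fasta naming scheme are made up for illustration; nothing here is part of vsearch.

```shell
# Hypothetical helper: write each record of a multi-FASTA file to its own file.
# Usage: split_fasta INPUT.fasta OUTDIR
split_fasta() {
    sf_in=$1; sf_outdir=$2
    mkdir -p "$sf_outdir"
    awk -v dir="$sf_outdir" '
        /^>/ {
            # Close the previous file so awk never holds too many open at once.
            if (out != "") close(out)
            n++
            out = dir "/record" n ".fasta"
        }
        # Header and sequence lines both go to the current record file.
        out != "" { print > out }
    ' "$sf_in"
}
```

Calling split_fasta test.out split_out would create split_out/record1.fasta, split_out/record2.fasta, and so on, one file per sequence that passed the --minuniquesize filter.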

Aug 21 '17 13:08 fgvieira

Ah, I see where you're going. From my perspective, I'm actively avoiding the consensus sequences from the centre star method in favour of doing it afterwards with mafft, so I need to retain the original sequences. The inputs I'm working with require a reasonable amount of effort to make accurate consensus models - centre star would work fine for clustering and general alignment, but isn't good enough with my sequences to be able to resolve ORFs reliably.

Aug 21 '17 14:08 A-N-Other

If we added an option to exclude the smallest clusters, which clustering output files should it apply to? These are the output file options: --biomout --centroids --clusters --consout --mothur_shared_out --msaout --otutabout --profile --uc.

I understand that you want it to apply to at least --clusters, but should it apply to some of the others too?

The dereplication commands have the --minuniquesize and --maxuniquesize options to limit the output from dereplication based on abundance. In that case, the limits affect only the FASTA output (--output option) from dereplication, while the UC output (--uc option) is unaffected. This is similar to USEARCH.

The sortbysize command also has analogous --minsize and --maxsize options to limit output.
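Until such an option exists, one possible workaround is to cluster with --uc only and pick out the members of sufficiently large clusters from that single file, which avoids writing one file per cluster. The sketch below assumes the usual tab-separated UC layout: field 1 is the record type (S for centroid, H for hit, C for cluster summary), field 2 is the cluster number, field 3 of a C record is the cluster size, and field 9 is the sequence label. The function name uc_members_min_size is hypothetical.

```shell
# Hypothetical helper: list "cluster<TAB>label" for clusters with >= MIN members.
# Usage: uc_members_min_size FILE.uc MIN
uc_members_min_size() {
    um_uc=$1; um_min=$2
    # The file is read twice: C records come last, so sizes are collected first.
    awk -F '\t' -v min="$um_min" '
        # Pass 1: remember the size reported by each C (cluster) record.
        NR == FNR { if ($1 == "C") size[$2] = $3; next }
        # Pass 2: print centroids (S) and hits (H) of large enough clusters.
        ($1 == "S" || $1 == "H") && (size[$2] + 0 >= min + 0) { print $2 "\t" $9 }
    ' "$um_uc" "$um_uc"
}
```

The resulting labels can then be used to pull the original, unmodified sequences out of the input FASTA file, grouped by cluster number.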

Aug 22 '17 13:08 torognes

I'd have it apply to all of those options, as I think it's equally relevant to each, but that's my personal perspective. That said, I don't know how simple this is to implement across the board.

I think there is a general case for it being more widely available, as several of these outputs become somewhat ambiguous with smaller clusters, and users will have different expectations and understandings of the output that vsearch returns - what's the consensus of 2 sequences, for example?

Aug 22 '17 14:08 A-N-Other

I know it's not ideal, but here is a solution to delete singleton cluster files quickly:

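# assumes all cluster file names start with "clusters"
# change '"$size" -eq 1' to another test (e.g. -lt 5) for a different threshold
# find is used instead of ls because "ls clusters*" can exceed the argument-length limit
find . -maxdepth 1 -type f -name 'clusters*' | parallel 'size=$(grep -c "^>" {}); if [ "$size" -eq 1 ]; then rm {}; fi'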

Jul 22 '21 05:07 santiagosnchez