cdhit
speeding up small word size
I'm trying to cluster ~1 million protein sequences at 50% identity. When I clustered at 60% identity I used n=4, and it took a few hours with 20 threads. But after reducing the word size to 3, it takes a very long time (something like 20k per day). I wanted to use n=4 for 50% as well, but that is not possible. Any suggestions on how to speed it up? Thanks!
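For scale, here is a rough back-of-the-envelope estimate of the total run time. It assumes the "20k per day" figure means sequences processed per day and that the rate stays constant over the whole run, which it may not. The function name `estimated_days` is just an illustrative helper, not part of cd-hit.

```python
def estimated_days(total_sequences: int, sequences_per_day: int) -> float:
    """Naive run-time estimate in days, assuming a constant processing rate."""
    if sequences_per_day <= 0:
        raise ValueError("sequences_per_day must be positive")
    return total_sequences / sequences_per_day


if __name__ == "__main__":
    # Assumed inputs: ~1 million sequences, ~20k sequences handled per day.
    days = estimated_days(1_000_000, 20_000)
    print(f"Estimated run time: {days:.0f} days")
```

Under those assumptions the n=3 run would need on the order of 50 days, compared with a few hours for the n=4 run at 60% identity.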