Running cd-hit with 26Gb data set is too slow

Open biolittleboy opened this issue 4 years ago • 5 comments

My nucleic acid data set is about 26Gb with 29,938,643 sequences. The software has been running for a month with the parameters '-T 85 -M 100000 -c 0.9', but it does not seem to have started clustering. Could you give me some advice on how to run cd-hit on a data set this large, or suggest other software that could replace cd-hit?

biolittleboy avatar Jul 06 '20 02:07 biolittleboy

Hi, I'm having a similar issue. Did you find a way to speed up cd-hit, or find other software that does the same thing? Thanks!!

SaiReddy-A avatar Dec 17 '21 15:12 SaiReddy-A

Same issue here. Increasing the number of CPUs didn't seem to increase the speed at all. It only processed ~10k sequences (0.1% of my data) with 30 CPUs running for a week. Please let me know how to work with data this large. Thanks

Kennyluo4 avatar Feb 21 '22 02:02 Kennyluo4

I have the same question. I use cd-hit-est with a FASTA file as input. Maybe a FASTA file is slower than an encoded database file? Have you solved this problem? Could you give me some advice? Thanks.

mintuos avatar Nov 22 '23 01:11 mintuos

Hi, I'm having a similar issue. Did you find a way to speed up cd-hit, or find other software that does the same thing? Thanks!!

WangLitt avatar Mar 22 '24 07:03 WangLitt

Hello. To be honest, I chose to use mmseqs in the end. It may be faster on big data sets. Good luck.

mintuos avatar Mar 25 '24 10:03 mintuos
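
For anyone landing here, a rough sketch of what the mmseqs route might look like. The comment above does not say which workflow or settings were used, so everything here is an assumption: the `easy-linclust` workflow, the file names (`sequences.fasta`, `clusters`, `tmp`), the 0.9 identity value (chosen only to echo `-c 0.9` from the original post; the two tools may not define identity the same way), and the thread count. The helper only builds and prints the command; it does not run anything.

```python
import shlex


def build_linclust_command(fasta, out_prefix, tmp_dir, min_seq_id=0.9, threads=30):
    """Build the argument list for an `mmseqs easy-linclust` run.

    All arguments are illustrative placeholders, not values taken
    from this thread.
    """
    return [
        "mmseqs", "easy-linclust",
        str(fasta),        # input FASTA file
        str(out_prefix),   # prefix for the output files
        str(tmp_dir),      # scratch directory for intermediate files
        "--min-seq-id", str(min_seq_id),
        "--threads", str(threads),
    ]


# Hypothetical file names; replace with your own paths.
cmd = build_linclust_command("sequences.fasta", "clusters", "tmp")

# Print the command for inspection; pass `cmd` to subprocess.run()
# once the paths and parameters have been checked.
print(shlex.join(cmd))
```

Keeping the arguments as a list (instead of one shell string) avoids quoting problems if a path contains spaces. Check `mmseqs easy-linclust -h` for the options your installed version supports before running on the full data set.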