`sourmash compare` runs out of memory on large comparisons

Open yuzie0314 opened this issue 9 months ago • 15 comments

Hi @ctb the sourmash author,

We are currently using your tool to find representative MAGs within a custom data set assembled from several deeply sequenced stool shotgun samples. These MAGs were all classified into the same family using GTDB-Tk reference genomes. We have up to 14,500 genomes in this data set, and we want to compute a pairwise ANI matrix using the following command.

sourmash compare -p 8 -k 31 --ani -o ani_matrix.numpy --csv ani_matrix.csv cluster_mash/*.sig

However, after about 1 hour of processing, we got a BrokenPipeError, so we started to wonder whether there is any limitation when using sourmash compare to generate an ANI matrix. I think this error is caused by running out of memory; correct me if I am wrong.

P.S. We are using an AWS EC2 Linux instance with 16 cores and 32 GB of RAM. We also saw the message `Killedison for index 886 done in 9.36945 seconds`, which might be another clue as to why this error happened. We are running sourmash v4.8.2.
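
As a rough sanity check on the out-of-memory guess, the size of the output matrix itself can be estimated. The sketch below assumes the result is held as a dense square matrix of 8-byte floats. That is an assumption for illustration, not something confirmed from the sourmash source, and the helper names are hypothetical.

```python
# Back-of-envelope estimate for an all-vs-all comparison of n genomes.
# Assumption: the result is a dense n x n matrix of 8-byte (float64) values.

def dense_matrix_bytes(n_genomes: int, bytes_per_value: int = 8) -> int:
    """Bytes needed to hold a dense n x n matrix."""
    return n_genomes * n_genomes * bytes_per_value


def pairwise_comparisons(n_genomes: int) -> int:
    """Number of unique unordered pairs, excluding self-comparisons."""
    return n_genomes * (n_genomes - 1) // 2


n = 14_500  # number of genomes mentioned above
matrix_gib = dense_matrix_bytes(n) / 2**30

print(f"unique pairs:       {pairwise_comparisons(n):,}")
print(f"dense matrix bytes: {dense_matrix_bytes(n):,}")
print(f"dense matrix GiB:   {matrix_gib:.2f}")
```

Under that assumption the matrix alone is well under 32 GB, so if memory is the problem, it may come from the loaded signatures or from per-process overhead with -p 8 rather than from the matrix itself.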

yuzie0314, Apr 30 '24 03:04