Linclust does not respect the memory limit at the end of clustering.
I ran linclust with a memory limit set relatively low compared to the total memory of the machine. While the linclust process is running, memory use holds steady at the expected value; however, at the end of clustering, presumably as the clusters are being collected for output, the machine OOMs.
The last few lines before the OOM are
Updating clustering... [87.458s]
Freeing memory... [1.516s]
Total clusters: <n>
Total time: <t>s
I was using diamond v2.1.8. Does diamond linclust need to collect all the results into memory before writing?
For writing the output, a table containing all accessions of the input sequences needs to be loaded into memory. It looks like this is causing the OOM. I can provide a quick patch for this if need be.
A patch for this would be great, since I don't think I have a machine at the moment that can hold all the accessions in memory at the same time.
https://github.com/bbuchfink/diamond/commit/8ec818ca160fbc26b262ae999c5c11d9c98a7e38
You can use --oid-output to write OIDs instead of accessions into the output file. These number the sequences linearly in the input file, starting from 0. You can map them back to accessions using standard shell commands. That should get the memory use down enough.
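A minimal sketch of the OID-to-accession mapping with standard shell tools, assuming the clustering output is a two-column TSV of representative and member identifiers (the file names, the accessions `accA`/`accB`/`accC`, and the toy data below are made up for illustration):

```shell
# Hypothetical input FASTA; OIDs number these sequences 0, 1, 2 in order.
cat > seqs.faa <<'EOF'
>accA some description
MKV
>accB
MLL
>accC
MAA
EOF

# Hypothetical --oid-output result: representative_oid <TAB> member_oid.
printf '0\t0\n0\t2\n1\t1\n' > clusters_oid.tsv

# Build an oid -> accession table: take headers in input order,
# strip '>' and any description, and number from 0.
grep '^>' seqs.faa | cut -c2- | cut -d' ' -f1 | nl -v0 -w1 -nln > oid2acc.tsv

# Replace both columns via an awk lookup on that table.
awk 'NR==FNR {m[$1]=$2; next} {print m[$1] "\t" m[$2]}' \
    oid2acc.tsv clusters_oid.tsv > clusters_acc.tsv

cat clusters_acc.tsv
# accA  accA
# accA  accC
# accB  accB
```

The oid table costs one line per sequence on disk instead of one accession string per sequence in RAM, which is the point of the workaround.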
https://github.com/bbuchfink/diamond/wiki/How-to-cluster-huge-datasets
@beazerj The latest release has a new feature to run linclust in parallel on multiple nodes, which may be interesting for you. Sensitivity should also be substantially improved, and you can probably do without the expensive all-vs-all steps. It is still experimental, so handle with care.