numpy.core._exceptions.MemoryError
Hello, I have a large mgf file (25 GB) and am trying to calculate the similarity score between each pair of spectra in the file. However, calling CosineGreedy() gives this error:

__self.scores = numpy.empty([self.n_rows, self.n_cols], dtype="object")
numpy.core._exceptions.MemoryError: Unable to allocate 264. TiB for an array with shape (6023715, 6023715) and data type object

I tried to fix it by setting overcommit_memory with this command:

echo 1 > /proc/sys/vm/overcommit_memory

but it didn't work. Could you please tell me how I can fix it? Thanks!

OS: Ubuntu Server
Aside from buying a load more RAM :-)
Given that the vast majority of scores will be zero, and the scores are symmetric, I’d be tempted to write my own loop over the pairs (itertools has efficient ways of doing this), computing the score, and then only storing the non-zero pairs (using a dict or something).
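A minimal sketch of that idea, using only the standard library. The score function `toy_cosine` and the dict-based spectra are hypothetical stand-ins for illustration, not the matchms API; any pairwise score could be passed in instead.

```python
from itertools import combinations
from math import sqrt


def toy_cosine(a, b):
    """Hypothetical stand-in score: cosine over peaks with identical m/z keys."""
    shared = a.keys() & b.keys()
    if not shared:
        return 0.0
    dot = sum(a[mz] * b[mz] for mz in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


def sparse_pairwise_scores(spectra, score_fn):
    """Score each unordered pair once and keep only the non-zero results."""
    scores = {}
    # combinations() yields every pair (i, j) with i < j exactly once,
    # which exploits the symmetry of the score.
    for (i, a), (j, b) in combinations(enumerate(spectra), 2):
        s = score_fn(a, b)
        if s != 0.0:
            scores[(i, j)] = s
    return scores


# Toy spectra as {m/z: intensity} dicts (assumed representation).
spectra = [
    {100.0: 1.0, 200.0: 0.5},
    {100.0: 1.0, 300.0: 0.2},
    {400.0: 1.0},
]
scores = sparse_pairwise_scores(spectra, toy_cosine)
print(scores)
```

This keeps memory proportional to the number of non-zero pairs, but it still evaluates the score for every pair, so for millions of spectra the run time would remain a problem unless the candidate pairs are narrowed down first.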
Be interested to know why numpy needs to reserve that much memory to store references to 36 million objects. Seems an awful lot.
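For what it's worth, the figure in the error message can be checked by hand. The array is a full n x n matrix, so it holds n squared references (roughly 36 trillion, not 36 million). The sketch below assumes 8 bytes per object reference, as on a typical 64-bit build.

```python
n = 6_023_715            # rows and columns, from the shape in the error message
pointer_bytes = 8        # assumed size of one object reference (64-bit build)

n_entries = n * n        # a dense matrix stores every pair, including zeros
total_bytes = n_entries * pointer_bytes
tib = total_bytes / 2**40

print(f"{n_entries:.3e} entries, about {tib:.0f} TiB")
```

That reproduces the 264 TiB in the error, and it covers only the references, not the score objects they would point to.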
We are working on a way to use sparse arrays instead of numpy arrays precisely for this reason. It is still not finished and got delayed a bit... but you could already give the existing development code a try --> https://github.com/matchms/matchms/pull/327
There are still some pieces missing. One (which is important here) is making sure that the zeros from CosineGreedy() are not even stored temporarily. I aim to add that over summer!
Right now I only see one workaround. I imagine you won't actually need all scores, so you could run a first step that looks for spectral pairs within a certain mass range (that would also drastically speed up the calculation!). With this step, followed by CosineGreedy, you could already run the sparse pipeline described in the link I added here.
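To illustrate the pre-filtering step, here is a library-free sketch that finds candidate pairs whose precursor masses lie within a tolerance of each other. It is not the matchms implementation; the function name `candidate_pairs` and the plain list of precursor m/z values are assumptions made for the example. The expensive score would then be computed only for the pairs it yields.

```python
from bisect import bisect_right


def candidate_pairs(precursor_mz, tolerance):
    """Yield index pairs (i, j), i < j, whose precursor m/z differ by <= tolerance."""
    # Sort indices by mass so that close masses become neighbours.
    order = sorted(range(len(precursor_mz)), key=precursor_mz.__getitem__)
    sorted_mz = [precursor_mz[k] for k in order]
    for pos, mz in enumerate(sorted_mz):
        # Everything in sorted_mz[pos + 1:end] lies within the tolerance window.
        end = bisect_right(sorted_mz, mz + tolerance, lo=pos + 1)
        for other in range(pos + 1, end):
            i, j = order[pos], order[other]
            yield (min(i, j), max(i, j))


# Hypothetical precursor m/z values, one per spectrum.
precursor_mz = [500.00, 300.00, 500.05, 300.20, 499.98]
pairs = sorted(candidate_pairs(precursor_mz, tolerance=0.1))
print(pairs)
```

Because the masses are sorted once and each window is located with a binary search, the work grows with the number of candidate pairs rather than with the square of the number of spectra.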
Thanks for your explanation. Yes, I could do some filtering first with a 0.1 Da tolerance to exclude some spectra, and then call CosineGreedy. Is there any configuration for finding the similarity between two mgf files? I mean calculating the similarity score only between spectra from different files, not between spectra within the same file.
In the meantime, all of this has become possible with matchms >= 0.18 using the sparse scores.