
numpy.core._exceptions.MemoryError

Open r00bit opened this issue 3 years ago • 4 comments

Hello, I have a large mgf file (25 GB) and am trying to calculate the similarity score between each pair of spectra in the file. However, when calling CosineGreedy() I get this error:

__self.scores = numpy.empty([self.n_rows, self.n_cols], dtype="object")
numpy.core._exceptions.MemoryError: Unable to allocate 264. TiB for an array with shape (6023715, 6023715) and data type object

I tried to fix it by setting overcommit_memory with this command: echo 1 > /proc/sys/vm/overcommit_memory, but it didn't work. Could you please tell me how I can fix it? Thanks!

OS: Ubuntu Server

r00bit avatar Jun 24 '22 16:06 r00bit

Aside from buying a load more RAM :-)

Given that the vast majority of scores will be zero, and the scores are symmetric, I’d be tempted to write my own loop over the pairs (itertools has efficient ways of doing this), computing the score, and then only storing the non-zero pairs (using a dict or something).

sdrogers avatar Jun 24 '22 18:06 sdrogers
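
The loop suggested above can be sketched as follows. This is a minimal illustration, not matchms code: `toy_score` is a hypothetical stand-in for a real similarity measure, and each spectrum is represented simply as a set of integer peak positions.

```python
from itertools import combinations


def toy_score(a, b):
    """Hypothetical stand-in for a spectral similarity: Jaccard overlap of peak sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def sparse_pairwise_scores(spectra, score_fn):
    """Score each unordered pair once and keep only the non-zero results."""
    scores = {}
    for (i, a), (j, b) in combinations(enumerate(spectra), 2):
        s = score_fn(a, b)
        if s != 0:
            scores[(i, j)] = s  # i < j; (j, i) is implied by symmetry
    return scores


spectra = [{100, 150, 200}, {100, 150, 300}, {400, 500}]
scores = sparse_pairwise_scores(spectra, toy_score)
print(scores)
```

Memory then scales with the number of non-zero pairs rather than with n², although the loop still visits all n(n-1)/2 pairs, so the runtime remains quadratic.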

Be interested to know why numpy needs to reserve that much memory to store references to 36 million objects. Seems an awful lot.

sdrogers avatar Jun 24 '22 18:06 sdrogers
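
The figure in the error message can be checked with a little arithmetic. The array has 6,023,715 × 6,023,715 entries, which is about 3.6 × 10^13 (tens of trillions rather than millions). The sketch below assumes 8 bytes per object reference, as on a typical 64-bit Python build.

```python
n = 6_023_715
bytes_per_ref = 8  # assumption: 64-bit pointers

n_entries = n * n
tib = n_entries * bytes_per_ref / 2**40
print(f"{n_entries:.3e} entries, {tib:.1f} TiB")

# Storing each unordered pair only once roughly halves that,
# which is still far beyond the RAM of any single machine.
n_pairs = n * (n - 1) // 2
print(f"{n_pairs * bytes_per_ref / 2**40:.1f} TiB for the upper triangle")
```

Under that assumption the dense array of references alone accounts for the 264 TiB in the message, before any score objects are created.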

We are working on a way to use sparse arrays instead of numpy arrays precisely for this reason. It is still not finished and got delayed a bit... but you could already give the existing development code a try --> https://github.com/matchms/matchms/pull/327

There are still some pieces missing. One of them (which is important here) is that the zeros from CosineGreedy() should not be stored at all, not even temporarily. I aim to add that over the summer!

Right now I only see one workaround. I imagine you won't actually need all scores, so you could run a first step that looks for spectral pairs within a certain mass range (that would also drastically speed up the calculation!). With this step, followed by CosineGreedy, you could already run the sparse pipeline described in the link I added here.

florian-huber avatar Jun 24 '22 20:06 florian-huber
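
The pre-filtering step can be sketched independently of matchms. The example below is an illustration under assumptions: spectra are reduced to a list of precursor m/z values, and `candidate_pairs` and `tolerance` are hypothetical names, not matchms API. Sorting the values and using binary search means that only pairs within the tolerance are ever generated; the expensive similarity score would then be computed for those pairs only.

```python
from bisect import bisect_right


def candidate_pairs(precursor_mz, tolerance):
    """Yield index pairs (i, j), i < j, whose precursor m/z differ by at most `tolerance`."""
    # Indices of the spectra, ordered by precursor m/z.
    order = sorted(range(len(precursor_mz)), key=precursor_mz.__getitem__)
    sorted_mz = [precursor_mz[k] for k in order]
    for pos, mz in enumerate(sorted_mz):
        # First sorted position whose m/z exceeds mz + tolerance.
        end = bisect_right(sorted_mz, mz + tolerance, lo=pos + 1)
        for other in range(pos + 1, end):
            i, j = order[pos], order[other]
            yield (min(i, j), max(i, j))


precursors = [300.10, 512.30, 300.15, 712.00, 512.38]
pairs = sorted(candidate_pairs(precursors, tolerance=0.1))
print(pairs)
```

With a narrow tolerance, the number of candidate pairs is typically a small fraction of all n(n-1)/2 pairs, which reduces both the memory needed for the scores and the number of similarity computations.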

Thanks for your explanation. Yes, maybe I will first filter out some spectra using a 0.1 Da tolerance and then call CosineGreedy. Is there any setting for computing the similarity between two mgf files? I mean calculating the similarity scores only between spectra from different files, not between spectra within the same file.

r00bit avatar Jun 24 '22 21:06 r00bit

In the meantime, all of this has become possible with matchms >= 0.18, using the sparse scores.

florian-huber avatar Apr 19 '23 19:04 florian-huber