
memory leaks/crashes with large, sparse matrices

npdoty opened this issue 7 years ago · 2 comments

When I try to create an activity matrix of the dnsext mailing list at IETF, Python takes more and more memory until, a little north of 40 GB of usage, it's killed by my OS.

Could we use a different data structure from pandas or numpy to handle very large matrices that are mostly 0s? They're taking significant disk space and, more importantly, massive amounts of memory, which sometimes makes operations fail to complete.

npdoty avatar Apr 04 '18 00:04 npdoty

Totally! What you'll want to look at are the SciPy sparse matrix formats.

https://docs.scipy.org/doc/scipy/reference/sparse.html
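For reference, a minimal sketch of what that could look like for an activity matrix. The dimensions and variable names below are made up for illustration, not BigBang's actual code:

```python
# Sketch only: building a sender-by-date activity matrix as a SciPy sparse
# matrix instead of a dense numpy/pandas one. The names (rows, cols, counts)
# and dimensions are hypothetical.
import numpy as np
from scipy import sparse

n_senders, n_dates = 5000, 10000          # hypothetical dimensions
rows = np.array([0, 0, 3, 42])            # sender indices of nonzero cells
cols = np.array([1, 7, 7, 9999])          # date indices of nonzero cells
counts = np.array([2, 1, 5, 1])           # message counts at those cells

# COO is convenient for construction; convert to CSR for arithmetic/slicing.
activity = sparse.coo_matrix((counts, (rows, cols)),
                             shape=(n_senders, n_dates)).tocsr()

print(activity.nnz, "nonzero entries out of", n_senders * n_dates)
```

COO is cheap to build from (row, col, value) triples; CSR is the better format once you start slicing rows or doing arithmetic, and neither stores the zeros.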

sbenthall avatar Apr 04 '18 00:04 sbenthall

pandas also has support for sparse data structures now, so this might be relatively straightforward. It requires specifying the fill value you expect to be ubiquitous (like 0 or NaN), and it has some limitations on dtype.
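A minimal sketch of the pandas route, assuming a dense frame called `activity_df` (a stand-in name, not BigBang's):

```python
# Sketch only: converting a mostly-zero activity DataFrame to pandas'
# sparse dtype. fill_value is the ubiquitous value you expect (0 here).
import numpy as np
import pandas as pd

activity_df = pd.DataFrame(np.zeros((1000, 1000)))
activity_df.iloc[3, 7] = 5.0                   # a few nonzero cells

sparse_df = activity_df.astype(pd.SparseDtype("float", fill_value=0.0))

print(sparse_df.sparse.density)                # fraction of cells actually stored
dense_again = sparse_df.sparse.to_dense()      # convert back when an op needs it
```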

I'm currently using the debugger to try to catch exactly where the memory is ballooning and Python is crashing.
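Not how BigBang does it, but as a lighter-weight alternative to stepping through the debugger, a `tracemalloc` sketch like this can point at the allocation sites where memory balloons:

```python
# Generic sketch: tracemalloc reports which lines allocate the most memory.
import tracemalloc

tracemalloc.start()

# ... run the matrix-building step that blows up ...

snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics("lineno")[:10]:
    print(stat)                                # top 10 allocation sites
```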

npdoty avatar Jan 09 '19 00:01 npdoty