memory leaks/crashes with large, sparse matrices
While trying to create an activity matrix of the IETF dnsext mailing list, Python's memory usage grows steadily until, a little north of 40 GB, the process is killed by my OS.
Could we use a different data structure than Pandas or NumPy arrays to handle very large matrices that are mostly zeros? It's taking significant disk space and, more importantly, massive amounts of memory, which sometimes makes operations fail to complete.
Totally! What you'll want to look at are the SciPy sparse matrix formats:
https://docs.scipy.org/doc/scipy/reference/sparse.html
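Here's a minimal sketch of how that could look for an activity matrix. The dimensions and the `activity` name are hypothetical stand-ins for whatever BigBang actually builds; the point is that only nonzero cells consume memory.

```python
from scipy.sparse import dok_matrix

# Hypothetical sizes standing in for (senders x dates) in the real matrix.
n_senders, n_dates = 50_000, 10_000

# DOK (dictionary of keys) is convenient for incremental construction:
# missing cells read as 0 and cost nothing.
activity = dok_matrix((n_senders, n_dates), dtype=int)
activity[17, 42] += 1  # e.g. sender 17 posted on date 42

# Convert to CSR once construction is done, for fast row slicing and math.
activity_csr = activity.tocsr()
print(activity_csr.nnz, "nonzero entries")
```

The usual pattern is to build in DOK (or COO from triplets) and convert to CSR/CSC for the actual computations, since the construction-friendly formats are slow for arithmetic.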
pandas also has support for sparse data structures now, so this might be relatively straightforward. It requires specifying which value you expect to be ubiquitous (like 0 or NaN), and it has some limitations on datatype.
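A rough sketch of the pandas side, using a toy frame in place of the real activity matrix; `fill_value` is where you declare the ubiquitous value:

```python
import numpy as np
import pandas as pd

# Toy mostly-zero frame standing in for the activity matrix.
dense = pd.DataFrame(np.zeros((1000, 1000)))
dense.iloc[3, 7] = 2.0

# Convert every column to a sparse dtype; only non-fill values are stored.
sparse = dense.astype(pd.SparseDtype("float", fill_value=0.0))
print(sparse.memory_usage(deep=True).sum(), "bytes sparse vs",
      dense.memory_usage(deep=True).sum(), "bytes dense")

# If SciPy routines are needed, an all-sparse frame round-trips to COO.
coo = sparse.sparse.to_coo()
```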
I'm currently using the debugger to try to catch exactly where the memory is ballooning and Python is crashing.
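One lighter-weight alternative to stepping through with a debugger is the stdlib's `tracemalloc`, which can point at the lines doing the allocating. A minimal sketch; `build_activity_matrix` is a hypothetical stand-in for the failing call, not an actual BigBang function:

```python
import tracemalloc

tracemalloc.start()

# build_activity_matrix(archive)  # run the suspect operation here

# Snapshot current allocations and show the top allocating source lines.
snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics("lineno")[:10]:
    print(stat)
```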