physt

Memory efficiency problem

Open janpipek opened this issue 8 years ago • 7 comments

When creating a histogram from huge data, a huge amount of memory is temporarily allocated, even though no copy should be created.

Suspects:

  • dropna ???
  • weights

janpipek avatar Mar 26 '17 22:03 janpipek

Better in 0.3.26, where the overhead is reduced by 66 % (unnecessary flattens and weights removed).

janpipek avatar Mar 27 '17 09:03 janpipek

What would you class as huge data? I am interested in using the package; it looks like an interesting option. I have a 300 GB N-D array (3 dimensions) that I hold as a chunked dask array. Every time I try to run a histogram, it falls over at the last hurdle. The only way I could see this working was dynamically updating a file, which I started to write, but for me it wasn't trivial! Then I noticed this package and others.

So I just wondered what (here and in the notes) would class as big rather than large data?

Sh4zKh4n avatar Feb 16 '20 13:02 Sh4zKh4n

Hi @Sh4zKh4n, if you fill the histogram sequentially, you should not have a problem with a file of any size; the memory problem was more related to processing one big chunk at a time (which is impossible in your case anyway ;-)). fill_n is your friend. You don't even need to know the number of bins in advance, as I document in https://github.com/janpipek/physt/blob/master/doc/adaptive_histogram.ipynb .

Let me know if you spot any problem or have any ideas for improvement.

janpipek avatar Feb 17 '20 07:02 janpipek
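
For readers following along, here is a minimal sketch of the sequential-filling idea described above. It is plain Python and deliberately does not use physt's API; `fill_chunk` and `bin_ranges` are hypothetical helper names, and skipping NaN values is an assumption made for the illustration. Only one chunk and the running counts are in memory at a time, and bins are created on demand, so the number of bins need not be known in advance.

```python
import math
from collections import Counter


def fill_chunk(counts, values, bin_width):
    """Add one chunk of values to a running fixed-width histogram.

    `counts` maps an integer bin index to the number of entries in
    the interval [index * bin_width, (index + 1) * bin_width).
    """
    for value in values:
        if math.isnan(value):
            continue  # assumption: missing values are simply skipped
        counts[math.floor(value / bin_width)] += 1
    return counts


def bin_ranges(counts, bin_width):
    """Return (left_edge, right_edge, count) for each occupied bin."""
    return [
        (index * bin_width, (index + 1) * bin_width, counts[index])
        for index in sorted(counts)
    ]


# Stand-in for chunks read one at a time from disk or a chunked array.
chunks = [[1.0, 2.5, 11.0], [12.0, 25.0, float("nan")], [-3.0, 1.5]]

counts = Counter()
for chunk in chunks:
    fill_chunk(counts, chunk, bin_width=10)

for left, right, count in bin_ranges(counts, bin_width=10):
    print(f"[{left}, {right}): {count}")
```

In physt itself the per-chunk step would be a call to fill_n on an adaptive histogram, as the linked notebook shows.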

@janpipek So that's exactly the type of thing I was looking for. Is there a way to save to a file instead of holding it in memory? Thanks, I'll have a go at it later on my data set (once I take a break from daddy day care duties; trying to do any coding with a 3yr old and a 6 month old is a nightmare).

Sh4zKh4n avatar Feb 17 '20 08:02 Sh4zKh4n

I should be clearer about that: what I mean is to dynamically update a table, like a pandas file, with values. That way you can come back to the analysis later and also keep the memory footprint down. Cheers, I do appreciate the quick response.

Sh4zKh4n avatar Feb 17 '20 08:02 Sh4zKh4n

If I understand correctly, you want to be able to calculate the histogram once (or in multiple steps) and then re-use it a few times. Sure, histograms can be stored to and loaded from JSON format. An example of how I would go about huge data is here: https://github.com/janpipek/physt/blob/master/doc/interrupted-workflow.ipynb Hope that answers your question :-)

janpipek avatar Feb 17 '20 08:02 janpipek

So @janpipek, that's nearly there. I kind of want to combine the two solutions you have: dynamic updating and saving to file!

Sh4zKh4n avatar Feb 17 '20 09:02 Sh4zKh4n
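
One way the combination asked for here could look is sketched below, under stated assumptions: it uses only the Python standard library, not physt's own JSON support, and the names `load_state`, `save_state` and `process`, the file layout, and the `stop_after` parameter are all hypothetical. After each chunk the running counts are written to a JSON file together with the index of the next chunk, so an interrupted run can resume without reprocessing data.

```python
import json
import math
import os
import tempfile


def load_state(path):
    """Load a saved histogram state, or start a fresh one."""
    if not os.path.exists(path):
        return {"next_chunk": 0, "counts": {}}
    with open(path) as f:
        return json.load(f)


def save_state(path, state):
    """Write to a temporary file first, then swap it into place."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def process(chunks, path, bin_width, stop_after=None):
    """Fill a fixed-width histogram chunk by chunk, saving after each."""
    state = load_state(path)
    for index, chunk in enumerate(chunks):
        if index < state["next_chunk"]:
            continue  # this chunk is already included in the saved counts
        if stop_after is not None and index >= stop_after:
            break  # used here only to simulate an interruption
        for value in chunk:
            key = str(math.floor(value / bin_width))  # JSON keys are strings
            state["counts"][key] = state["counts"].get(key, 0) + 1
        state["next_chunk"] = index + 1
        save_state(path, state)
    return state


chunks = [[1.0, 2.5, 11.0], [12.0, 25.0], [-3.0, 1.5]]
path = os.path.join(tempfile.mkdtemp(), "histogram.json")

process(chunks, path, bin_width=10, stop_after=2)  # stops after two chunks
state = process(chunks, path, bin_width=10)        # resumes at the third chunk
print(state)
```

Saving after every chunk keeps the file always consistent with `next_chunk`; for very many small chunks, saving every few chunks would trade some reprocessing on restart for less I/O.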