
seaborn histplot is running out of memory

Open skoldobskiy opened this issue 3 years ago • 11 comments

Hello! This is my first GitHub bug report, so please let me know if any additional info is needed. I'm trying to use sns.pairplot on a pandas DataFrame, and as a result my computers run out of memory and I lose my terminals along with all the data. The problematic data set (csv) can be downloaded here: https://www.dropbox.com/s/2dian5cipxx6dzu/temp.csv?dl=0 Interestingly, when I use sns.pairplot on another set of parameters with the same or an even greater number of rows, everything works fine.

Seaborn version: 0.11.0 This problem occurred on both my computers (OSx and Win10).

skoldobskiy avatar Oct 18 '20 17:10 skoldobskiy

The problem here is your R_2 variable, which has very many values with rather small variance and then a handful of extreme outliers. As a result, the default binwidth that numpy chooses (with bins="auto") produces 74990593 bins, which hits resource limits when matplotlib tries to draw that many rectangles (I haven't gotten a memory error on my machine, but it does peg CPU at 100% for minutes+).

You could work around it by using a different approach for choosing bins, e.g.:

sns.pairplot(df, diag_kws={"bins": "sqrt"})

Using diag_kind="kde" also avoids the problem.
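To illustrate why a rule like "sqrt" sidesteps the problem, here is a minimal sketch. The data are made up for illustration (a tight cluster plus one outlier) and are not the reporter's csv. The "sqrt" and "sturges" rules derive the bin count from the number of observations only, so an outlier cannot inflate it:

```python
import numpy as np

# Made-up sample: 1000 tightly clustered values plus one far outlier.
x = np.concatenate([np.linspace(20.0, 20.001, 1000), [51.0]])

bin_counts = {}
for rule in ("sqrt", "sturges"):
    # These rules size the bins from len(x), not from the spread of the data.
    edges = np.histogram_bin_edges(x, bins=rule)
    bin_counts[rule] = len(edges) - 1

print(bin_counts)
```

Both counts stay in the tens here, whereas a rule driven by the interquartile range can ask for millions of bins on data shaped like this.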

I'm not sure if it would be better to have some hard cap on the number of bins that histogram tries to draw (what would be the API?) or to issue a warning when numpy chooses a large (what threshold?) number of bins, so it's more obvious what's happening when something breaks.

mwaskom avatar Oct 19 '20 15:10 mwaskom

Another possible solution may be to not draw bars with 0 observations (the histogram binning itself doesn't take too much time, the bottleneck is all in matplotlib), although I am not sure if that would cause problems for existing downstream code (e.g. that adds count labels to a plot).
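A rough sketch of that idea, done outside seaborn: compute the histogram with numpy, then keep only the bins that contain observations before handing anything to matplotlib. The helper name `nonempty_bars` and the sample data are hypothetical, and this is not how seaborn currently draws histograms:

```python
import numpy as np

def nonempty_bars(values, bins):
    """Return (left edges, widths, counts) for bins holding at least one value."""
    counts, edges = np.histogram(values, bins=bins)
    keep = counts > 0
    lefts = edges[:-1][keep]
    widths = np.diff(edges)[keep]
    return lefts, widths, counts[keep]

# Made-up sample: a tight cluster plus one far outlier.
x = np.concatenate([np.linspace(20.0, 20.001, 1000), [51.0]])

lefts, widths, heights = nonempty_bars(x, bins=1_000_000)
print(f"{len(heights)} bars to draw instead of 1,000,000")

# With matplotlib, the surviving bars could then be drawn as:
# ax.bar(lefts, heights, width=widths, align="edge")
```

The binning still allocates one count per bin, but the number of rectangles matplotlib has to create drops to the number of occupied bins.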

mwaskom avatar Oct 19 '20 15:10 mwaskom

Thanks!

Should I close this issue?

skoldobskiy avatar Oct 31 '20 15:10 skoldobskiy

Relevant upstream issue https://github.com/numpy/numpy/issues/11879

mwaskom avatar Jan 07 '21 18:01 mwaskom

Hello,

We just hit this memory issue and, after reading about it, I think it might be useful to explain how we found it, as it might help find a possible solution. Basically, we encountered it when we moved from distplot to histplot, as distplot is now deprecated. In other words, using the same data and distplot, we don't hit this issue.

For example, using the report in https://github.com/mwaskom/seaborn/issues/2424, I had to ctrl-c after almost 12hrs:

$ /usr/bin/time -v python -c 'import seaborn as sns; sns.histplot([20.002347, 20.002347, 51.004152, 19.00218, 20.002346])'
^C
Command terminated by signal 2
	Command being timed: "python -c import seaborn as sns; sns.histplot([20.002347, 20.002347, 51.004152, 19.00218, 20.002346])"
	User time (seconds): 37711.48
	System time (seconds): 3495.40
	Percent of CPU this job got: 99%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 11:26:54
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 306541212
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 2148925877
	Voluntary context switches: 5810
	Involuntary context switches: 158014
	Swaps: 0
	File system inputs: 0
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

However, using the same data with distplot actually seems to work just fine (apart from the expected deprecation warning):

$ /usr/bin/time -v python -c 'import seaborn as sns; sns.distplot([20.002347, 20.002347, 51.004152, 19.00218, 20.002346])'
/home/qiita_test/miniconda3/envs/qiime2-2021.4/lib/python3.8/site-packages/seaborn/distributions.py:2557: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
  warnings.warn(msg, FutureWarning)
	Command being timed: "python -c import seaborn as sns; sns.distplot([20.002347, 20.002347, 51.004152, 19.00218, 20.002346])"
	User time (seconds): 3.34
	System time (seconds): 6.07
	Percent of CPU this job got: 286%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:03.28
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 114328
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 1
	Minor (reclaiming a frame) page faults: 45271
	Voluntary context switches: 5566
	Involuntary context switches: 174859
	Swaps: 0
	File system inputs: 17800
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

Now, following the instructions here actually solves the issue for the test data, see below; sadly, it doesn't for our data:

$ /usr/bin/time -v python -c 'import seaborn as sns; sns.histplot([20.002347, 20.002347, 51.004152, 19.00218, 20.002346], bins=2, kde=False)'
	Command being timed: "python -c import seaborn as sns; sns.histplot([20.002347, 20.002347, 51.004152, 19.00218, 20.002346], bins=2, kde=False)"
	User time (seconds): 3.29
	System time (seconds): 5.95
	Percent of CPU this job got: 272%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:03.39
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 115712
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 44365
	Voluntary context switches: 5375
	Involuntary context switches: 333760
	Swaps: 0
	File system inputs: 0
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

Anyway, any suggestions would be appreciated.

[edit] seaborn: 0.11.1 in Python 3.8.8

antgonza avatar May 21 '21 12:05 antgonza

distplot capped the number of automatically chosen bars at 50, so it never ran into this issue. But that felt like a hack without particular justification. There's some discussion of adopting a similar heuristic in numpy. I'd really prefer to have seaborn fully delegate the histogram computation and not have to explain that "auto" in seaborn means something different than "auto" in numpy. But that issue also seems stalled.

mwaskom avatar May 21 '21 13:05 mwaskom

@mwaskom, thank you for the prompt reply. Interestingly, we always set the number of bins in our code but we still see a huge difference between distplot and histplot - pointing to other differences. I'm happy to explore more to try to find them with our data or share it, if easier (any format preference?).

For example, changing that line takes it from Maximum resident set size (kbytes): 15965416 to Maximum resident set size (kbytes): 3607244 in our data.

antgonza avatar May 21 '21 13:05 antgonza

It looks like your code is using the same reference rule as numpy is, so you're running into the same problem:

x = [20.002347, 20.002347, 51.004152, 19.00218, 20.002346]
bins = np.histogram_bin_edges(x, "fd")
len(bins)
27361303
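For anyone wondering where a number that large comes from, the arithmetic can be checked by hand without allocating the edges. This is a sketch of the Freedman-Diaconis width formula (2 * IQR * n^(-1/3)), not a copy of numpy's implementation:

```python
import numpy as np

x = np.array([20.002347, 20.002347, 51.004152, 19.00218, 20.002346])

# Freedman-Diaconis bin width: 2 * IQR * n^(-1/3)
q25, q75 = np.percentile(x, [25, 75])
iqr = q75 - q25                 # about 1e-6: the middle values nearly coincide
width = 2.0 * iqr * x.size ** (-1.0 / 3.0)

# The single outlier at 51.0 stretches the range to about 32.
data_range = x.max() - x.min()
n_bins = int(np.ceil(data_range / width))

print(f"IQR={iqr:.3g}  width={width:.3g}  range={data_range:.3g}  bins={n_bins:,}")
```

A range of about 32 divided by a width of roughly 1.2e-6 gives on the order of 27 million bins, consistent with the edge count above.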

mwaskom avatar May 21 '21 13:05 mwaskom

My suggestion, since you are wrapping histplot and defining the bins externally, is to do bins = min(bins, MAX_BINS), where MAX_BINS is a reasonable integer that you know makes sense for your application.
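A minimal sketch of such a cap, assuming the wrapper derives its bin count from the Freedman-Diaconis rule. The helper `capped_fd_bins`, the value of `MAX_BINS`, and the sqrt fallback for a zero IQR are all illustrative choices, not part of seaborn or numpy:

```python
import warnings
import numpy as np

MAX_BINS = 200  # hypothetical cap; choose what makes sense for your plots

def capped_fd_bins(values, max_bins=MAX_BINS):
    """Freedman-Diaconis bin count, capped at max_bins, without building edges."""
    values = np.asarray(values, dtype=float)
    data_range = values.max() - values.min()
    if data_range == 0:
        return 1  # all values identical: one bin is enough
    q25, q75 = np.percentile(values, [25, 75])
    width = 2.0 * (q75 - q25) * values.size ** (-1.0 / 3.0)
    if width == 0:
        # IQR is zero, so the rule is undefined; fall back to sqrt(n) bins.
        wanted = int(np.ceil(np.sqrt(values.size)))
    else:
        wanted = int(np.ceil(data_range / width))
    if wanted > max_bins:
        warnings.warn(f"reference rule asked for {wanted:,} bins; using {max_bins}")
    return min(wanted, max_bins)

x = [20.002347, 20.002347, 51.004152, 19.00218, 20.002346]
bins = capped_fd_bins(x)
print(bins)

# The capped integer can then be passed on, e.g. sns.histplot(x, bins=bins)
```

Emitting a warning when the cap kicks in keeps the behaviour visible, so a truncated bin count does not go unnoticed.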

mwaskom avatar May 21 '21 13:05 mwaskom

Workaround:

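    import numpy as np
    import seaborn as sns

    try:
        sns.histplot(x=x, element='step')
    except MemoryError:  # numpy's _ArrayMemoryError is a subclass of MemoryError
        # Fall back to a rule whose bin count depends only on the sample size
        sns.histplot(
            x=x, element='step',
            bins='sturges',
        )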

Demetrio92 avatar Apr 25 '23 15:04 Demetrio92