
Memory stays around after pickle cycle

Open mrocklin opened this issue 4 years ago • 7 comments

Hi Folks,

Related to #43155, I'm running into memory issues when pickling many small pandas dataframes. The following script creates a pandas dataframe, splits it up, pickles each little split, and brings them back again. It then deletes all objects from memory, but something is still sticking around. Here is the script, followed by the output from memory_profiler:

import numpy as np
import pandas as pd
import pickle


@profile
def test():
    df = pd.DataFrame(np.random.random((20000, 1000)))
    df["partitions"] = (df[0] * 10000).astype(int)
    _, groups = zip(*df.groupby("partitions"))
    del df

    groups = [pickle.dumps(group) for group in groups]
    groups = [pickle.loads(group) for group in groups]

    del groups


if __name__ == "__main__":
    test()
python -m memory_profiler memory_issue.py
Filename: memory_issue.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
     7   76.574 MiB   76.574 MiB           1   @profile
     8                                         def test():
     9  229.445 MiB  152.871 MiB           1       df = pd.DataFrame(np.random.random((20000, 1000)))
    10  230.738 MiB    1.293 MiB           1       df["partitions"] = (df[0] * 10000).astype(int)
    11  398.453 MiB  167.715 MiB           1       _, groups = zip(*df.groupby("partitions"))
    12  245.633 MiB -152.820 MiB           1       del df
    13                                         
    14  445.688 MiB   47.273 MiB        8631       groups = [pickle.dumps(group) for group in groups]
    15  712.285 MiB  266.598 MiB        8631       groups = [pickle.loads(group) for group in groups]
    16                                         
    17  557.488 MiB -154.797 MiB           1       del groups

As you can see, we start with about 70 MiB in memory and end with about 550 MiB, despite all relevant objects having been released. The leak grows with the number of groups (scale the 10000 multiplier to move it up or down). Any help or pointers on how to track this down would be welcome.

mrocklin commented on Aug 21 '21

Another datapoint: running your script on OSX, I'm seeing a lot more being released at the end:

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
     7   68.484 MiB   68.484 MiB           1   @profile
     8                                         def test():
     9  221.121 MiB  152.637 MiB           1       df = pd.DataFrame(np.random.random((20000, 1000)))
    10  221.828 MiB    0.707 MiB           1       df["partitions"] = (df[0] * 10000).astype(int)
    11  395.141 MiB  173.312 MiB           1       _, groups = zip(*df.groupby("partitions"))
    12  242.551 MiB -152.590 MiB           1       del df
    13                                         
    14  499.613 MiB  104.137 MiB        8684       groups = [pickle.dumps(group) for group in groups]
    15  915.664 MiB  284.641 MiB        8684       groups = [pickle.loads(group) for group in groups]
    16                                         
    17  286.395 MiB -629.270 MiB           1       del groups

Also, if I add a gc.collect() after del groups, I get another ~40 MB back.

jbrockmendel commented on Aug 22 '21

Same result with gc.collect():

Filename: memory_issue.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
     8   98.180 MiB   98.180 MiB           1   @profile
     9                                         def test():
    10  250.863 MiB  152.684 MiB           1       df = pd.DataFrame(np.random.random((20000, 1000)))
    11  252.039 MiB    1.176 MiB           1       df["partitions"] = (df[0] * 10000).astype(int)
    12  420.848 MiB  168.809 MiB           1       _, groups = zip(*df.groupby("partitions"))
    13  267.980 MiB -152.867 MiB           1       del df
    14                                         
    15  468.211 MiB   47.391 MiB        8643       groups = [pickle.dumps(group) for group in groups]
    16  738.316 MiB  270.105 MiB        8643       groups = [pickle.loads(group) for group in groups]
    17                                         
    18  579.688 MiB -158.629 MiB           1       del groups
    19  528.438 MiB  -51.250 MiB           1       gc.collect()

mrocklin commented on Aug 23 '21

Going through gc.get_objects(), I don't see any big objects left behind.
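
For reference, a rough sketch of that kind of sweep (treating sys.getsizeof as a good-enough proxy for per-object size, which undercounts containers):

import gc
import sys


def safe_sizeof(obj):
    # Some extension types don't implement __sizeof__; treat those as size 0.
    try:
        return sys.getsizeof(obj)
    except TypeError:
        return 0


gc.collect()
# Show the 20 largest objects the collector still tracks after the test runs.
for obj in sorted(gc.get_objects(), key=safe_sizeof, reverse=True)[:20]:
    print(f"{safe_sizeof(obj):>12,d} bytes  {type(obj)}")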

jbrockmendel commented on Aug 23 '21

FYI you should probably run this with MALLOC_TRIM_THRESHOLD_=0 python -m memory_profiler memory_issue.py on Linux, or DYLD_INSERT_LIBRARIES=$(brew --prefix jemalloc)/lib/libjemalloc.dylib python -m memory_profiler memory_issue.py on macOS, to encourage the allocator to release pages back to the OS. memory_profiler just tracks the RSS of the process, nothing fancier, so it's possible the memory has been freed as far as pandas can release it, just not handed back to the OS.
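
To illustrate the distinction, here is a minimal sketch (assuming psutil is installed; the malloc_trim call is Linux/glibc-specific) that compares the live Python-level allocations tracked by tracemalloc with the process RSS, then asks the allocator to hand free pages back:

import ctypes
import gc
import tracemalloc

import numpy as np
import pandas as pd
import psutil


def rss_mib():
    # Resident set size of the process, which is all memory_profiler reports.
    return psutil.Process().memory_info().rss / 2**20


tracemalloc.start()

df = pd.DataFrame(np.random.random((20000, 1000)))
df["partitions"] = (df[0] * 10000).astype(int)
_, groups = zip(*df.groupby("partitions"))
del df, groups
gc.collect()

live, _peak = tracemalloc.get_traced_memory()
print(f"live Python allocations (tracemalloc): {live / 2**20:8.1f} MiB")
print(f"RSS before malloc_trim:                {rss_mib():8.1f} MiB")

# glibc-specific: ask the allocator to return free arenas to the OS.
ctypes.CDLL("libc.so.6").malloc_trim(0)
print(f"RSS after malloc_trim:                 {rss_mib():8.1f} MiB")

If RSS drops sharply after malloc_trim while tracemalloc already reported a small number, the memory was freed at the Python level and merely retained by the allocator.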

gjoseph92 commented on Aug 24 '21

I believe I have already run this with MALLOC_TRIM_THRESHOLD_=0 and saw the same results, but I should verify.

mrocklin commented on Aug 24 '21

Yes, same result on my Linux/Ubuntu machine running mambaforge.

mrocklin commented on Aug 24 '21

Is there a viable non-pickle alternative?

When I change pickle.dumps(group) to pickle.dumps(group.values) to pickle the underlying ndarrays, I end up with 50-60 MB less than I do when pickling the DataFrames (and gc.collect() no longer reclaims anything), but that's still 2-3 times the original footprint.
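
For reference, one way to read that variant is the sketch below. It keeps each group's index and columns alive in Python rather than serializing them, and .values upcasts the mixed float/int columns to a single float64 array, so it is not a drop-in replacement for pickling the DataFrames:

import pickle

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random((20000, 1000)))
df["partitions"] = (df[0] * 10000).astype(int)
_, groups = zip(*df.groupby("partitions"))
del df

# Serialize only each group's underlying ndarray (dtypes are not preserved).
payloads = [(pickle.dumps(g.values), g.index, g.columns) for g in groups]
del groups

# Rebuild the frames on the way back from the retained index/columns.
restored = [
    pd.DataFrame(pickle.loads(buf), index=idx, columns=cols)
    for buf, idx, cols in payloads
]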

jbrockmendel commented on Aug 28 '21

Check out this response from Stack Overflow.

ademhilmibozkurt commented on Dec 04 '24

I tried different libraries besides pickle: dill and joblib.

pickle library results

import pandas as pd
import numpy as np
import pickle
import dill
from io import BytesIO
import joblib
from memory_profiler import profile

@profile
def test1():
    df = pd.DataFrame(np.random.random((20000, 1000)))
    df["partitions"] = (df[0] * 10000).astype(int)
    _, groups = zip(*df.groupby("partitions"))
    del df
    
    groups = [pickle.dumps(group) for group in groups]
    groups = [pickle.loads(group) for group in groups]
    del groups
Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     9    167.2 MiB    167.2 MiB           1   @profile
    10                                         def test1():
    11    319.8 MiB    152.6 MiB           1       df = pd.DataFrame(np.random.random((20000, 1000)))
    12    320.3 MiB      0.4 MiB           1       df["partitions"] = (df[0] * 10000).astype(int)
    13    633.2 MiB    313.0 MiB           1       _, groups = zip(*df.groupby("partitions"))
    14    480.7 MiB   -152.6 MiB           1       del df
    15                                             
    16    673.2 MiB   -103.4 MiB        8631       groups = [pickle.dumps(group) for group in groups]
    17    812.1 MiB    -42.9 MiB        8631       groups = [pickle.loads(group) for group in groups]
    18    248.5 MiB   -563.6 MiB           1       del groups

I also tried a different approach, pickling the whole tuple of groups at once:

groups = pickle.dumps(groups)
groups = pickle.loads(groups)
Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     9    167.4 MiB    167.4 MiB           1   @profile
    10                                         def test1():
    11    320.1 MiB    152.7 MiB           1       df = pd.DataFrame(np.random.random((20000, 1000)))
    12    320.6 MiB      0.4 MiB           1       df["partitions"] = (df[0] * 10000).astype(int)
    13    634.2 MiB    313.6 MiB           1       _, groups = zip(*df.groupby("partitions"))
    14    481.6 MiB   -152.6 MiB           1       del df
    15                                             
    16    354.1 MiB   -127.4 MiB           1       groups = pickle.dumps(groups)
    17    363.4 MiB      9.2 MiB           1       groups = pickle.loads(groups)
    18    197.6 MiB   -165.7 MiB           1       del groups

joblib library results

@profile
def test3():
    df = pd.DataFrame(np.random.random((20000, 1000)))
    df["partitions"] = (df[0] * 10000).astype(int)
    _, groups = zip(*df.groupby("partitions"))
    del df
    
    bytes_container = BytesIO()
    groups = joblib.dump(groups, bytes_container)
    bytes_container.seek(0)
    groups = bytes_container.read()
    del groups
Line #   Mem usage    Increment  Occurrences   Line Contents
=============================================================
    31    167.7 MiB    167.7 MiB           1   @profile
    32                                         def test3():
    33    320.3 MiB    152.6 MiB           1       df = pd.DataFrame(np.random.random((20000, 1000)))
    34    320.8 MiB      0.5 MiB           1       df["partitions"] = (df[0] * 10000).astype(int)
    35    634.2 MiB    313.4 MiB           1       _, groups = zip(*df.groupby("partitions"))
    36    481.6 MiB   -152.6 MiB           1       del df
    37                                             
    38    481.6 MiB      0.0 MiB           1       bytes_container = BytesIO()
    39    352.5 MiB   -129.1 MiB           1       groups = joblib.dump(groups, bytes_container)
    40    352.5 MiB      0.0 MiB           1       bytes_container.seek(0)
    41    507.4 MiB    154.9 MiB           1       groups = bytes_container.read()
    42    352.5 MiB   -154.9 MiB           1       del groups
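
One note on the joblib test: it reads the raw bytes back out of the buffer but never deserializes them, so its numbers aren't directly comparable to the pickle round trips. A symmetric version would look roughly like this (a sketch; joblib.load accepts a file object):

from io import BytesIO

import joblib

bytes_container = BytesIO()
joblib.dump(groups, bytes_container)   # serialize the tuple of group frames
bytes_container.seek(0)
groups = joblib.load(bytes_container)  # deserialize, mirroring pickle.loads
del groups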

dill library results

@profile
def test2():
    df = pd.DataFrame(np.random.random((20000, 1000)))
    df["partitions"] = (df[0] * 10000).astype(int)
    _, groups = zip(*df.groupby("partitions"))
    del df

    groups = dill.dumps(groups)
    groups = dill.loads(groups)
    del groups
Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    20    167.2 MiB    167.2 MiB           1   @profile
    21                                         def test2():
    22    319.8 MiB    152.6 MiB           1       df = pd.DataFrame(np.random.random((20000, 1000)))
    23    320.2 MiB      0.4 MiB           1       df["partitions"] = (df[0] * 10000).astype(int)
    24    632.8 MiB    312.6 MiB           1       _, groups = zip(*df.groupby("partitions"))
    25    480.3 MiB   -152.6 MiB           1       del df
    26                                         
    27    358.0 MiB   -122.2 MiB           1       groups = dill.dumps(groups)
    28    369.6 MiB     11.6 MiB           1       groups = dill.loads(groups)
    29    207.1 MiB   -162.6 MiB           1       del groups

I think this isn't about which library we use. Python itself holds on to extra memory for its own processing, and gc.collect() isn't a reliable way to return memory to the OS. Please correct me if I've taken the wrong approach.

ademhilmibozkurt commented on Dec 04 '24