Memory stays around after pickle cycle
Hi Folks,
Related to #43155, I'm running into memory issues when pickling many small pandas DataFrames. The following script creates a DataFrame, splits it into groups, pickles each group, and then loads them all back again. It then deletes every object, but something is still sticking around. Here is the script, followed by the output of memory_profiler:
import numpy as np
import pandas as pd
import pickle


@profile
def test():
    df = pd.DataFrame(np.random.random((20000, 1000)))
    df["partitions"] = (df[0] * 10000).astype(int)
    _, groups = zip(*df.groupby("partitions"))
    del df

    groups = [pickle.dumps(group) for group in groups]
    groups = [pickle.loads(group) for group in groups]

    del groups


if __name__ == "__main__":
    test()
python -m memory_profiler memory_issue.py
Filename: memory_issue.py
Line # Mem usage Increment Occurences Line Contents
============================================================
7 76.574 MiB 76.574 MiB 1 @profile
8 def test():
9 229.445 MiB 152.871 MiB 1 df = pd.DataFrame(np.random.random((20000, 1000)))
10 230.738 MiB 1.293 MiB 1 df["partitions"] = (df[0] * 10000).astype(int)
11 398.453 MiB 167.715 MiB 1 _, groups = zip(*df.groupby("partitions"))
12 245.633 MiB -152.820 MiB 1 del df
13
14 445.688 MiB 47.273 MiB 8631 groups = [pickle.dumps(group) for group in groups]
15 712.285 MiB 266.598 MiB 8631 groups = [pickle.loads(group) for group in groups]
16
17 557.488 MiB -154.797 MiB 1 del groups
As you can see, we start at roughly 77 MiB in memory and end at roughly 557 MiB, despite all relevant objects having been released. The leftover memory grows with the number of groups (scale the 10000 up or down to move the leak with it). Any help or pointers on how to track this down would be welcome.
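For reference, here is a sketch of how that growth could be measured: a parameterized version of the same round trip. The n_groups parameter and rss_mib helper are mine, and psutil is an assumption (though memory_profiler itself depends on it). Residuals from earlier calls will inflate later ones, so treat the numbers as rough.

import gc
import os
import pickle

import numpy as np
import pandas as pd
import psutil  # assumed available: memory_profiler depends on it


def rss_mib():
    # Resident set size of this process, in MiB.
    return psutil.Process(os.getpid()).memory_info().rss / 2**20


def residual_after_round_trip(n_groups):
    # Same steps as the script above, with the group count as a parameter.
    df = pd.DataFrame(np.random.random((20000, 1000)))
    df["partitions"] = (df[0] * n_groups).astype(int)
    _, groups = zip(*df.groupby("partitions"))
    del df
    groups = [pickle.loads(pickle.dumps(g)) for g in groups]
    del groups
    gc.collect()
    return rss_mib()


if __name__ == "__main__":
    baseline = rss_mib()
    for n in (100, 1000, 10000):
        print(f"{n:>6} groups: {residual_after_round_trip(n) - baseline:.1f} MiB residual")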
Another datapoint: running your script on macOS, I'm seeing a lot more memory being released at the end:
Line # Mem usage Increment Occurences Line Contents
============================================================
7 68.484 MiB 68.484 MiB 1 @profile
8 def test():
9 221.121 MiB 152.637 MiB 1 df = pd.DataFrame(np.random.random((20000, 1000)))
10 221.828 MiB 0.707 MiB 1 df["partitions"] = (df[0] * 10000).astype(int)
11 395.141 MiB 173.312 MiB 1 _, groups = zip(*df.groupby("partitions"))
12 242.551 MiB -152.590 MiB 1 del df
13
14 499.613 MiB 104.137 MiB 8684 groups = [pickle.dumps(group) for group in groups]
15 915.664 MiB 284.641 MiB 8684 groups = [pickle.loads(group) for group in groups]
16
17 286.395 MiB -629.270 MiB 1 del groups
Also, if I add a gc.collect() after del groups, I get another ~40 MB back.
Same run with gc.collect():
Filename: memory_issue.py
Line # Mem usage Increment Occurences Line Contents
============================================================
8 98.180 MiB 98.180 MiB 1 @profile
9 def test():
10 250.863 MiB 152.684 MiB 1 df = pd.DataFrame(np.random.random((20000, 1000)))
11 252.039 MiB 1.176 MiB 1 df["partitions"] = (df[0] * 10000).astype(int)
12 420.848 MiB 168.809 MiB 1 _, groups = zip(*df.groupby("partitions"))
13 267.980 MiB -152.867 MiB 1 del df
14
15 468.211 MiB 47.391 MiB 8643 groups = [pickle.dumps(group) for group in groups]
16 738.316 MiB 270.105 MiB 8643 groups = [pickle.loads(group) for group in groups]
17
18 579.688 MiB -158.629 MiB 1 del groups
19 528.438 MiB -51.250 MiB 1 gc.collect()
Going through gc.get_objects(), I don't see any big objects left behind.
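For what it's worth, this is roughly the kind of scan I mean; a sketch (the biggest_survivors helper is mine), using .nbytes for ndarrays since sys.getsizeof doesn't count their buffers:

import gc
import sys

import numpy as np


def biggest_survivors(top=20):
    # Rank live objects by apparent size. sys.getsizeof misses numpy
    # buffer sizes, so fall back to .nbytes for ndarrays.
    sized = []
    for obj in gc.get_objects():
        try:
            size = obj.nbytes if isinstance(obj, np.ndarray) else sys.getsizeof(obj)
        except Exception:
            continue
        sized.append((size, type(obj).__name__))
    return sorted(sized, reverse=True)[:top]


for size, name in biggest_survivors():
    print(f"{size / 2**20:10.3f} MiB  {name}")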
FYI, you should probably run this with MALLOC_TRIM_THRESHOLD_=0 python -m memory_profiler memory_issue.py on Linux, or DYLD_INSERT_LIBRARIES=$(brew --prefix jemalloc)/lib/libjemalloc.dylib python -m memory_profiler memory_issue.py on macOS, to encourage the allocator to release pages back to the OS. memory_profiler just tracks the RSS of the process, nothing fancier, so it's possible the memory has been freed as far as pandas can get it, just not fully released.
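On glibc you can also trigger the trim from inside the process rather than via the environment variable; a minimal sketch using ctypes (glibc-only, wrapped so it degrades to a no-op elsewhere):

import ctypes


def trim_glibc_heap():
    # malloc_trim(0) asks glibc to return free heap pages to the OS.
    try:
        ctypes.CDLL("libc.so.6").malloc_trim(0)
    except (OSError, AttributeError):
        pass

Calling trim_glibc_heap() right after del groups and gc.collect() should tell you whether the residual RSS is allocator retention or genuinely live objects.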
I believe that I have already run this with MALLOC_TRIM_THRESHOLD_=0 and saw the same results, but I should verify.
Yes, same result on my Linux/Ubuntu machine running mambaforge.
Is there a viable non-pickle alternative?
When I change pickle.dumps(group) to pickle.dumps(group.values) to pickle the underlying ndarrays, I end up with 50-60 MB less than when pickling the DataFrames (and gc.collect() no longer reclaims anything), but that's still 2-3 times the original footprint.
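For completeness, a sketch of that variant which also round-trips the index and columns so the DataFrames can be rebuilt afterwards (the to_parts/from_parts names are mine):

import pickle

import pandas as pd


def to_parts(df):
    # Pickle the raw ndarray plus index and columns, rather than
    # the DataFrame object itself.
    return pickle.dumps((df.values, df.index, df.columns))


def from_parts(buf):
    values, index, columns = pickle.loads(buf)
    return pd.DataFrame(values, index=index, columns=columns)


# Drop-in replacement for the round trip in the original script:
# groups = [to_parts(group) for group in groups]
# groups = [from_parts(buf) for buf in groups]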
Check out this response from Stack Overflow:
I tried libraries other than pickle: dill and joblib.
pickle library results:
import pandas as pd
import numpy as np
import pickle
import dill
from io import BytesIO
import joblib
from memory_profiler import profile


@profile
def test1():
    df = pd.DataFrame(np.random.random((20000, 1000)))
    df["partitions"] = (df[0] * 10000).astype(int)
    _, groups = zip(*df.groupby("partitions"))
    del df

    groups = [pickle.dumps(group) for group in groups]
    groups = [pickle.loads(group) for group in groups]
    del groups
Line # Mem usage Increment Occurrences Line Contents
=============================================================
9 167.2 MiB 167.2 MiB 1 @profile
10 def test1():
11 319.8 MiB 152.6 MiB 1 df = pd.DataFrame(np.random.random((20000, 1000)))
12 320.3 MiB 0.4 MiB 1 df["partitions"] = (df[0] * 10000).astype(int)
13 633.2 MiB 313.0 MiB 1 _, groups = zip(*df.groupby("partitions"))
14 480.7 MiB -152.6 MiB 1 del df
15
16 673.2 MiB -103.4 MiB 8631 groups = [pickle.dumps(group) for group in groups]
17 812.1 MiB -42.9 MiB 8631 groups = [pickle.loads(group) for group in groups]
18 248.5 MiB -563.6 MiB 1 del groups
I also tried a different approach, pickling the whole tuple of groups at once:
groups = pickle.dumps(groups)
groups = pickle.loads(groups)
Line # Mem usage Increment Occurrences Line Contents
=============================================================
9 167.4 MiB 167.4 MiB 1 @profile
10 def test1():
11 320.1 MiB 152.7 MiB 1 df = pd.DataFrame(np.random.random((20000, 1000)))
12 320.6 MiB 0.4 MiB 1 df["partitions"] = (df[0] * 10000).astype(int)
13 634.2 MiB 313.6 MiB 1 _, groups = zip(*df.groupby("partitions"))
14 481.6 MiB -152.6 MiB 1 del df
15
16 354.1 MiB -127.4 MiB 1 groups = pickle.dumps(groups)
17 363.4 MiB 9.2 MiB 1 groups = pickle.loads(groups)
18 197.6 MiB -165.7 MiB 1 del groups
joblib library results:
@profile
def test3():
    df = pd.DataFrame(np.random.random((20000, 1000)))
    df["partitions"] = (df[0] * 10000).astype(int)
    _, groups = zip(*df.groupby("partitions"))
    del df

    bytes_container = BytesIO()
    groups = joblib.dump(groups, bytes_container)
    bytes_container.seek(0)
    groups = bytes_container.read()
    del groups
Line # Mem usage Increment Occurrences Line Contents
=============================================================
31 167.7 MiB 167.7 MiB 1 @profile
32 def test3():
33 320.3 MiB 152.6 MiB 1 df = pd.DataFrame(np.random.random((20000, 1000)))
34 320.8 MiB 0.5 MiB 1 df["partitions"] = (df[0] * 10000).astype(int)
35 634.2 MiB 313.4 MiB 1 _, groups = zip(*df.groupby("partitions"))
36 481.6 MiB -152.6 MiB 1 del df
37
38 481.6 MiB 0.0 MiB 1 bytes_container = BytesIO()
39 352.5 MiB -129.1 MiB 1 groups = joblib.dump(groups, bytes_container)
40 352.5 MiB 0.0 MiB 1 bytes_container.seek(0)
41 507.4 MiB 154.9 MiB 1 groups = bytes_container.read()
42 352.5 MiB -154.9 MiB 1 del groups
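Note that test3 as written never actually deserializes: bytes_container.read() returns the raw serialized bytes, not DataFrames. A full joblib round trip would look something like this sketch (groups as in test3 above):

from io import BytesIO

import joblib

bytes_container = BytesIO()
joblib.dump(groups, bytes_container)   # serialize into the in-memory buffer
bytes_container.seek(0)
groups = joblib.load(bytes_container)  # deserialize back into DataFrames
del groups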
dill library results:
@profile
def test2():
    df = pd.DataFrame(np.random.random((20000, 1000)))
    df["partitions"] = (df[0] * 10000).astype(int)
    _, groups = zip(*df.groupby("partitions"))
    del df

    groups = dill.dumps(groups)
    groups = dill.loads(groups)
    del groups
Line # Mem usage Increment Occurrences Line Contents
=============================================================
20 167.2 MiB 167.2 MiB 1 @profile
21 def test2():
22 319.8 MiB 152.6 MiB 1 df = pd.DataFrame(np.random.random((20000, 1000)))
23 320.2 MiB 0.4 MiB 1 df["partitions"] = (df[0] * 10000).astype(int)
24 632.8 MiB 312.6 MiB 1 _, groups = zip(*df.groupby("partitions"))
25 480.3 MiB -152.6 MiB 1 del df
26
27 358.0 MiB -122.2 MiB 1 groups = dill.dumps(groups)
28 369.6 MiB 11.6 MiB 1 groups = dill.loads(groups)
29 207.1 MiB -162.6 MiB 1 del groups
I think this is not about which library we use. Python holds on to extra memory for its own processing, and gc.collect() is not a reliable way to force memory back to the OS. Please correct me if I took the wrong approach.
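One way to test that theory is to compare what Python itself believes is allocated against the process RSS: if tracemalloc reports almost nothing live while RSS stays high, the memory has been freed at the Python level and is merely being retained by the allocator. A minimal sketch (psutil assumed available, as above):

import gc
import os
import tracemalloc

import psutil


def report(label):
    # Python-level live allocations (as tracked by tracemalloc) vs. process RSS.
    current, _peak = tracemalloc.get_traced_memory()
    rss = psutil.Process(os.getpid()).memory_info().rss
    print(f"{label}: traced {current / 2**20:.1f} MiB, RSS {rss / 2**20:.1f} MiB")


tracemalloc.start()
# ... run the pickle round trip from the scripts above ...
gc.collect()
report("after del groups + gc.collect()")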