Missing memory profiling in a function (with contrast to `memory_profiler`)
`memory_profiler` and Scalene give different depths of coverage of memory profiling on a simple Pandas example; Scalene gives less information, and I think this is a bug.
The sample code is listed in full below, along with the full output from each profiler. Specifically, look at `get_mean_for_indicator_poor`, where `memory_profiler` identifies line 19 as costing 850MB whilst Scalene identifies nothing. In the `get_mean_for_indicator_better` function both profilers correctly identify line 27 as being expensive.
@emeryberger you may recognise this code as being a variant of https://github.com/pandas-dev/pandas/issues/37139
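For anyone reproducing this, the standard command-line invocations of each tool should give output of the form shown below (nothing unusual is needed):
scalene dataframe_example.py
python -m memory_profiler dataframe_example.py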
Scalene output (using Scalene 1.2.4):
Memory usage: ▁▁▁▁▄▄▇▇▇████████ (max: 2.07GB, growth rate: 80%)
dataframe_example.py: % of time = 100.00% out of 3.70s.
╷ ╷ ╷ ╷ ╷ ╷ ╷ ╷
Line │Time % │Time % │Sys │Mem % │Net │Memory usage │Copy │
│Python │native │% │Python │(MB) │over time / % │(MB/s) │dataframe_example.py
╺━━━━━━━┿━━━━━━━━┿━━━━━━━┿━━━━┿━━━━━━━┿━━━━━━┿━━━━━━━━━━━━━━┿━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╸
1 │ │ │ │ │ │ │ │import time
2 │ 3% │ 16% │ │ 3% │ 42 │▁▁▁▁▁▁ 3% │ │import numpy as np
3 │ 5% │ │ │ 3% │ 17 │▁▁▁▁▁ 4% │ │import pandas as pd
4 │ │ │ │ │ │ │ │
5 │ │ │ │ │ │ │ │# this assumes you have memory_profiler installed
6 │ │ │ │ │ │ │ │# if you want to use "@profile" on a function
7 │ │ │ │ │ │ │ │# if not, we can ignore it with a pass-through decorator
8 │ │ │ │ │ │ │ │if 'profile' not in dir():
9 │ │ │ │ │ │ │ │ def profile(fn):
10 │ │ │ │ │ │ │ │ return fn
11 │ │ │ │ │ │ │ │
12 │ │ │ │ │ │ │ │SIZE = 10_000_000
13 │ │ │ │ │ │ │ │
14 │ │ │ │ │ │ │ │
15 │ │ │ │ │ │ │ │@profile
16 │ │ │ │ │ │ │ │def get_mean_for_indicator_poor(df, indicator):
17 │ │ │ │ │ │ │ │ # poor way to use a groupby here, causes big allocation
18 │ │ │ │ │ │ │ │ gpby = df.groupby('indicator')
19 │ │ 13% │ │ │ │ │ │ means = gpby.mean() # means by column
20 │ │ │ │ │ │ │ │ means_for_ind = means.loc[indicator]
21 │ │ │ │ │ │ │ │ total = means_for_ind.sum()
22 │ │ │ │ │ │ │ │ return total
23 │ │ │ │ │ │ │ │
24 │ │ │ │ │ │ │ │@profile
25 │ │ │ │ │ │ │ │def get_mean_for_indicator_better(df, indicator, rnd_cols):
26 │ │ │ │ │ │ │ │ # more memory efficient and faster way to solve this challenge
27 │ 3% │ 12% │ │ │ 240 │▁▁▁▁▁▁▁▁▁ 11% │ │ df_sub = df.query('indicator==@indicator')[rnd_cols]
28 │ │ │ │ │ │ │ │ means_for_ind = df_sub.mean() # means by column
29 │ │ │ │ │ │ │ │ total = means_for_ind.sum() # sum of rows
30 │ │ │ │ │ │ │ │ return total
31 │ │ │ │ │ │ │ │
32 │ │ │ │ │ │ │ │
33 │ │ │ │ │ │ │ │@profile
34 │ │ │ │ │ │ │ │def run():
35 │ │ 21% │ │ │ 763 │▂ 35% │ │ arr = np.random.random((SIZE, 10))
36 │ │ │ │ │ │ │ │ print(f"{arr.shape} shape for our array")
37 │ │ │ │ │ │ │ │ df = pd.DataFrame(arr)
38 │ │ │ │ │ │ │ │ rnd_cols = [f"c_{n}" for n in df.columns]
39 │ │ │ │ │ │ │ │ df.columns = rnd_cols
40 │ │ │ │ │ │ │ │
41 │ │ │ │ │ │ │ │ # make a big dataframe with an indicator column and lots of random data
42 │ │ 5% │ │ │ 151 │▁▁▁▁ 7% │ │ df2 = pd.DataFrame({'indicator' : np.random.randint(0, 10, SIZE)})
43 │ │ │ │ │ │ │ │ # deliberately overwrite the first df
44 │ │ 18% │ │ │ 839 │▂▂▂▃ 39% │ │ df = pd.concat((df2, df), axis=1) # PART OF DEMO - unexpected copy=True forces an expensive copy
45 │ │ │ │ │ │ │ │ print("Head of our df:")
46 │ │ │ │ │ 0 │▁▁▁▁▁▁ │ │ print(df.head())
47 │ │ │ │ │ │ │ │
48 │ │ │ │ │ │ │ │ print("Print results to check that we get the result")
49 │ │ │ │ │ │ │ │ indicator = 2
50 │ │ │ │ │ │ │ │ print(f"Mean for indicator {indicator} on better implementation {get_mean_for_indicator_better(df, indicator, rnd_cols):0.5
51 │ │ │ │ │ │ │ │ print(f"Mean for indicator {indicator} on poor implementation: {get_mean_for_indicator_poor(df, indicator):0.5f}")
52 │ │ │ │ │ │ │ │
53 │ │ │ │ │ │ │ │
54 │ │ │ │ │ │ │ │if __name__ == "__main__":
55 │ │ │ │ │ │ │ │ run()
│ │ │ │ │ │ │ │
╶───────┼────────┼───────┼────┼───────┼──────┼──────────────┼───────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╴
│ │ │ │ │ │ │ │function summary
1386 │ 8% │ 16% │ │ 6% │ 59 │▂▂▂▃▃▁ 7% │ │profile_code
33 │ 2% │ 44% │ │ │ 1754 │██████ 82% │ │run
24 │ 3% │ 13% │ │ │ 240 │█████████ 11% │ │get_mean_for_indicator_better
│ │ 13% │ │ │ │ │ │get_mean_for_indicator_poor
╵ ╵ ╵ ╵ ╵ ╵ ╵ ╵
Top net memory consumption, by line:
(1) 44: 839 MB
(2) 35: 763 MB
(3) 27: 240 MB
(4) 42: 151 MB
(5) 2: 42 MB
`memory_profiler` output:
Line # Mem usage Increment Occurences Line Contents
============================================================
15 1834.512 MiB 1834.512 MiB 1 @profile
16 def get_mean_for_indicator_poor(df, indicator):
17 # poor way to use a groupby here, causes big allocation
18 1834.512 MiB 0.000 MiB 1 gpby = df.groupby('indicator')
19 2682.523 MiB 848.012 MiB 1 means = gpby.mean() # means by column
20 2682.523 MiB 0.000 MiB 1 means_for_ind = means.loc[indicator]
21 2682.523 MiB 0.000 MiB 1 total = means_for_ind.sum()
22 2682.523 MiB 0.000 MiB 1 return total
Filename: dataframe_example.py
Line # Mem usage Increment Occurences Line Contents
============================================================
24 1757.125 MiB 1757.125 MiB 1 @profile
25 def get_mean_for_indicator_better(df, indicator, rnd_cols):
26 # more memory efficient and faster way to solve this challenge
27 1918.480 MiB 161.355 MiB 1 df_sub = df.query('indicator==@indicator')[rnd_cols]
28 1918.480 MiB 0.000 MiB 1 means_for_ind = df_sub.mean() # means by column
29 1918.480 MiB 0.000 MiB 1 total = means_for_ind.sum() # sum of rows
30 1918.480 MiB 0.000 MiB 1 return total
Filename: dataframe_example.py
Line # Mem usage Increment Occurences Line Contents
============================================================
33 77.156 MiB 77.156 MiB 1 @profile
34 def run():
35 840.086 MiB 762.930 MiB 1 arr = np.random.random((SIZE, 10))
36 840.086 MiB 0.000 MiB 1 print(f"{arr.shape} shape for our array")
37 840.086 MiB 0.000 MiB 1 df = pd.DataFrame(arr)
38 840.086 MiB 0.000 MiB 13 rnd_cols = [f"c_{n}" for n in df.columns]
39 840.086 MiB 0.000 MiB 1 df.columns = rnd_cols
40
41 # make a big dataframe with an indicator column and lots of random data
42 916.801 MiB 76.715 MiB 1 df2 = pd.DataFrame({'indicator' : np.random.randint(0, 10, SIZE)})
43 # deliberately overwrite the first df
44 1756.141 MiB 839.340 MiB 1 df = pd.concat((df2, df), axis=1) # PART OF DEMO - unexpected copy=True forces an expensive copy
45 1756.141 MiB 0.000 MiB 1 print("Head of our df:")
46 1757.125 MiB 0.984 MiB 1 print(df.head())
47
48 1757.125 MiB 0.000 MiB 1 print("Print results to check that we get the result")
49 1757.125 MiB 0.000 MiB 1 indicator = 2
50 1834.512 MiB 1834.512 MiB 1 print(f"Mean for indicator {indicator} on better implementation {get_mean_for_indicator_better(df, indicator, rnd_cols):0.5f}")
51 1843.402 MiB 1843.402 MiB 1 print(f"Mean for indicator {indicator} on poor implementation: {get_mean_for_indicator_poor(df, indicator):0.5f}")
I'm using Linux Mint 20.1 (Cinnamon):
$ inxi
CPU: Quad Core Intel Core i7-6700HQ (-MT MCP-) speed/min/max: 1412/800/2600 MHz Kernel: 5.4.0-66-generic x86_64
Up: 19d 21h 34m Mem: 8970.6/31876.3 MiB (28.1%) Storage: 953.87 GiB (50.7% used) Procs: 320 Shell: bash 5.0.17 inxi: 3.0.38
Full source:
import time
import numpy as np
import pandas as pd

# this assumes you have memory_profiler installed
# if you want to use "@profile" on a function
# if not, we can ignore it with a pass-through decorator
if 'profile' not in dir():
    def profile(fn):
        return fn

SIZE = 10_000_000


@profile
def get_mean_for_indicator_poor(df, indicator):
    # poor way to use a groupby here, causes big allocation
    gpby = df.groupby('indicator')
    means = gpby.mean() # means by column
    means_for_ind = means.loc[indicator]
    total = means_for_ind.sum()
    return total

@profile
def get_mean_for_indicator_better(df, indicator, rnd_cols):
    # more memory efficient and faster way to solve this challenge
    df_sub = df.query('indicator==@indicator')[rnd_cols]
    means_for_ind = df_sub.mean() # means by column
    total = means_for_ind.sum() # sum of rows
    return total


@profile
def run():
    arr = np.random.random((SIZE, 10))
    print(f"{arr.shape} shape for our array")
    df = pd.DataFrame(arr)
    rnd_cols = [f"c_{n}" for n in df.columns]
    df.columns = rnd_cols

    # make a big dataframe with an indicator column and lots of random data
    df2 = pd.DataFrame({'indicator' : np.random.randint(0, 10, SIZE)})
    # deliberately overwrite the first df
    df = pd.concat((df2, df), axis=1) # PART OF DEMO - unexpected copy=True forces an expensive copy
    print("Head of our df:")
    print(df.head())

    print("Print results to check that we get the result")
    indicator = 2
    print(f"Mean for indicator {indicator} on better implementation {get_mean_for_indicator_better(df, indicator, rnd_cols):0.5f}")
    print(f"Mean for indicator {indicator} on poor implementation: {get_mean_for_indicator_poor(df, indicator):0.5f}")


if __name__ == "__main__":
    run()
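(As an aside on the comment at line 44: `pd.concat` in the pandas 1.x API defaults to `copy=True`, which is the copy that comment refers to. A hypothetical variant, which I have not profiled here, would be:)
df = pd.concat((df2, df), axis=1, copy=False)  # ask pandas not to copy; whether it can avoid it depends on the block layout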
Can you test with the repo version? (Just verified that it works for me; we're planning on releasing this soon as 1.3.1.)
python3 -m pip install -U git+https://github.com/plasma-umass/scalene
should do the trick.
...and now in version 1.3.1, on pip.
Since I believe this is fixed, closing. @ianozsvald please re-open if it's not fixed on your end!
I've upgraded to 1.3.2. I note that on each run I get a different result now - perhaps this is a consequence of using a sampling profiler? Notably:
15 │ │ │ │ │ │ │ │ │@profile
16 │ │ │ │ │ │ │ │ │def get_mean_for_indicator_poor(df, indicator):
17 │ │ │ │ │ │ │ │ │ gpby = df.groupby('indicator')
18 │ 2% │ 10% │ │ │ │829.6M │▃▃▃▃▃▃▃ 29% │ │ means = gpby.mean() # means by column
19 │ │ │ │ │ │ │ │ │ means_for_ind = means.loc[indicator]
20 │ │ │ │ │ │ │ │ │ total = means_for_ind.sum()
21 │ │ │ │ │ │ │ │ │ return total
22 │ │ │ │ │ │ │ │ │
23 │ │ │ │ │ │ │ │ │@profile
24 │ │ │ │ │ │ │ │ │def get_mean_for_indicator_better(df, indicator, rnd_cols):
25 │ 2% │ 9% │ │ │ │416.7M │▁▁▁▁▁▁▁▁ 13% │ │ df_sub = df.query('indicator==@indicator')[rnd_cols]
26 │ │ │ │ │ │ │ │ │ means_for_ind = df_sub.mean() # means by column
27 │ │ │ │ │ │ │ │ │ total = means_for_ind.sum() # sum of rows
28 │ │ │ │ │ │ │ │ │ return total
and
15 │ │ │ │ │ │ │ │ │@profile
16 │ │ │ │ │ │ │ │ │def get_mean_for_indicator_poor(df, indicator):
17 │ │ │ │ │ │ │ │ │ gpby = df.groupby('indicator')
18 │ │ 11% │ │ │ │498.7M │▃▃▃▃▃▃▂▂ 29% │ │ means = gpby.mean() # means by column
19 │ │ │ │ │ │ │ │ │ means_for_ind = means.loc[indicator]
20 │ │ │ │ │ │ │ │ │ total = means_for_ind.sum()
21 │ │ │ │ │ │ │ │ │ return total
22 │ │ │ │ │ │ │ │ │
23 │ │ │ │ │ │ │ │ │@profile
24 │ │ │ │ │ │ │ │ │def get_mean_for_indicator_better(df, indicator, rnd_cols):
25 │ 2% │ 9% │ │ │ 1% │417.5M │▁▁▁▁ 13% │ │ df_sub = df.query('indicator==@indicator')[rnd_cols]
26 │ │ │ │ │ │ │ │ │ means_for_ind = df_sub.mean() # means by column
27 │ │ │ │ │ │ │ │ │ total = means_for_ind.sum() # sum of rows
28 │ │ │ │ │ │ │ │ │ return total
I'm using the same code as before. `get_mean_for_indicator_better` seems stable: on 5 runs I get the same answer. `get_mean_for_indicator_poor` is more variable (I recorded 498M, 829M, 617M, 617M, 948M on 5 consecutive runs). The `Memory usage: .... (max: XXX)` output for the 5 runs was pretty similar, at circa 2.95GB-3.02GB.
Could you comment on the reason for the variability? If it is due to sampling, is there a way to make the sampling occur more frequently? I'm asking partly out of academic interest (when I'm teaching, as I have done using Scalene recently) and partly because this sort of variability when diagnosing Pandas would hamper folks' efforts to figure out what the heck Pandas is doing :-)
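To illustrate what I mean, here's a toy sketch of threshold-based allocation sampling (purely my assumption about how a sampling memory profiler might attribute allocations to lines, not Scalene's actual implementation); it shows why per-line attribution can differ between runs even when the total allocation is stable:
import random

def attribute_by_sampling(events, threshold_mb=10.0):
    # Toy model: allocations accumulate silently, and the whole accumulated
    # amount is attributed to whichever line happens to cross the threshold.
    per_line = {}
    pending = 0.0
    for line_no, size_mb in events:
        pending += size_mb
        if pending >= threshold_mb:
            per_line[line_no] = per_line.get(line_no, 0.0) + pending
            pending = 0.0
    return per_line

# Same total work each run, but allocation sizes and ordering vary a little
# (as they do inside Pandas' groupby), so the per-line totals jump around.
for run_no in range(2):
    events = [(18, random.uniform(0.5, 4.0)) for _ in range(300)]
    events += [(25, random.uniform(0.5, 4.0)) for _ in range(150)]
    random.shuffle(events)
    print(run_no, {k: round(v) for k, v in attribute_by_sampling(events).items()})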
@ianozsvald thanks for the report - we are looking into it!
Belatedly: I believe this has stabilized. @ianozsvald can you give it another try? Right now, there is no way to make sampling occur more frequently but this is something we can look into if the current status isn't quite there. Thanks!
Please install from the repo version for now: pip install -U git+https://github.com/plasma-umass/scalene
- thanks!
Sorry for the delay; I'm testing in prep for my next Higher Performance Python class (frustratingly on Windows for a pension fund, so I only get a slide demo of Scalene this time).
Using Scalene 1.3.6 the problematic inconsistency has gone away; the results across 5 runs now look consistent.
`memory_profiler` still has different numbers:
18 2683.125 MiB 847.988 MiB 1 means = gpby.mean() # means by column
...
25 1918.949 MiB 162.168 MiB 1 df_sub = df.query('indicator==@indicator')[rnd_cols]
I've renamed the functions (variant 1 & 2 rather than better & poor) but otherwise that part hasn't changed.
What has changed is that I now also delete a column: Scalene records nothing here (!) but `memory_profiler` records a 680MB cost (it looks as though the block manager is duplicating columns internally and freeing them later, dependent on how you delete columns and on the state of the block manager... non-trivial stuff):
# Scalene
51 │ │ │ │ │ │ │ │ │ # now start some clean-up pending some further (not yet written) work
52 │ │ │ │ │ │ │ │ │ print("Prior to cleaning-up we have:", df.info())
53 │ │ │ │ │ │ │ │ │ del df['c_0'] # we don't actually need this column so we'll save some RAM
54 │ │ 13% │ │ │ │ │ │ 168 │ print("After cleaning up we have:", df.info())
# memory_profiler
51 # now start some clean-up pending some further (not yet written) work
52 1844.055 MiB 0.000 MiB 1 print("Prior to cleaning-up we have:", df.info())
53 2530.629 MiB 686.574 MiB 1 del df['c_0'] # we don't actually need this column so we'll save some RAM
54 2530.629 MiB 0.000 MiB 1 print("After cleaning up we have:", df.info())
There are reports in the Pandas bug tracker that use `memory_profiler` to identify this sort of issue... I wonder if we can confirm it with Scalene, or show that something else is actually happening? `del df['c_0']` and `df.drop(columns=['c_0'], inplace=True)` do the same job but have different memory costs.
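To make that comparison concrete, here is a minimal sketch of the experiment I have in mind (hypothetical helper names, reusing the pass-through @profile trick from the script above so it can be run under either profiler):
import numpy as np
import pandas as pd

if 'profile' not in dir():
    def profile(fn):
        return fn

SIZE = 10_000_000

def make_df():
    # same shape of data as the main script: 10M rows x 10 random columns
    df = pd.DataFrame(np.random.random((SIZE, 10)))
    df.columns = [f"c_{n}" for n in df.columns]
    return df

@profile
def remove_column_with_del(df):
    del df['c_0']  # removal via __delitem__ on the block manager
    return df

@profile
def remove_column_with_drop(df):
    df.drop(columns=['c_0'], inplace=True)  # nominally in-place drop
    return df

if __name__ == "__main__":
    remove_column_with_del(make_df())
    remove_column_with_drop(make_df())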