modin
modin copied to clipboard
PERF-#6609: HDK: to_pandas(): Cache pandas DataFrame
What do these changes do?
- [x] first commit message and PR title follow format outlined here
NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.
- [x] passes
flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py - [x] passes
black --check modin/ asv_bench/benchmarks scripts/doc_checker.py - [x] signed commit with
git commit -s - [x] Resolves #6609
- [ ] tests added and passing
- [ ] module layout described at
docs/development/architecture.rstis up-to-date
Adding a cache to this method could significantly improve performance of the methods, that are defaulting to pandas.
but it also doubles the memory consumption, doesn't it?
but it also doubles the memory consumption, doesn't it?
I think, it depends on the dataset. Some data could be shared between HDK, Arrow and Pandas. Form the other side, if we do no cache the pandas frame, we may have multiple copies of identical frames in the memory.
Here is a simple test, demonstrating the memory usage:
import psutil
import modin.pandas as pd
df = pd.DataFrame({"a": range(100000000)})
df = df.dropna() # Ensure the table is imported to HDK
mem0 = psutil.virtual_memory().used
print(f"{mem0}")
pdf1 = df._to_pandas()
mem1 = psutil.virtual_memory().used
print(f"{mem1}: + {mem1 - mem0}")
pdf2 = df._to_pandas()
mem2 = psutil.virtual_memory().used
print(f"{mem2}: + {mem2 - mem1}")
pdf3 = df._to_pandas()
mem3 = psutil.virtual_memory().used
print(f"{mem3}: + {mem3 - mem2}")
Output on the master branch:
10649378816
14785376256: + 4135997440
16392810496: + 1607434240
17996320768: + 1603510272
Output on this branch
10598752256
14746906624: + 4148154368
14746906624: + 0
14746906624: + 0
but it also doubles the memory consumption, doesn't it?
I think, it depends on the dataset. Some data could be shared between HDK, Arrow and Pandas. Form the other side, if we do no cache the pandas frame, we may have multiple copies of identical frames in the memory.
Here is a simple test, demonstrating the memory usage:
But note, that in real life we don't usually keep references on pandas dfs once the default-to-pandas operation is done, so to make this scenario more realistic we should delete pdf after each measurement:
import psutil
import modin.pandas as pd
df = pd.DataFrame({"a": range(100000000)})
df = df.dropna() # Ensure the table is imported to HDK
mem0 = psutil.virtual_memory().used
print(f"{mem0}")
pdf1 = df._to_pandas()
mem1 = psutil.virtual_memory().used
print(f"{mem1}: + {mem1 - mem0}")
del pdf1
pdf2 = df._to_pandas()
mem2 = psutil.virtual_memory().used
print(f"{mem2}: + {mem2 - mem1}")
del pdf2
pdf3 = df._to_pandas()
mem3 = psutil.virtual_memory().used
print(f"{mem3}: + {mem3 - mem2}")
del pdf3
Then on master I get:
8689057792
12090052608: + 3400994816
12083097600: + -6955008
11347464192: + -735633408
(the memory consumption decreases over the calls?)
And for your branch it's:
8684437504
12864876544: + 4180439040
12864876544: + 0
12864876544: + 0
but it also doubles the memory consumption, doesn't it?
I think, it depends on the dataset. Some data could be shared between HDK, Arrow and Pandas.
Well, if an arrow table with certain data can be converted to pandas by simply sharing its buffer, then shouldn't such conversion be almost free? Do you know columns with what data types can be converted that easy way?
But note, that in real life we don't usually keep references on pandas dfs once the default-to-pandas operation is done
It depends ... For example, in case of an unsupported data, the pandas df will be saved in partitions of the new HDK frame. Also, in this implementation, if an HDK frame is created from a Pandas frame, the Pandas frame is always saved in partitions. It's converted to arrow lazy, only when exporting to HDK.
For example, in case of an unsupported data, the pandas df will be saved in partitions of the new HDK frame.
Right, this is done so we wouldn't do unnecessary .to_pandas() conversions since we know in advance that all operations will default to pandas. However, how is this related to the changes in this PR? How caching pandas df in normal HDK frames could help in this case?
Also, in https://github.com/modin-project/modin/pull/6412 implementation, if an HDK frame is created from a Pandas frame, the Pandas frame is always saved in partitions. It's converted to arrow lazy, only when exporting to HDK.
This optimization is quite good, but again, how is this related to this PR?
how is this related to this PR?
Not related. These are just a few examples of when we do keep references on pandas dfs.
how is this related to this PR?
Not related. These are just a few examples of when we do
keep references on pandas dfs.
I understand that, but in those examples pandas dfs origin not from the .to_pandas() call but from the user, right? My original question was regarding keeping in memory the results of .to_pandas() after a default-to-pandas function is done.
but in those examples pandas dfs origin not from the
.to_pandas()call but from the user, right?
Not necessary. The frame, returned by to_pandas(), is used to build a new modin frame. Here is an example:
import psutil
import pandas as pd
# import modin.pandas as pd
df = pd.DataFrame(range(1000000), columns=pd.MultiIndex.from_tuples([(1,2,3)]))
mem0 = psutil.virtual_memory().used
df2 = df.iloc[:-1]
mem1 = psutil.virtual_memory().used
print(f"{mem1}: + {mem1 - mem0}")
The pandas iloc returns a new frame, that shares the data with the original one, no new memory for the data is allocated. In modin hdk this iloc results in 4 calls to to_pandas(). I.e., we create 4 new pandas frames, 3 of them are garbage collected and the last one is saved in the partitions of the new hdk frame.
https://github.com/modin-project/modin/issues/7234