modin icon indicating copy to clipboard operation
modin copied to clipboard

PERF-#6609: HDK: to_pandas(): Cache pandas DataFrame

Open AndreyPavlenko opened this issue 2 years ago • 9 comments
trafficstars

What do these changes do?

  • [x] first commit message and PR title follow format outlined here

    NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.

  • [x] passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
  • [x] passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
  • [x] signed commit with git commit -s
  • [x] Resolves #6609
  • [ ] tests added and passing
  • [ ] module layout described at docs/development/architecture.rst is up-to-date

AndreyPavlenko avatar Sep 27 '23 14:09 AndreyPavlenko

Adding a cache to this method could significantly improve performance of the methods, that are defaulting to pandas.

but it also doubles the memory consumption, doesn't it?

dchigarev avatar Sep 27 '23 16:09 dchigarev

but it also doubles the memory consumption, doesn't it?

I think, it depends on the dataset. Some data could be shared between HDK, Arrow and Pandas. Form the other side, if we do no cache the pandas frame, we may have multiple copies of identical frames in the memory.

Here is a simple test, demonstrating the memory usage:

import psutil
import modin.pandas as pd

df = pd.DataFrame({"a": range(100000000)})
df = df.dropna() # Ensure the table is imported to HDK
mem0 = psutil.virtual_memory().used
print(f"{mem0}")
pdf1 = df._to_pandas()
mem1 = psutil.virtual_memory().used
print(f"{mem1}: + {mem1 - mem0}")
pdf2 = df._to_pandas()
mem2 = psutil.virtual_memory().used
print(f"{mem2}: + {mem2 - mem1}")
pdf3 = df._to_pandas()
mem3 = psutil.virtual_memory().used
print(f"{mem3}: + {mem3 - mem2}")

Output on the master branch:

10649378816
14785376256: + 4135997440
16392810496: + 1607434240
17996320768: + 1603510272

Output on this branch

10598752256
14746906624: + 4148154368
14746906624: + 0
14746906624: + 0

AndreyPavlenko avatar Sep 27 '23 20:09 AndreyPavlenko

but it also doubles the memory consumption, doesn't it?

I think, it depends on the dataset. Some data could be shared between HDK, Arrow and Pandas. Form the other side, if we do no cache the pandas frame, we may have multiple copies of identical frames in the memory.

Here is a simple test, demonstrating the memory usage:

But note, that in real life we don't usually keep references on pandas dfs once the default-to-pandas operation is done, so to make this scenario more realistic we should delete pdf after each measurement:

import psutil
import modin.pandas as pd

df = pd.DataFrame({"a": range(100000000)})
df = df.dropna() # Ensure the table is imported to HDK
mem0 = psutil.virtual_memory().used
print(f"{mem0}")
pdf1 = df._to_pandas()
mem1 = psutil.virtual_memory().used
print(f"{mem1}: + {mem1 - mem0}")
del pdf1

pdf2 = df._to_pandas()
mem2 = psutil.virtual_memory().used
print(f"{mem2}: + {mem2 - mem1}")
del pdf2

pdf3 = df._to_pandas()
mem3 = psutil.virtual_memory().used
print(f"{mem3}: + {mem3 - mem2}")
del pdf3

Then on master I get:

8689057792
12090052608: + 3400994816
12083097600: + -6955008
11347464192: + -735633408
(the memory consumption decreases over the calls?)

And for your branch it's:

8684437504
12864876544: + 4180439040
12864876544: + 0
12864876544: + 0

dchigarev avatar Sep 28 '23 10:09 dchigarev

but it also doubles the memory consumption, doesn't it?

I think, it depends on the dataset. Some data could be shared between HDK, Arrow and Pandas.

Well, if an arrow table with certain data can be converted to pandas by simply sharing its buffer, then shouldn't such conversion be almost free? Do you know columns with what data types can be converted that easy way?

dchigarev avatar Sep 28 '23 10:09 dchigarev

But note, that in real life we don't usually keep references on pandas dfs once the default-to-pandas operation is done

It depends ... For example, in case of an unsupported data, the pandas df will be saved in partitions of the new HDK frame. Also, in this implementation, if an HDK frame is created from a Pandas frame, the Pandas frame is always saved in partitions. It's converted to arrow lazy, only when exporting to HDK.

AndreyPavlenko avatar Sep 28 '23 10:09 AndreyPavlenko

For example, in case of an unsupported data, the pandas df will be saved in partitions of the new HDK frame.

Right, this is done so we wouldn't do unnecessary .to_pandas() conversions since we know in advance that all operations will default to pandas. However, how is this related to the changes in this PR? How caching pandas df in normal HDK frames could help in this case?

Also, in https://github.com/modin-project/modin/pull/6412 implementation, if an HDK frame is created from a Pandas frame, the Pandas frame is always saved in partitions. It's converted to arrow lazy, only when exporting to HDK.

This optimization is quite good, but again, how is this related to this PR?

dchigarev avatar Sep 28 '23 11:09 dchigarev

how is this related to this PR?

Not related. These are just a few examples of when we do keep references on pandas dfs.

AndreyPavlenko avatar Sep 28 '23 11:09 AndreyPavlenko

how is this related to this PR?

Not related. These are just a few examples of when we do keep references on pandas dfs.

I understand that, but in those examples pandas dfs origin not from the .to_pandas() call but from the user, right? My original question was regarding keeping in memory the results of .to_pandas() after a default-to-pandas function is done.

dchigarev avatar Sep 28 '23 11:09 dchigarev

but in those examples pandas dfs origin not from the .to_pandas() call but from the user, right?

Not necessary. The frame, returned by to_pandas(), is used to build a new modin frame. Here is an example:

import psutil
import pandas as pd
# import modin.pandas as pd

df = pd.DataFrame(range(1000000), columns=pd.MultiIndex.from_tuples([(1,2,3)]))
mem0 = psutil.virtual_memory().used
df2 = df.iloc[:-1]
mem1 = psutil.virtual_memory().used
print(f"{mem1}: + {mem1 - mem0}")

The pandas iloc returns a new frame, that shares the data with the original one, no new memory for the data is allocated. In modin hdk this iloc results in 4 calls to to_pandas(). I.e., we create 4 new pandas frames, 3 of them are garbage collected and the last one is saved in the partitions of the new hdk frame.

AndreyPavlenko avatar Sep 28 '23 15:09 AndreyPavlenko

https://github.com/modin-project/modin/issues/7234

anmyachev avatar May 05 '24 16:05 anmyachev