
Expand the DataFrame constructor to support construction from dictionaries with Modin entities as values

Open prutskov opened this issue 3 years ago • 10 comments

Describe the problem

We should extend the DataFrame constructor so that a Modin DataFrame can be created quickly from a dictionary whose values are Modin Series. Currently, the following warning is emitted:

UserWarning: Distributing <class 'dict'> object. This may take some time.

Reproducer:

from time import time as timer

import numpy as np
# import pandas as pd
import modin.pandas as pd
import ray
ray.init()

nrows = 1_000_000_000
df = pd.DataFrame({"a": np.random.rand(nrows), "b": np.random.rand(nrows)})

t = timer()
df2 = pd.DataFrame({"c": df.a})
print(f'df creation time: {timer() - t} s')

Results on 112 CPUs with the Ray engine:

df creation time: 3.937314748764038 s   # Pandas is used
df creation time: 24.079696655273438 s # Modin is used

prutskov avatar Feb 24 '22 15:02 prutskov

What about cpu count? Engine?

anmyachev avatar Feb 24 '22 15:02 anmyachev

What about cpu count? Engine?

The CPU count has been added to the issue description. The execution engine is shown in the reproducer: Ray.

prutskov avatar Feb 24 '22 16:02 prutskov

Related to https://github.com/modin-project/modin/issues/1572

prutskov avatar Mar 04 '22 13:03 prutskov

Note that since we started defining a runtime environment for Ray, Ray initialization time has to be explicitly excluded from the measurement.

The line I personally use to initialize all the workers is this:

pd.DataFrame(range(cfg.CpuCount.get() * cfg.MinPartitionSize.get())).to_numpy() # init the engine and start all the workers
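
A minimal sketch of keeping that warm-up out of the measurement. The helper name `time_after_warmup` and its arguments are hypothetical and not part of Modin; `warm_up` would be the line above and `work` the operation being benchmarked. Stand-in callables are used so the sketch runs without Modin installed.

```python
from time import perf_counter


def time_after_warmup(warm_up, work):
    """Run warm_up() untimed, then return (result, seconds) for work()."""
    warm_up()  # e.g. the engine-initialization line above; not measured
    start = perf_counter()
    result = work()
    return result, perf_counter() - start


# Stand-in callables (assumptions for illustration only).
calls = []
result, elapsed = time_after_warmup(
    warm_up=lambda: calls.append("warm-up"),
    work=lambda: sum(range(1_000)),
)
print(f"result={result}, elapsed={elapsed:.6f} s")
```

If the engine executes work asynchronously, `work` should also force the result (for example by calling `repr` on it), otherwise only the scheduling time is measured.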

vnlitvinov avatar Aug 29 '22 22:08 vnlitvinov

The following snippet:

from time import time as timer

import numpy as np
# import pandas as pd
import modin.pandas as pd
import modin.config as cfg

pd.DataFrame(range(cfg.CpuCount.get() * cfg.MinPartitionSize.get())).to_numpy() # init the engine and start all the workers

nrows = 100_000_000
df = pd.DataFrame({"a": np.random.rand(nrows), "b": np.random.rand(nrows)})

t = timer()
df2 = pd.DataFrame({"c": df.a})
print(f'df creation time: {timer() - t} s')
repr(df2)
print(f'df creation + sync time: {timer() - t} s')

produces the following timings:

df creation time: 1.973231315612793 s
df creation + sync time: 2.2475578784942627 s

on Modin 0.15.2 with MODIN_CPUS=12

vnlitvinov avatar Aug 29 '22 22:08 vnlitvinov

#5193 introduces a fast path only for the case when all of the dictionary values are Modin Series. Thus reopening the issue to indicate that the implementation for other cases is still missing.
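
The condition described here can be sketched as a type check over the dictionary values. This is an illustration only: `can_use_fast_path` and `FakeSeries` are hypothetical names, and the actual check in #5193 may differ.

```python
def can_use_fast_path(data, series_type):
    """Return True only for a non-empty dict whose values are all series_type."""
    return (
        isinstance(data, dict)
        and len(data) > 0
        and all(isinstance(value, series_type) for value in data.values())
    )


class FakeSeries:
    """Hypothetical stand-in for modin.pandas.Series so the sketch runs anywhere."""


print(can_use_fast_path({"c": FakeSeries()}, FakeSeries))          # True
print(can_use_fast_path({"a": 1, "b": FakeSeries()}, FakeSeries))  # False
```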

dchigarev avatar Nov 15 '22 11:11 dchigarev

Thus reopening the issue to indicate that the implementation for other cases is still missing.

@dchigarev what missing cases do you mean?

anmyachev avatar Oct 29 '23 16:10 anmyachev

Thus reopening the issue to indicate that the implementation for other cases is still missing.

@dchigarev what missing cases do you mean?

I meant that the new implementation only works for Ray and only with numerical types [1]

dchigarev avatar Nov 06 '23 08:11 dchigarev

I meant that the new implementation only works for Ray and only with numerical types [1]

@dchigarev you seem to be talking about another pull request :).

#5193 introduces a fast path only for the case when all of the dictionary values are Modin Series. Thus reopening the issue to indicate that the implementation for other cases is still missing.

I am asking about #5193.

anmyachev avatar Nov 21 '23 13:11 anmyachev

Ah, yeah, I misunderstood initially :)

As for the cases missed by #5193, I meant the ones with mixed values in the dictionary, such as Modin Series together with something else:

sr = pd.DataFrame({"a": 1, "b": modin_series}) # defaults to pandas
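
One possible user-side workaround, sketched under assumptions: split the dictionary so that the Series-only part can go through the fast path, then assign the remaining columns afterwards. `split_series_values` and `FakeSeries` are hypothetical names, and whether the follow-up assignments stay distributed in Modin has not been verified here.

```python
def split_series_values(data, series_type):
    """Split a dict into (series-only values, everything else), keeping key order."""
    series_part = {k: v for k, v in data.items() if isinstance(v, series_type)}
    other_part = {k: v for k, v in data.items() if not isinstance(v, series_type)}
    return series_part, other_part


class FakeSeries:
    """Hypothetical stand-in for modin.pandas.Series so the sketch runs anywhere."""


data = {"a": 1, "b": FakeSeries()}
series_part, other_part = split_series_values(data, FakeSeries)
print(list(series_part), list(other_part))  # ['b'] ['a']

# With Modin, the intended follow-up (an assumption, not tested here) would be:
#   df = pd.DataFrame(series_part)   # all values are Series
#   for name, value in other_part.items():
#       df[name] = value             # scalar is broadcast to every row
#   df = df[list(data)]              # restore the original column order
```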

dchigarev avatar Nov 21 '23 15:11 dchigarev