modin
Expand the DataFrame constructor to support construction from dictionaries with Modin entities as values
Describe the problem
We should expand the DataFrame constructor so that a Modin DataFrame can be created quickly from a dictionary whose values are Modin Series. Currently this path emits the following warning:
UserWarning: Distributing <class 'dict'> object. This may take some time.
from time import time as timer
import numpy as np
# import pandas as pd
import modin.pandas as pd
import ray
ray.init()
nrows = 1_000_000_000
df = pd.DataFrame({"a": np.random.rand(nrows), "b": np.random.rand(nrows)})
t = timer()
df2 = pd.DataFrame({"c": df.a})
print(f'df creation time: {timer() - t} s')
The result on 112 CPUs, Ray engine:
df creation time: 3.937314748764038 s # Pandas is used
df creation time: 24.079696655273438 s # Modin is used
What about cpu count? Engine?
Information about the CPU count was added to the PR description. The execution engine is shown in the reproducer: Ray.
Connected with https://github.com/modin-project/modin/issues/1572
Note that since we started defining a runtime environment for Ray, Ray initialization time should be properly excluded from the measurement.
The line I personally use to initialize all the workers is this:
pd.DataFrame(range(cfg.CpuCount.get() * cfg.MinPartitionSize.get())).to_numpy() # init the engine and start all the workers
The following snippet:
from time import time as timer
import numpy as np
# import pandas as pd
import modin.pandas as pd
import modin.config as cfg
pd.DataFrame(range(cfg.CpuCount.get() * cfg.MinPartitionSize.get())).to_numpy() # init the engine and start all the workers
nrows = 100_000_000
df = pd.DataFrame({"a": np.random.rand(nrows), "b": np.random.rand(nrows)})
t = timer()
df2 = pd.DataFrame({"c": df.a})
print(f'df creation time: {timer() - t} s')
repr(df2)
print(f'df creation + sync time: {timer() - t} s')
produces the following timings:
df creation time: 1.973231315612793 s
df creation + sync time: 2.2475578784942627 s
on Modin 0.15.2 with MODIN_CPUS=12
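The two timings above separate the constructor call itself from the point where the result has fully materialized. A minimal sketch of a reusable helper for that pattern, using only the standard library, could look like the following. The names `time_create_and_sync`, `create`, and `sync` are hypothetical and not part of Modin.

```python
from time import perf_counter


def time_create_and_sync(create, sync):
    """Return (result, creation_seconds, creation_plus_sync_seconds).

    `create` builds the object (for example a DataFrame constructor call).
    `sync` forces any pending work on the result to finish (for example `repr`).
    Both callables are supplied by the caller; nothing here is Modin-specific.
    """
    start = perf_counter()
    result = create()
    creation = perf_counter() - start  # time until the constructor returns
    sync(result)
    total = perf_counter() - start  # time until the result is materialized
    return result, creation, total
```

With the reproducer above, this would be called after the warm-up line as `time_create_and_sync(lambda: pd.DataFrame({"c": df.a}), repr)`, so that engine start-up stays outside both measurements.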
#5193 introduces a fast path only for the case where all of the dictionary values are Modin Series. I am therefore reopening the issue to indicate that the implementation for other cases is still missing.
@dchigarev what missing cases do you mean?
I meant that the new implementation only works for Ray and only with numerical types [1]
@dchigarev you seem to be talking about another pull request :).
I am asking about #5193.
Ah, yeah, I misunderstood initially :)
As for the cases missed by #5193, I meant those with mixed values in the dictionary, such as a Modin Series alongside something else:
sr = pd.DataFrame({"a": 1, "b": modin_series}) # defaults to pandas
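The difference between the covered and uncovered cases can be sketched as a small check on the dictionary values. This is an illustration only and not Modin's actual constructor logic. `dict_values_kind` is a hypothetical helper, and `series_type` stands for whichever Series class is in use (for example `modin.pandas.Series`).

```python
def dict_values_kind(data, series_type):
    """Classify a dict by whether its values are instances of `series_type`.

    Returns "all_series" when every value is a Series, "mixed" when only
    some values are, and "no_series" otherwise (including an empty dict).
    """
    flags = [isinstance(value, series_type) for value in data.values()]
    if flags and all(flags):
        return "all_series"
    if any(flags):
        return "mixed"
    return "no_series"
```

As described in this thread, only `"all_series"` inputs take the fast path introduced by #5193, while a `"mixed"` input such as `{"a": 1, "b": modin_series}` defaults to pandas.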