modin
modin copied to clipboard
Does the Fold operator allow a function to change the shape of partitions?
On the Operators Module Description page, the Map operator has a note that "map function should not change the shape of the partitions."
The Fold operator has no such note, but when I try to run an example that changes the shape, it doesn't work:
import pandas
import modin.pandas as pd
modin_df = pd.DataFrame(pandas.DataFrame({"a": range(0, 1_000), "b": range(500, 1_500)}))
from modin.core.storage_formats import PandasQueryCompiler
from modin.core.dataframe.algebra import Fold
PandasQueryCompiler.filter_func = Fold.register(lambda df: df[df.index % 2 == 0])
def filter_modin_dataframe(df):
return df.__constructor__(
query_compiler=df._query_compiler.filter_func(
fold_axis=1,
)
)
pd.DataFrame.filter_dataframe = filter_modin_dataframe
filtered_df = modin_df.filter_dataframe()
print(filtered_df)
fails with IndexError: positional indexers are out-of-bounds.
Is this intentional? If so, what would be the recommended way to do a fold-like operation, applying a function that requires knowledge of an entire axis, and creating a DataFrame from the resulting partitions?
Note that in this example I am only changing the shape along the axis I am folding on, so it's impossible for this to cause an illogical outcome such as some rows having more columns than others. I don't see a reason why the opposite shouldn't be allowed as well, as long as it doesn't result in such an illogical outcome.
By the way, the equivalent functionality in Dask Dataframe is .map_partitions, which because Dask has 1D partitions corresponds to a Modin Fold with fold_axis=1. It says in the Dask docs that "the index and divisions are assumed to remain unchanged" but it doesn't actually stop you from changing them.
Here is an example of how the code above can work with Dask:
if __name__ == "__main__":
from distributed import Client
client = Client(n_workers=2, threads_per_worker=1)
import pandas
import dask.dataframe as dd
dask_df = dd.from_pandas(
pandas.DataFrame({"a": range(0, 1_000), "b": range(500, 1_500)}), npartitions=2
)
filtered_df = dask_df.map_partitions(lambda df: df[df.index % 2 == 0])
print(filtered_df.compute())
I found this workaround, though it uses private, undocumented attributes:
import pandas
import modin.pandas as pd
modin_df = pd.DataFrame(pandas.DataFrame({"a": range(0, 1_000), "b": range(500, 1_500)}))
from modin.core.storage_formats import PandasQueryCompiler
filtered_df = pd.DataFrame(
query_compiler=PandasQueryCompiler(
modin_df._query_compiler._modin_frame.apply_full_axis(
axis=1, func=lambda df: df[df.index % 2 == 0]
)
)
)
print(filtered_df)
Note: I am of course aware that none of this necessary to get only the even-numbered rows of a dataframe, that is just a stand-in for an arbitrary computation that maps DataFrame -> DataFrame with a different shape.
The Fold operator has no such note, but when I try to run an example that changes the shape, it doesn't work:
Hi @zmbc! Good observation.
This doesn't work because we are unconditionally copying the index metadata, relying on the shape of the result being unchanged. If they are not copied, then your example with the Fold operator starts working. https://github.com/modin-project/modin/blob/29d9da056a4e4766314c4ae9ea62b33e0d52e662/modin/core/dataframe/pandas/dataframe/dataframe.py#L2118-L2124
I guess we could implement this operator by adding an additional flag that would control when the metadata should be copied. @dchigarev what do you think?
Ah, I see. Is that the same reason why the Map operator requires that the shape doesn't change?
I do think it could be really useful to have operators that are more flexible (but of course slower), like Map or Fold but allowed to change the shape of each partition. Copying the metadata is a nice optimization when the shape doesn't change, but it doesn't seem like it would be absolutely critical to performance in most cases.
Ah, I see. Is that the same reason why the Map operator requires that the shape doesn't change?
Looks like that.
I do think it could be really useful to have operators that are more flexible (but of course slower), like Map or Fold but allowed to change the shape of each partition. Copying the metadata is a nice optimization when the shape doesn't change, but it doesn't seem like it would be absolutely critical to performance in most cases.
Agree.
It looks like filter operator is suitable here. @anmyachev, do you think we should lift it up to the algebra module?
It looks like filter operator is suitable here. @anmyachev, do you think we should lift it up to the algebra module?
Could you write in more detail? It seems to me that it is possible to implement operators Map and Fold more flexible, as @zmbc wrote about.
PandasQueryCompiler.filter_func = Fold.register(lambda df: df[df.index % 2 == 0]) is a normal filter operator. We have filter method at the Modin Dataframe layer but we don't have it in algebra module so we can add it there. As to Map and Fold operators, I wouldn't like to change their definition since they have already been documented well in different articles/papers.
@YarShev the df[df.index % 2 == 0] was just a toy example. In my actual use case, I want to do an operation that filters and changes values and adds/drops columns and changes dtypes.
As to Map and Fold operators, I wouldn't like to change their definition since they have already been documented well in different articles/papers.
@YarShev regarding Fold operator, at the algebra level we did not mention anywhere that it should preserve the shape of a dataframe (I didn’t find this in Devin’s articles either). Therefore, by allowing it to change shape, we give users flexibility and do not contradict what we have already written.
I guess we could implement this operator by adding an additional flag that would control when the metadata should be copied.
I still think this approach might be acceptable here.
We have a note in the docstring: "The data shape is not changed (length and width of the table).". However, I agree with you and we can expand Fold operator to be more flexible.