modin icon indicating copy to clipboard operation
modin copied to clipboard

Does the Fold operator allow a function to change the shape of partitions?

Open zmbc opened this issue 2 years ago • 10 comments

On the Operators Module Description page, the Map operator has a note that "map function should not change the shape of the partitions."

The Fold operator has no such note, but when I try to run an example that changes the shape, it doesn't work:

import pandas
import modin.pandas as pd

modin_df = pd.DataFrame(pandas.DataFrame({"a": range(0, 1_000), "b": range(500, 1_500)}))

from modin.core.storage_formats import PandasQueryCompiler
from modin.core.dataframe.algebra import Fold

PandasQueryCompiler.filter_func = Fold.register(lambda df: df[df.index % 2 == 0])


def filter_modin_dataframe(df):
    return df.__constructor__(
        query_compiler=df._query_compiler.filter_func(
            fold_axis=1,
        )
    )


pd.DataFrame.filter_dataframe = filter_modin_dataframe

filtered_df = modin_df.filter_dataframe()

print(filtered_df)

fails with IndexError: positional indexers are out-of-bounds.

Is this intentional? If so, what would be the recommended way to do a fold-like operation, applying a function that requires knowledge of an entire axis, and creating a DataFrame from the resulting partitions?

Note that in this example I am only changing the shape along the axis I am folding on, so it's impossible for this to cause an illogical outcome such as some rows having more columns than others. I don't see a reason why the opposite shouldn't be allowed as well, as long as it doesn't result in such an illogical outcome.

zmbc avatar Aug 22 '23 20:08 zmbc

By the way, the equivalent functionality in Dask Dataframe is .map_partitions, which because Dask has 1D partitions corresponds to a Modin Fold with fold_axis=1. It says in the Dask docs that "the index and divisions are assumed to remain unchanged" but it doesn't actually stop you from changing them.

Here is an example of how the code above can work with Dask:

if __name__ == "__main__":
    from distributed import Client

    client = Client(n_workers=2, threads_per_worker=1)

    import pandas
    import dask.dataframe as dd

    dask_df = dd.from_pandas(
        pandas.DataFrame({"a": range(0, 1_000), "b": range(500, 1_500)}), npartitions=2
    )

    filtered_df = dask_df.map_partitions(lambda df: df[df.index % 2 == 0])

    print(filtered_df.compute())

zmbc avatar Aug 22 '23 21:08 zmbc

I found this workaround, though it uses private, undocumented attributes:

import pandas
import modin.pandas as pd

modin_df = pd.DataFrame(pandas.DataFrame({"a": range(0, 1_000), "b": range(500, 1_500)}))

from modin.core.storage_formats import PandasQueryCompiler

filtered_df = pd.DataFrame(
    query_compiler=PandasQueryCompiler(
        modin_df._query_compiler._modin_frame.apply_full_axis(
            axis=1, func=lambda df: df[df.index % 2 == 0]
        )
    )
)

print(filtered_df)

Note: I am of course aware that none of this necessary to get only the even-numbered rows of a dataframe, that is just a stand-in for an arbitrary computation that maps DataFrame -> DataFrame with a different shape.

zmbc avatar Aug 22 '23 22:08 zmbc

The Fold operator has no such note, but when I try to run an example that changes the shape, it doesn't work:

Hi @zmbc! Good observation.

This doesn't work because we are unconditionally copying the index metadata, relying on the shape of the result being unchanged. If they are not copied, then your example with the Fold operator starts working. https://github.com/modin-project/modin/blob/29d9da056a4e4766314c4ae9ea62b33e0d52e662/modin/core/dataframe/pandas/dataframe/dataframe.py#L2118-L2124

I guess we could implement this operator by adding an additional flag that would control when the metadata should be copied. @dchigarev what do you think?

anmyachev avatar Aug 23 '23 10:08 anmyachev

Ah, I see. Is that the same reason why the Map operator requires that the shape doesn't change?

I do think it could be really useful to have operators that are more flexible (but of course slower), like Map or Fold but allowed to change the shape of each partition. Copying the metadata is a nice optimization when the shape doesn't change, but it doesn't seem like it would be absolutely critical to performance in most cases.

zmbc avatar Aug 23 '23 15:08 zmbc

Ah, I see. Is that the same reason why the Map operator requires that the shape doesn't change?

Looks like that.

I do think it could be really useful to have operators that are more flexible (but of course slower), like Map or Fold but allowed to change the shape of each partition. Copying the metadata is a nice optimization when the shape doesn't change, but it doesn't seem like it would be absolutely critical to performance in most cases.

Agree.

anmyachev avatar Aug 24 '23 12:08 anmyachev

It looks like filter operator is suitable here. @anmyachev, do you think we should lift it up to the algebra module?

YarShev avatar Apr 24 '24 14:04 YarShev

It looks like filter operator is suitable here. @anmyachev, do you think we should lift it up to the algebra module?

Could you write in more detail? It seems to me that it is possible to implement operators Map and Fold more flexible, as @zmbc wrote about.

anmyachev avatar Apr 25 '24 12:04 anmyachev

PandasQueryCompiler.filter_func = Fold.register(lambda df: df[df.index % 2 == 0]) is a normal filter operator. We have filter method at the Modin Dataframe layer but we don't have it in algebra module so we can add it there. As to Map and Fold operators, I wouldn't like to change their definition since they have already been documented well in different articles/papers.

YarShev avatar Apr 25 '24 22:04 YarShev

@YarShev the df[df.index % 2 == 0] was just a toy example. In my actual use case, I want to do an operation that filters and changes values and adds/drops columns and changes dtypes.

zmbc avatar May 06 '24 16:05 zmbc

As to Map and Fold operators, I wouldn't like to change their definition since they have already been documented well in different articles/papers.

@YarShev regarding Fold operator, at the algebra level we did not mention anywhere that it should preserve the shape of a dataframe (I didn’t find this in Devin’s articles either). Therefore, by allowing it to change shape, we give users flexibility and do not contradict what we have already written.

I guess we could implement this operator by adding an additional flag that would control when the metadata should be copied.

I still think this approach might be acceptable here.

anmyachev avatar May 07 '24 09:05 anmyachev

We have a note in the docstring: "The data shape is not changed (length and width of the table).". However, I agree with you and we can expand Fold operator to be more flexible.

YarShev avatar May 12 '24 20:05 YarShev