modin BUG: groupby().apply() raise numpy ValueError when Series has multi index

Open Pekton opened this issue 7 months ago • 1 comments

Modin version checks

[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest released version of Modin.
[X] I have confirmed this bug exists on the main branch of Modin. (In order to do this you can follow this guide.)

Reproducible Example

import modin.pandas as pd

# Two header rows, so the columns form a multi index
data1 = pd.read_excel('abc.xlsx', header=[0, 1])

def anyFuncB(x):
    # placeholder: do something with x
    return x

def anyFuncA(x):
    # The error is raised here; apply() returns a pd.Series
    x.loc[data1[('col0', 'col1')].apply(anyFuncB)]

data = pd.read_excel('def.xlsx')
data.groupby(by='col0').apply(anyFuncA)

Issue Description

Calling dataframe0.apply(anyFunc0) on its own works fine.

The error appears when dataframe0.groupby().apply(anyFunc0) is used and, inside anyFunc0, another dataframe1 with a multi index runs dataframe1[('col0', 'col1')].apply(anyFunc1). The failing line is in "/usr/local/python3.10/lib/python3.10/site-packages/modin/pandas/series.py", line 713, in apply:

    if result.name == self.index[0]:

It raises ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all().

Here result.name is a tuple with 2 items and self.index[0] is a numpy.int64, so the comparison returns an array of two boolean values instead of a single bool. My temporary fix is to add the following code:

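elif return_type == "Series":
    try:
        if result.name == self.index[0]:
            result.name = None
    except ValueError:
        # The comparison produced an array (tuple name vs. scalar label),
        # so reduce it to a single bool.
        if (result.name == self.index[0]).all():
            result.name = None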

Another solution could be to check whether result.name and self.index[0] are each a single value before comparing them.
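That second idea could look roughly like the sketch below. It is only an illustration, not Modin's actual code: same_label is a hypothetical helper name, and it assumes that a tuple label can never be equal to a scalar label.

```python
import numpy as np


def same_label(a, b):
    """Compare two index/name labels and always return a plain bool.

    Hypothetical helper: assumes a tuple label (multi index) is never
    equal to a scalar label, so no elementwise comparison is attempted.
    """
    # One side is a tuple and the other is not: treat them as different.
    if isinstance(a, tuple) != isinstance(b, tuple):
        return False
    # Both tuples or both scalars: an ordinary comparison is safe here.
    return bool(np.all(a == b))
```

With such a helper, the check in Series.apply would become something like `if same_label(result.name, self.index[0]):`, which cannot raise the ambiguous truth value error for a tuple name.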

Expected Behavior

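The comparison between result.name and self.index[0] should evaluate to a single boolean, so that apply() does not raise when result.name is a tuple.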

Error Logs


Traceback (most recent call last):
  File "/home/ecommerce_production_classification/database.py", line 46, in <module>
    print(data.loc[:5].groupby(by='company_id').apply(lambda x: detect_data(x)))
  File "/usr/local/python3.10/lib/python3.10/site-packages/modin/logging/logger_decorator.py", line 144, in run_and_log
    return obj(*args, **kwargs)
  File "/usr/local/python3.10/lib/python3.10/site-packages/modin/pandas/groupby.py", line 653, in apply
    if not isinstance(apply_res, Series) and apply_res.columns.equals(
  File "/usr/local/python3.10/lib/python3.10/site-packages/modin/pandas/base.py", line 4294, in __getattribute__
    attr = super().__getattribute__(item)
  File "/usr/local/python3.10/lib/python3.10/site-packages/modin/pandas/dataframe.py", line 315, in _get_columns
    return self._query_compiler.columns
  File "/usr/local/python3.10/lib/python3.10/site-packages/modin/core/storage_formats/pandas/query_compiler.py", line 104, in <lambda>
    return lambda self: self._modin_frame.columns
  File "/usr/local/python3.10/lib/python3.10/site-packages/modin/core/dataframe/pandas/dataframe/dataframe.py", line 727, in _get_columns
    columns, column_widths = self._columns_cache.get(return_lengths=True)
  File "/usr/local/python3.10/lib/python3.10/site-packages/modin/core/dataframe/pandas/metadata/index.py", line 194, in get
    index, self._lengths_cache = self._value()
  File "/usr/local/python3.10/lib/python3.10/site-packages/modin/core/dataframe/pandas/metadata/index.py", line 106, in <lambda>
    return lambda: dataframe_obj._compute_axis_labels_and_lengths(axis)
  File "/usr/local/python3.10/lib/python3.10/site-packages/modin/logging/logger_decorator.py", line 144, in run_and_log
    return obj(*args, **kwargs)
  File "/usr/local/python3.10/lib/python3.10/site-packages/modin/core/dataframe/pandas/dataframe/dataframe.py", line 835, in _compute_axis_labels_and_lengths
    new_index, internal_idx = self._partition_mgr_cls.get_indices(axis, partitions)
  File "/usr/local/python3.10/lib/python3.10/site-packages/modin/logging/logger_decorator.py", line 144, in run_and_log
    return obj(*args, **kwargs)
  File "/usr/local/python3.10/lib/python3.10/site-packages/modin/core/dataframe/pandas/partitioning/partition_manager.py", line 1193, in get_indices
    new_idx = cls.get_objects_from_partitions(new_idx)
  File "/usr/local/python3.10/lib/python3.10/site-packages/modin/logging/logger_decorator.py", line 144, in run_and_log
    return obj(*args, **kwargs)
  File "/usr/local/python3.10/lib/python3.10/site-packages/modin/core/dataframe/pandas/partitioning/partition_manager.py", line 1134, in get_objects_from_partitions
    return cls._execution_wrapper.materialize(
  File "/usr/local/python3.10/lib/python3.10/site-packages/modin/core/execution/ray/common/engine_wrapper.py", line 139, in materialize
    return ray.get(obj_id)
  File "/usr/local/python3.10/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/python3.10/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/python3.10/lib/python3.10/site-packages/ray/_private/worker.py", line 2630, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/usr/local/python3.10/lib/python3.10/site-packages/ray/_private/worker.py", line 863, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ValueError): ray::remote_exec_func() (pid=22666, ip=172.29.158.228)
  At least one of the input arguments for this task could not be computed:
ray.exceptions.RayTaskError: ray::_deploy_ray_func() (pid=22664, ip=172.29.158.228)
  File "/usr/local/python3.10/lib/python3.10/site-packages/modin/core/execution/ray/implementations/pandas_on_ray/partitioning/virtual_partition.py", line 335, in _deploy_ray_func
    result = deployer(axis, f_to_deploy, f_args, f_kwargs, *deploy_args, **kwargs)
  File "/usr/local/python3.10/lib/python3.10/site-packages/modin/logging/logger_decorator.py", line 144, in run_and_log
    return obj(*args, **kwargs)
  File "/usr/local/python3.10/lib/python3.10/site-packages/modin/core/dataframe/pandas/partitioning/axis_partition.py", line 462, in deploy_axis_func
    raise err
  File "/usr/local/python3.10/lib/python3.10/site-packages/modin/core/dataframe/pandas/partitioning/axis_partition.py", line 457, in deploy_axis_func
    result = func(dataframe, *f_args, **f_kwargs)
  File "/usr/local/python3.10/lib/python3.10/site-packages/modin/core/dataframe/pandas/dataframe/dataframe.py", line 2078, in _tree_reduce_func
    series_result = func(df, *args, **kwargs)
  File "/usr/local/python3.10/lib/python3.10/site-packages/modin/core/dataframe/pandas/dataframe/dataframe.py", line 4261, in apply_func
    result = operator(df.groupby(by, **kwargs))
  File "/usr/local/python3.10/lib/python3.10/site-packages/modin/core/storage_formats/pandas/query_compiler.py", line 3976, in <lambda>
    operator=lambda grp: agg_func(grp, *agg_args, **agg_kwargs),
  File "/usr/local/python3.10/lib/python3.10/site-packages/modin/core/storage_formats/pandas/query_compiler.py", line 3957, in agg_func
    result = agg_method(grp, original_agg_func, *args, **kwargs)
  File "/usr/local/python3.10/lib/python3.10/site-packages/pandas/core/groupby/groupby.py", line 1824, in apply
    result = self._python_apply_general(f, self._selected_obj)
  File "/usr/local/python3.10/lib/python3.10/site-packages/pandas/core/groupby/groupby.py", line 1885, in _python_apply_general
    values, mutated = self._grouper.apply_groupwise(f, data, self.axis)
  File "/usr/local/python3.10/lib/python3.10/site-packages/pandas/core/groupby/ops.py", line 919, in apply_groupwise
    res = f(group)
  File "/usr/local/python3.10/lib/python3.10/site-packages/modin/utils.py", line 765, in wrapper
    result = func(*args, **kwargs)
  File "/home/ecommerce_production_classification/database.py", line 46, in <lambda>
    print(data.loc[:5].groupby(by='company_id').apply(lambda x: detect_data(x)))
  File "/home/ecommerce_production_classification/database.py", line 21, in detect_data
    return classification(data, _rulesDF)
  File "/home/ecommerce_production_classification/categorization.py", line 227, in classification
    data = categorization(data, rules)
  File "/home/ecommerce_production_classification/categorization.py", line 209, in categorization
    return process(data, rules,  '分类')
  File "/home/ecommerce_production_classification/categorization.py", line 205, in process
    data[rules['赋值'].columns]=pd.DataFrame(data.apply(getCategories, axis=1).to_dict()).T
  File "/usr/local/python3.10/lib/python3.10/site-packages/pandas/core/frame.py", line 10374, in apply
    return op.apply().__finalize__(self, method="apply")
  File "/usr/local/python3.10/lib/python3.10/site-packages/pandas/core/apply.py", line 916, in apply
    return self.apply_standard()
  File "/usr/local/python3.10/lib/python3.10/site-packages/pandas/core/apply.py", line 1063, in apply_standard
    results, res_index = self.apply_series_generator()
  File "/usr/local/python3.10/lib/python3.10/site-packages/pandas/core/apply.py", line 1081, in apply_series_generator
    results[i] = self.func(v, *self.args, **self.kwargs)
  File "/home/ecommerce_production_classification/categorization.py", line 168, in getCategories
    _res = rules.loc[rules[('运算式','运算式')].apply(operationToBool)]
  File "/usr/local/python3.10/lib/python3.10/site-packages/modin/logging/logger_decorator.py", line 144, in run_and_log
    return obj(*args, **kwargs)
  File "/usr/local/python3.10/lib/python3.10/site-packages/modin/pandas/series.py", line 713, in apply
    if result.name == self.index[0]:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
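
The ValueError in the last frame can probably be reproduced with NumPy alone, without Modin or Ray. The snippet below is a minimal sketch under assumptions: it uses an integer tuple as a stand-in for the real two-item name (the actual name in this report is a tuple of strings), and the variable names are made up for illustration.

```python
import numpy as np

name = (1, 2)               # stand-in for a two-item tuple such as result.name
first_label = np.int64(0)   # stand-in for self.index[0]

# tuple.__eq__ does not handle a NumPy scalar, so NumPy's reflected
# comparison runs and compares elementwise, giving an array of two bools.
comparison = name == first_label

try:
    if comparison:          # same pattern as the failing `if` in series.py
        pass
    raised = False
except ValueError:
    # "The truth value of an array with more than one element is ambiguous."
    raised = True

print(type(comparison).__name__, comparison.shape, raised)
```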

Installed Versions

UserWarning: Setuptools is replacing distutils.

INSTALLED VERSIONS

commit : c8bbca8e4e00c681370e3736b2f73bb0352408c3
python : 3.10.12.final.0
python-bits : 64
OS : Linux
OS-release : 3.10.0-1160.108.1.el7.x86_64
Version : #1 SMP Thu Jan 25 16:17:31 UTC 2024
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

Modin dependencies

modin : 0.31.0
ray : 2.30.0
dask : None
distributed : None

pandas dependencies

pandas : 2.2.2
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : 65.5.0
pip : 24.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : 1.4.6
psycopg2 : None
jinja2 : 3.1.4
IPython : 8.26.0
pandas_datareader : None
adbc-driver-postgresql : None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : 2024.6.0
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : 3.1.4
pandas_gbq : None
pyarrow : 16.1.0
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : 2.0.31
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None

Jul 16 '24 03:07 Pekton
