
BUG: groupby().apply() raises NumPy ValueError when Series name is a MultiIndex tuple

Open Pekton opened this issue 7 months ago • 1 comments

Modin version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest released version of Modin.

  • [X] I have confirmed this bug exists on the main branch of Modin. (In order to do this you can follow this guide.)

Reproducible Example

import modin.pandas as pd

# Table with a two-level column header (MultiIndex columns)
data1 = pd.read_excel('abc.xlsx', header=[0, 1])

def anyFuncB(x):
    # Placeholder for any element-wise function
    return x

def anyFuncA(x):
    # The error is raised here: Series.apply() returns a pd.Series
    return x.loc[data1[('col0', 'col1')].apply(anyFuncB)]

data = pd.read_excel('def.xlsx')
data.groupby(by='col0').apply(anyFuncA)

Issue Description

Calling dataframe0.apply(anyFunc0) directly works fine.

With dataframe0.groupby().apply(anyFunc0), however, if another dataframe, dataframe1, has multi-level (MultiIndex) column headers and the applied function runs dataframe1[('col0', 'col1')].apply(anyFunc1), the following line fails:

File "/usr/local/python3.10/lib/python3.10/site-packages/modin/pandas/series.py", line 713, in apply
    if result.name == self.index[0]:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

Here result.name is a tuple with 2 items and self.index[0] is a numpy.int64, so the comparison returns an array of two boolean values instead of a single bool. My temporary fix is to add the following code:

elif return_type == "Series":
    try:
        if result.name == self.index[0]:
            result.name = None
    except ValueError:
        if (result.name == self.index[0]).all():
            result.name = None
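
The failing comparison described above can be reproduced in isolation, without Modin. This is a minimal sketch: the variable names are illustrative, and a numeric tuple is used as a stand-in for a name such as ('col0', 'col1') so that the result is the same across NumPy versions.

```python
import numpy as np

name = (1, 2)               # stand-in for a tuple-valued Series name
first_label = np.int64(1)   # stand-in for self.index[0]

# tuple.__eq__ defers to numpy.int64, which compares element-wise
comparison = name == first_label

try:
    if comparison:          # truth-testing a 2-element array
        pass
    raised = False
except ValueError as err:
    raised = True
    message = str(err)
```

The comparison itself succeeds and yields a two-element boolean array; it is the `if` statement that raises.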

Another solution could be to check whether result.name and self.index[0] are each a single value before comparing them.
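
One way that check might look is sketched below. The helper name `labels_match` is hypothetical and not part of Modin; it assumes labels are either scalars or tuples, and treats a tuple and a scalar as never equal.

```python
import numpy as np

def labels_match(name, first_label):
    """Return a plain bool saying whether two index labels are equal.

    Hypothetical helper, not Modin API: a tuple label is only ever
    compared against another tuple, so NumPy never broadcasts.
    """
    name_is_tuple = isinstance(name, tuple)
    if name_is_tuple != isinstance(first_label, tuple):
        return False  # tuple vs scalar: different kinds of label
    if name_is_tuple:
        return len(name) == len(first_label) and all(
            bool(a == b) for a, b in zip(name, first_label)
        )
    return bool(name == first_label)

# The case from this report: tuple name vs numpy.int64 label
mismatch = labels_match(("col0", "col1"), np.int64(0))
# The ordinary scalar case still compares as before
scalar_match = labels_match(3, np.int64(3))
```

With a helper like this, the check in series.py could become `if labels_match(result.name, self.index[0]):` without needing a try/except.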

Expected Behavior

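The comparison should evaluate to a single boolean without raising, even when result.name is a tuple and self.index[0] is a scalar.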

Error Logs


Traceback (most recent call last):
  File "/home/ecommerce_production_classification/database.py", line 46, in <module>
    print(data.loc[:5].groupby(by='company_id').apply(lambda x: detect_data(x)))
  File "/usr/local/python3.10/lib/python3.10/site-packages/modin/logging/logger_decorator.py", line 144, in run_and_log
    return obj(*args, **kwargs)
  File "/usr/local/python3.10/lib/python3.10/site-packages/modin/pandas/groupby.py", line 653, in apply
    if not isinstance(apply_res, Series) and apply_res.columns.equals(
  File "/usr/local/python3.10/lib/python3.10/site-packages/modin/pandas/base.py", line 4294, in __getattribute__
    attr = super().__getattribute__(item)
  File "/usr/local/python3.10/lib/python3.10/site-packages/modin/pandas/dataframe.py", line 315, in _get_columns
    return self._query_compiler.columns
  File "/usr/local/python3.10/lib/python3.10/site-packages/modin/core/storage_formats/pandas/query_compiler.py", line 104, in <lambda>
    return lambda self: self._modin_frame.columns
  File "/usr/local/python3.10/lib/python3.10/site-packages/modin/core/dataframe/pandas/dataframe/dataframe.py", line 727, in _get_columns
    columns, column_widths = self._columns_cache.get(return_lengths=True)
  File "/usr/local/python3.10/lib/python3.10/site-packages/modin/core/dataframe/pandas/metadata/index.py", line 194, in get
    index, self._lengths_cache = self._value()
  File "/usr/local/python3.10/lib/python3.10/site-packages/modin/core/dataframe/pandas/metadata/index.py", line 106, in <lambda>
    return lambda: dataframe_obj._compute_axis_labels_and_lengths(axis)
  File "/usr/local/python3.10/lib/python3.10/site-packages/modin/logging/logger_decorator.py", line 144, in run_and_log
    return obj(*args, **kwargs)
  File "/usr/local/python3.10/lib/python3.10/site-packages/modin/core/dataframe/pandas/dataframe/dataframe.py", line 835, in _compute_axis_labels_and_lengths
    new_index, internal_idx = self._partition_mgr_cls.get_indices(axis, partitions)
  File "/usr/local/python3.10/lib/python3.10/site-packages/modin/logging/logger_decorator.py", line 144, in run_and_log
    return obj(*args, **kwargs)
  File "/usr/local/python3.10/lib/python3.10/site-packages/modin/core/dataframe/pandas/partitioning/partition_manager.py", line 1193, in get_indices
    new_idx = cls.get_objects_from_partitions(new_idx)
  File "/usr/local/python3.10/lib/python3.10/site-packages/modin/logging/logger_decorator.py", line 144, in run_and_log
    return obj(*args, **kwargs)
  File "/usr/local/python3.10/lib/python3.10/site-packages/modin/core/dataframe/pandas/partitioning/partition_manager.py", line 1134, in get_objects_from_partitions
    return cls._execution_wrapper.materialize(
  File "/usr/local/python3.10/lib/python3.10/site-packages/modin/core/execution/ray/common/engine_wrapper.py", line 139, in materialize
    return ray.get(obj_id)
  File "/usr/local/python3.10/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/python3.10/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/python3.10/lib/python3.10/site-packages/ray/_private/worker.py", line 2630, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/usr/local/python3.10/lib/python3.10/site-packages/ray/_private/worker.py", line 863, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ValueError): ray::remote_exec_func() (pid=22666, ip=172.29.158.228)
  At least one of the input arguments for this task could not be computed:
ray.exceptions.RayTaskError: ray::_deploy_ray_func() (pid=22664, ip=172.29.158.228)
  File "/usr/local/python3.10/lib/python3.10/site-packages/modin/core/execution/ray/implementations/pandas_on_ray/partitioning/virtual_partition.py", line 335, in _deploy_ray_func
    result = deployer(axis, f_to_deploy, f_args, f_kwargs, *deploy_args, **kwargs)
  File "/usr/local/python3.10/lib/python3.10/site-packages/modin/logging/logger_decorator.py", line 144, in run_and_log
    return obj(*args, **kwargs)
  File "/usr/local/python3.10/lib/python3.10/site-packages/modin/core/dataframe/pandas/partitioning/axis_partition.py", line 462, in deploy_axis_func
    raise err
  File "/usr/local/python3.10/lib/python3.10/site-packages/modin/core/dataframe/pandas/partitioning/axis_partition.py", line 457, in deploy_axis_func
    result = func(dataframe, *f_args, **f_kwargs)
  File "/usr/local/python3.10/lib/python3.10/site-packages/modin/core/dataframe/pandas/dataframe/dataframe.py", line 2078, in _tree_reduce_func
    series_result = func(df, *args, **kwargs)
  File "/usr/local/python3.10/lib/python3.10/site-packages/modin/core/dataframe/pandas/dataframe/dataframe.py", line 4261, in apply_func
    result = operator(df.groupby(by, **kwargs))
  File "/usr/local/python3.10/lib/python3.10/site-packages/modin/core/storage_formats/pandas/query_compiler.py", line 3976, in <lambda>
    operator=lambda grp: agg_func(grp, *agg_args, **agg_kwargs),
  File "/usr/local/python3.10/lib/python3.10/site-packages/modin/core/storage_formats/pandas/query_compiler.py", line 3957, in agg_func
    result = agg_method(grp, original_agg_func, *args, **kwargs)
  File "/usr/local/python3.10/lib/python3.10/site-packages/pandas/core/groupby/groupby.py", line 1824, in apply
    result = self._python_apply_general(f, self._selected_obj)
  File "/usr/local/python3.10/lib/python3.10/site-packages/pandas/core/groupby/groupby.py", line 1885, in _python_apply_general
    values, mutated = self._grouper.apply_groupwise(f, data, self.axis)
  File "/usr/local/python3.10/lib/python3.10/site-packages/pandas/core/groupby/ops.py", line 919, in apply_groupwise
    res = f(group)
  File "/usr/local/python3.10/lib/python3.10/site-packages/modin/utils.py", line 765, in wrapper
    result = func(*args, **kwargs)
  File "/home/ecommerce_production_classification/database.py", line 46, in <lambda>
    print(data.loc[:5].groupby(by='company_id').apply(lambda x: detect_data(x)))
  File "/home/ecommerce_production_classification/database.py", line 21, in detect_data
    return classification(data, _rulesDF)
  File "/home/ecommerce_production_classification/categorization.py", line 227, in classification
    data = categorization(data, rules)
  File "/home/ecommerce_production_classification/categorization.py", line 209, in categorization
    return process(data, rules,  '分类')
  File "/home/ecommerce_production_classification/categorization.py", line 205, in process
    data[rules['赋值'].columns]=pd.DataFrame(data.apply(getCategories, axis=1).to_dict()).T
  File "/usr/local/python3.10/lib/python3.10/site-packages/pandas/core/frame.py", line 10374, in apply
    return op.apply().__finalize__(self, method="apply")
  File "/usr/local/python3.10/lib/python3.10/site-packages/pandas/core/apply.py", line 916, in apply
    return self.apply_standard()
  File "/usr/local/python3.10/lib/python3.10/site-packages/pandas/core/apply.py", line 1063, in apply_standard
    results, res_index = self.apply_series_generator()
  File "/usr/local/python3.10/lib/python3.10/site-packages/pandas/core/apply.py", line 1081, in apply_series_generator
    results[i] = self.func(v, *self.args, **self.kwargs)
  File "/home/ecommerce_production_classification/categorization.py", line 168, in getCategories
    _res = rules.loc[rules[('运算式','运算式')].apply(operationToBool)]
  File "/usr/local/python3.10/lib/python3.10/site-packages/modin/logging/logger_decorator.py", line 144, in run_and_log
    return obj(*args, **kwargs)
  File "/usr/local/python3.10/lib/python3.10/site-packages/modin/pandas/series.py", line 713, in apply
    if result.name == self.index[0]:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

Installed Versions

UserWarning: Setuptools is replacing distutils.

INSTALLED VERSIONS

commit : c8bbca8e4e00c681370e3736b2f73bb0352408c3
python : 3.10.12.final.0
python-bits : 64
OS : Linux
OS-release : 3.10.0-1160.108.1.el7.x86_64
Version : #1 SMP Thu Jan 25 16:17:31 UTC 2024
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

Modin dependencies

modin : 0.31.0
ray : 2.30.0
dask : None
distributed : None

pandas dependencies

pandas : 2.2.2
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : 65.5.0
pip : 24.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : 1.4.6
psycopg2 : None
jinja2 : 3.1.4
IPython : 8.26.0
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : 2024.6.0
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : 3.1.4
pandas_gbq : None
pyarrow : 16.1.0
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : 2.0.31
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None

Pekton, Jul 16 '24 03:07