BUG: groupby().apply() raises NumPy ValueError when a Series has a MultiIndex
Modin version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest released version of Modin.
- [X] I have confirmed this bug exists on the main branch of Modin. (In order to do this you can follow this guide.)
Reproducible Example
import modin.pandas as pd

data1 = pd.read_excel('abc.xlsx', header=[0, 1])  # two header rows, so the columns form a MultiIndex

def anyFuncB(x):
    # do something with x here
    return x

def anyFuncA(x):
    # The error is raised here: apply() on a column selected by a tuple key returns a pd.Series
    return x.loc[data1[('col0', 'col1')].apply(anyFuncB)]

data = pd.read_excel('def.xlsx')
data.groupby(by='col0').apply(anyFuncA)
Issue Description
Calling dataframe0.apply(anyFunc0) on its own works fine.
The failure appears with dataframe0.groupby().apply(anyFunc0) when anyFunc0 runs dataframe1[('col0', 'col1')].apply(anyFunc1) on another DataFrame, dataframe1, that has a MultiIndex. In that case the following line raises:
File "/usr/local/python3.10/lib/python3.10/site-packages/modin/pandas/series.py", line 713, in apply
if result.name == self.index[0]:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Here result.name is a tuple with 2 items and self.index[0] is a numpy.int64, so the comparison returns an array containing two boolean values instead of a single bool. My temporary fix is to add the following code:
elif return_type == "Series":
    try:
        if result.name == self.index[0]:
            result.name = None
    except ValueError:
        # The comparison returned an array (result.name is a tuple), so reduce it with .all()
        if (result.name == self.index[0]).all():
            result.name = None
Another solution could be to check whether result.name and self.index[0] are each a single value before comparing them.
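The second idea could be sketched roughly as follows. This is only an illustration, not Modin's code: `labels_equal` is a hypothetical helper name, and it assumes that a tuple label and a non-tuple label should never be treated as equal, so the element-wise `==` is skipped entirely in that case.

```python
from collections.abc import Hashable


def labels_equal(name: Hashable, label: Hashable) -> bool:
    """Return True only when both arguments are the same index label.

    Hypothetical helper for illustration; it is not part of Modin.
    """
    name_is_tuple = isinstance(name, tuple)
    label_is_tuple = isinstance(label, tuple)
    if name_is_tuple != label_is_tuple:
        # Assumption: a tuple label never equals a scalar label, so "==" is
        # not evaluated and no element-wise array result can be produced.
        return False
    # Both are tuples or both are scalars, so "==" gives one truth value.
    return bool(name == label)


print(labels_equal(("col0", "col1"), 0))                 # False
print(labels_equal(("col0", "col1"), ("col0", "col1")))  # True
print(labels_equal(0, 0))                                # True
```

With a helper like this, the check in series.py could read `if labels_equal(result.name, self.index[0]):`, which avoids catching the exception after the fact.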
Expected Behavior
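The comparison between result.name and self.index[0] should produce a single boolean, so that apply() completes without raising when result.name is a tuple.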
Error Logs
Traceback (most recent call last):
File "/home/ecommerce_production_classification/database.py", line 46, in <module>
print(data.loc[:5].groupby(by='company_id').apply(lambda x: detect_data(x)))
File "/usr/local/python3.10/lib/python3.10/site-packages/modin/logging/logger_decorator.py", line 144, in run_and_log
return obj(*args, **kwargs)
File "/usr/local/python3.10/lib/python3.10/site-packages/modin/pandas/groupby.py", line 653, in apply
if not isinstance(apply_res, Series) and apply_res.columns.equals(
File "/usr/local/python3.10/lib/python3.10/site-packages/modin/pandas/base.py", line 4294, in __getattribute__
attr = super().__getattribute__(item)
File "/usr/local/python3.10/lib/python3.10/site-packages/modin/pandas/dataframe.py", line 315, in _get_columns
return self._query_compiler.columns
File "/usr/local/python3.10/lib/python3.10/site-packages/modin/core/storage_formats/pandas/query_compiler.py", line 104, in <lambda>
return lambda self: self._modin_frame.columns
File "/usr/local/python3.10/lib/python3.10/site-packages/modin/core/dataframe/pandas/dataframe/dataframe.py", line 727, in _get_columns
columns, column_widths = self._columns_cache.get(return_lengths=True)
File "/usr/local/python3.10/lib/python3.10/site-packages/modin/core/dataframe/pandas/metadata/index.py", line 194, in get
index, self._lengths_cache = self._value()
File "/usr/local/python3.10/lib/python3.10/site-packages/modin/core/dataframe/pandas/metadata/index.py", line 106, in <lambda>
return lambda: dataframe_obj._compute_axis_labels_and_lengths(axis)
File "/usr/local/python3.10/lib/python3.10/site-packages/modin/logging/logger_decorator.py", line 144, in run_and_log
return obj(*args, **kwargs)
File "/usr/local/python3.10/lib/python3.10/site-packages/modin/core/dataframe/pandas/dataframe/dataframe.py", line 835, in _compute_axis_labels_and_lengths
new_index, internal_idx = self._partition_mgr_cls.get_indices(axis, partitions)
File "/usr/local/python3.10/lib/python3.10/site-packages/modin/logging/logger_decorator.py", line 144, in run_and_log
return obj(*args, **kwargs)
File "/usr/local/python3.10/lib/python3.10/site-packages/modin/core/dataframe/pandas/partitioning/partition_manager.py", line 1193, in get_indices
new_idx = cls.get_objects_from_partitions(new_idx)
File "/usr/local/python3.10/lib/python3.10/site-packages/modin/logging/logger_decorator.py", line 144, in run_and_log
return obj(*args, **kwargs)
File "/usr/local/python3.10/lib/python3.10/site-packages/modin/core/dataframe/pandas/partitioning/partition_manager.py", line 1134, in get_objects_from_partitions
return cls._execution_wrapper.materialize(
File "/usr/local/python3.10/lib/python3.10/site-packages/modin/core/execution/ray/common/engine_wrapper.py", line 139, in materialize
return ray.get(obj_id)
File "/usr/local/python3.10/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
return fn(*args, **kwargs)
File "/usr/local/python3.10/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
File "/usr/local/python3.10/lib/python3.10/site-packages/ray/_private/worker.py", line 2630, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
File "/usr/local/python3.10/lib/python3.10/site-packages/ray/_private/worker.py", line 863, in get_objects
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ValueError): ray::remote_exec_func() (pid=22666, ip=172.29.158.228)
At least one of the input arguments for this task could not be computed:
ray.exceptions.RayTaskError: ray::_deploy_ray_func() (pid=22664, ip=172.29.158.228)
File "/usr/local/python3.10/lib/python3.10/site-packages/modin/core/execution/ray/implementations/pandas_on_ray/partitioning/virtual_partition.py", line 335, in _deploy_ray_func
result = deployer(axis, f_to_deploy, f_args, f_kwargs, *deploy_args, **kwargs)
File "/usr/local/python3.10/lib/python3.10/site-packages/modin/logging/logger_decorator.py", line 144, in run_and_log
return obj(*args, **kwargs)
File "/usr/local/python3.10/lib/python3.10/site-packages/modin/core/dataframe/pandas/partitioning/axis_partition.py", line 462, in deploy_axis_func
raise err
File "/usr/local/python3.10/lib/python3.10/site-packages/modin/core/dataframe/pandas/partitioning/axis_partition.py", line 457, in deploy_axis_func
result = func(dataframe, *f_args, **f_kwargs)
File "/usr/local/python3.10/lib/python3.10/site-packages/modin/core/dataframe/pandas/dataframe/dataframe.py", line 2078, in _tree_reduce_func
series_result = func(df, *args, **kwargs)
File "/usr/local/python3.10/lib/python3.10/site-packages/modin/core/dataframe/pandas/dataframe/dataframe.py", line 4261, in apply_func
result = operator(df.groupby(by, **kwargs))
File "/usr/local/python3.10/lib/python3.10/site-packages/modin/core/storage_formats/pandas/query_compiler.py", line 3976, in <lambda>
operator=lambda grp: agg_func(grp, *agg_args, **agg_kwargs),
File "/usr/local/python3.10/lib/python3.10/site-packages/modin/core/storage_formats/pandas/query_compiler.py", line 3957, in agg_func
result = agg_method(grp, original_agg_func, *args, **kwargs)
File "/usr/local/python3.10/lib/python3.10/site-packages/pandas/core/groupby/groupby.py", line 1824, in apply
result = self._python_apply_general(f, self._selected_obj)
File "/usr/local/python3.10/lib/python3.10/site-packages/pandas/core/groupby/groupby.py", line 1885, in _python_apply_general
values, mutated = self._grouper.apply_groupwise(f, data, self.axis)
File "/usr/local/python3.10/lib/python3.10/site-packages/pandas/core/groupby/ops.py", line 919, in apply_groupwise
res = f(group)
File "/usr/local/python3.10/lib/python3.10/site-packages/modin/utils.py", line 765, in wrapper
result = func(*args, **kwargs)
File "/home/ecommerce_production_classification/database.py", line 46, in <lambda>
print(data.loc[:5].groupby(by='company_id').apply(lambda x: detect_data(x)))
File "/home/ecommerce_production_classification/database.py", line 21, in detect_data
return classification(data, _rulesDF)
File "/home/ecommerce_production_classification/categorization.py", line 227, in classification
data = categorization(data, rules)
File "/home/ecommerce_production_classification/categorization.py", line 209, in categorization
return process(data, rules, '分类')
File "/home/ecommerce_production_classification/categorization.py", line 205, in process
data[rules['赋值'].columns]=pd.DataFrame(data.apply(getCategories, axis=1).to_dict()).T
File "/usr/local/python3.10/lib/python3.10/site-packages/pandas/core/frame.py", line 10374, in apply
return op.apply().__finalize__(self, method="apply")
File "/usr/local/python3.10/lib/python3.10/site-packages/pandas/core/apply.py", line 916, in apply
return self.apply_standard()
File "/usr/local/python3.10/lib/python3.10/site-packages/pandas/core/apply.py", line 1063, in apply_standard
results, res_index = self.apply_series_generator()
File "/usr/local/python3.10/lib/python3.10/site-packages/pandas/core/apply.py", line 1081, in apply_series_generator
results[i] = self.func(v, *self.args, **self.kwargs)
File "/home/ecommerce_production_classification/categorization.py", line 168, in getCategories
_res = rules.loc[rules[('运算式','运算式')].apply(operationToBool)]
File "/usr/local/python3.10/lib/python3.10/site-packages/modin/logging/logger_decorator.py", line 144, in run_and_log
return obj(*args, **kwargs)
File "/usr/local/python3.10/lib/python3.10/site-packages/modin/pandas/series.py", line 713, in apply
if result.name == self.index[0]:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Installed Versions
UserWarning: Setuptools is replacing distutils.
INSTALLED VERSIONS
commit : c8bbca8e4e00c681370e3736b2f73bb0352408c3
python : 3.10.12.final.0
python-bits : 64
OS : Linux
OS-release : 3.10.0-1160.108.1.el7.x86_64
Version : #1 SMP Thu Jan 25 16:17:31 UTC 2024
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
Modin dependencies
modin : 0.31.0
ray : 2.30.0
dask : None
distributed : None
pandas dependencies
pandas : 2.2.2
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : 65.5.0
pip : 24.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : 1.4.6
psycopg2 : None
jinja2 : 3.1.4
IPython : 8.26.0
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : 2024.6.0
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : 3.1.4
pandas_gbq : None
pyarrow : 16.1.0
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : 2.0.31
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None