[BUG] sort_values failed after using dropna

Open hoarjour opened this issue 4 years ago • 5 comments

Describe the bug: When I try to use sort_values(ignore_index=True) after dropna, it raises a TypeError:

import numpy as np
import mars.dataframe as md
a = md.Series([1, 3, 2, np.nan, np.nan])
a.dropna().sort_values(ignore_index=True).execute()

The same operation works in pandas:

import numpy as np
import pandas as pd
b = pd.Series([1, 3, 2, np.nan, np.nan])
b.dropna().sort_values(ignore_index=True)

To Reproduce: To help us reproduce this bug, please provide the information below:

  1. Your Python version: 3.8.0
  2. The version of Mars you use: 0.6.11
  3. Versions of crucial packages, such as numpy, scipy and pandas: pandas: 1.1.3
  4. Full stack of the error.
ValueError                                Traceback (most recent call last)
c:\users\hoa'r'jou'r\appdata\local\programs\python\python38\lib\site-packages\pandas\core\dtypes\common.py in ensure_python_int(value)
    170     try:
--> 171         new_value = int(value)
    172         assert new_value == value

ValueError: cannot convert float NaN to integer

The above exception was the direct cause of the following exception:

TypeError                                 Traceback (most recent call last)
<ipython-input-18-f7e878c753c1> in <module>
      1 a = md.Series([1,3,2,np.nan,np.nan])
----> 2 a.dropna().sort_values(ignore_index=True).execute()

c:\users\hoa'r'jou'r\appdata\local\programs\python\python38\lib\site-packages\mars\dataframe\sort\sort_values.py in series_sort_values(series, axis, ascending, inplace, kind, na_position, ignore_index, parallel_kind, psrs_kinds)
    317                              parallel_kind=parallel_kind, psrs_kinds=psrs_kinds,
    318                              output_types=[OutputType.series], gpu=series.op.is_gpu())
--> 319     sorted_series = op(series)
    320     if inplace:
    321         series.data = sorted_series.data

c:\users\hoa'r'jou'r\appdata\local\programs\python\python38\lib\site-packages\mars\utils.py in _inner(*args, **kwargs)
    454         def _inner(*args, **kwargs):
    455             with self:
--> 456                 return func(*args, **kwargs)
    457 
    458         return _inner

c:\users\hoa'r'jou'r\appdata\local\programs\python\python38\lib\site-packages\mars\dataframe\sort\sort_values.py in __call__(self, a)
     97         assert self.axis == 0
     98         if self.ignore_index:
---> 99             index_value = parse_index(pd.RangeIndex(a.shape[0]))
    100         else:
    101             if isinstance(a.index_value.value, IndexValue.RangeIndex):

c:\users\hoa'r'jou'r\appdata\local\programs\python\python38\lib\site-packages\pandas\core\indexes\range.py in __new__(cls, start, stop, step, dtype, copy, name)
    100             raise TypeError("RangeIndex(...) must be called with integers")
    101 
--> 102         start = ensure_python_int(start) if start is not None else 0
    103 
    104         if stop is None:

c:\users\hoa'r'jou'r\appdata\local\programs\python\python38\lib\site-packages\pandas\core\dtypes\common.py in ensure_python_int(value)
    172         assert new_value == value
    173     except (TypeError, ValueError, AssertionError) as err:
--> 174         raise TypeError(f"Wrong type {type(value)} for value {value}") from err
    175     return new_value
    176 

TypeError: Wrong type <class 'float'> for value nan

hoarjour avatar Sep 27 '21 04:09 hoarjour
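
The traceback in the report suggests the failure is not in the sort itself. After dropna, the length of the series is not known until execution, so a.shape[0] appears to be NaN, and pandas refuses to build a RangeIndex from it. A minimal pandas-only sketch of that last step, with no Mars involved (the name unknown_length is only for illustration, and the NaN value is an assumption read off the traceback):

```python
import numpy as np
import pandas as pd

# Stand-in for a.shape[0] when the row count is not yet known
# (assumed to be NaN, as the traceback indicates).
unknown_length = np.nan

try:
    pd.RangeIndex(unknown_length)
except TypeError as exc:
    # pandas rejects a non-integer length with a TypeError.
    error_message = str(exc)
    print(error_message)
else:
    error_message = None
```

This reproduces the final TypeError without dropna or sort_values, which points at the pd.RangeIndex(a.shape[0]) call in sort_values.py as the place to guard.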

Please copy-paste your code and error message instead of screenshots.

hekaisheng avatar Sep 27 '21 05:09 hekaisheng

This can be fixed by parsing pd.RangeIndex(-1) when the size of a dimension is unknown.

wjsi avatar Oct 08 '21 09:10 wjsi
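
As a rough illustration of this suggestion, the sketch below falls back to pd.RangeIndex(-1) whenever the length is NaN. The helper name placeholder_range_index is hypothetical and not part of Mars. In Mars the result would additionally be passed through parse_index, which is left out here so the snippet runs with pandas alone.

```python
import math

import pandas as pd


def placeholder_range_index(length):
    """Return a RangeIndex for `length`, or a placeholder if it is unknown.

    Hypothetical helper for illustration only; not a Mars API.
    """
    # An unknown dimension is assumed to be represented as float NaN.
    if isinstance(length, float) and math.isnan(length):
        # RangeIndex(-1) is valid in pandas and contains no elements.
        return pd.RangeIndex(-1)
    return pd.RangeIndex(length)


known = placeholder_range_index(3)
unknown = placeholder_range_index(float("nan"))
print(list(known), len(unknown))
```

The placeholder carries no real positions. It only gives the graph-building step a valid index object until the actual length is known at execution time.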

Hello :) I'm a beginner to open source and I'd like to resolve this issue. Is it still relevant?

DanielGoman avatar Oct 02 '22 14:10 DanielGoman

Super welcome! You can try to fix this; feel free to ask questions if you encounter any issues.

qinxuye avatar Oct 11 '22 04:10 qinxuye

Hello. I'm new to the open source pull request thing, but I've forked and sent out a pull request at https://github.com/mars-project/mars/pull/3363

I would note that running black as suggested for linting also edited mars/learn/contrib/lightgbm/tests/test_classifier.py.

Edits at a glance: mars\dataframe\sort\sort_values.py, lines 111-114. From:

    def __call__(self, a):
        assert self.axis == 0
        if self.ignore_index:
            index_value = parse_index(pd.RangeIndex(a.shape[0]))
        else:
            if isinstance(a.index_value.value, IndexValue.RangeIndex):
                index_value = parse_index(pd.Index([], dtype=np.int64))
            else:
                index_value = a.index_value
    -snip-

To:

    def __call__(self, a):
        assert self.axis == 0
        if self.ignore_index:
            if type(a.shape[0]) != int:
                index_value = parse_index(pd.RangeIndex(-1))
            else:
                index_value = parse_index(pd.RangeIndex(a.shape[0]))
        else:
            if isinstance(a.index_value.value, IndexValue.RangeIndex):
                index_value = parse_index(pd.Index([], dtype=np.int64))
            else:
                index_value = a.index_value
    -snip-

Gist - Code to recreate problem + some notes (since it's an old issue) https://gist.github.com/Shaun2h/cf294782c840eaa1223caf2e4ad5bfd0

Shaun2h avatar Oct 07 '23 15:10 Shaun2h
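
One detail of the proposed check may be worth a second look: type(x) != int is also true for integer types that are not exactly int, such as NumPy integers, so a known length of that kind would be treated as unknown. Whether Mars ever stores such a value in a shape is an assumption that has not been verified here. The sketch below compares that check with an isinstance test against numbers.Integral; both helper names are hypothetical.

```python
import numbers

import numpy as np


def is_known_exact(length):
    # Mirrors the check in the proposed change: only a plain `int` counts.
    return type(length) is int


def is_known_integral(length):
    # Accepts any integer type, including NumPy integers; rejects float NaN.
    return isinstance(length, numbers.Integral)


for value in (3, np.int64(3), np.nan):
    print(repr(value), is_known_exact(value), is_known_integral(value))
```

Both checks reject NaN, so either one avoids the reported TypeError. They differ only in how they treat non-builtin integer types.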