BUG: Internal and external indices on axis 1 do not match.
Modin version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest released version of Modin.
- [ ] I have confirmed this bug exists on the main branch of Modin. (In order to do this you can follow this guide.)
Reproducible Example
import modin.pandas as mpd
import numpy as np
import string

# Function to generate a random string of given length
def generate_random_string(length):
    return ''.join(np.random.choice(list(string.ascii_letters)) for _ in range(length))

# Generate a list of 1500 strings
list_of_strings = []
for _ in range(1500):
    string_length = np.random.randint(5, 63)
    random_string = generate_random_string(string_length)
    list_of_strings.append(random_string)

# Generate a small amount of data to be queried
for _ in range(5):
    string_length = np.random.randint(64, 100)
    random_string = generate_random_string(string_length)
    list_of_strings.append(random_string)

def split(df, split_size):
    df_ = mpd.DataFrame(columns=['uid', 'seq', 'len'])
    df_['seq'] = [
        df['seq'][x:x + split_size]
        for x in range(0, df['len'], split_size)]
    df_['uid'] = df['uid']
    df_['len'] = df_['seq'].apply(len)
    return df_

df = mpd.DataFrame(columns=['uid', 'seq', 'len'])
df['seq'] = list_of_strings
df['uid'] = 'test'
df['len'] = df['seq'].apply(len)
print(df)

splitted_df = df[df['len'] > 64].apply(split, axis=1, split_size=64).reset_index(drop=True)
print(splitted_df)
Issue Description
When doing apply() on a queried dataframe, there is a chance of hitting an internal error when the queried dataframe is small. It does sometimes work when the queried dataframe has many rows.
Expected Behavior
Get the expected result without any error.
Error Logs
No backtrace was attached; the reported failure is the internal assertion from the title: "Internal and external indices on axis 1 do not match."
Installed Versions
INSTALLED VERSIONS
commit      : d54dcfd8e4cceecdbf818a48bbc712854dda906e
python      : 3.9.15.final.0
python-bits : 64
OS          : Linux
OS-release  : 5.15.0-92-generic
Version     : #102~20.04.1-Ubuntu SMP Mon Jan 15 13:09:14 UTC 2024
machine     : x86_64
processor   : x86_64
byteorder   : little
LC_ALL      : None
LANG        : en_US.UTF-8
LOCALE      : en_US.UTF-8
Modin dependencies
modin       : 0.27.0
ray         : 2.9.2
dask        : 2024.2.0
distributed : 2024.2.0
hdk         : None
pandas dependencies
pandas                 : 2.2.1
numpy                  : 1.24.4
pytz                   : 2024.1
dateutil               : 2.8.2
setuptools             : 58.1.0
pip                    : 24.0
Cython                 : None
pytest                 : 7.4.4
hypothesis             : None
sphinx                 : None
blosc                  : None
feather                : None
xlsxwriter             : None
lxml.etree             : None
html5lib               : 1.1
pymysql                : None
psycopg2               : None
jinja2                 : 3.1.3
IPython                : 8.18.1
pandas_datareader      : None
adbc-driver-postgresql : None
adbc-driver-sqlite     : None
bs4                    : 4.12.3
bottleneck             : None
dataframe-api-compat   : None
fastparquet            : None
fsspec                 : 2024.2.0
gcsfs                  : None
matplotlib             : 3.7.5
numba                  : None
numexpr                : None
odfpy                  : None
openpyxl               : None
pandas_gbq             : None
pyarrow                : 15.0.0
pyreadstat             : None
python-calamine        : None
pyxlsb                 : None
s3fs                   : None
scipy                  : 1.10.1
sqlalchemy             : None
tables                 : None
tabulate               : None
xarray                 : None
xlrd                   : None
zstandard              : None
tzdata                 : 2024.1
qtpy                   : None
pyqt5                  : None
@SiRumCz, I can confirm that the issue happens on master. I would like to clarify: is the resultant series (splitted_df) containing multiple DataFrames the expected result for you? I see the following output with pandas.
print(splitted_df)
0 uid ...
1 uid ...
2 uid ...
3 uid ...
4 uid ...
dtype: object
print(splitted_df[0])
    uid                                                seq  len
0  test  fuMmtQebxFZBfeCANHayKnbKmkNeKdEjczKsLyQrfSsuaR...   64
1  test                                            QYCMNtm    7
@YarShev Yes, it is expected. My goal is to split a bigger row with large data into many rows with smaller data. Currently I have an apply() followed by something like pd.concat(splitted_df.tolist(), ignore_index=True); perhaps there's a better way to code it?
There might be another way to do what you want to achieve, but I am afraid we would run into a similar indices-mismatch issue. We will take a look at this issue.
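For instance, one alternative shape is to build the chunk lists column-wise and use explode() instead of returning DataFrames from apply(). This is a minimal sketch only: it reuses df from the reproducible example, CHUNK is an illustrative name, and since it still filters the frame first, it could hit the same kind of mismatch.

# A sketch only, not a tested fix; CHUNK and the overall shape are illustrative.
import modin.pandas as mpd

CHUNK = 64

long_rows = df[df['len'] > CHUNK].copy()
# Turn each long string into a list of <= CHUNK-sized slices.
long_rows['seq'] = long_rows['seq'].apply(
    lambda s: [s[i:i + CHUNK] for i in range(0, len(s), CHUNK)])
# explode() emits one row per list element, repeating 'uid' for each chunk.
splitted = long_rows.explode('seq', ignore_index=True)
splitted['len'] = splitted['seq'].apply(len)
print(splitted)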
@YarShev What I have done to work around this issue is to not query the dataframe. Instead, I apply the function to every row and check the length: if it is not greater than the split size, I return np.nan, and then I do a dropna() before the concat.
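Roughly, a sketch of that workaround, reusing the split() helper from the reproducible example (split_or_nan is just an illustrative wrapper name):

# A rough sketch of the workaround described above.
import numpy as np
import modin.pandas as mpd

def split_or_nan(row, split_size):
    # No pre-filtering: every row is processed; rows at or below the
    # split size become NaN and are dropped afterwards.
    if row['len'] <= split_size:
        return np.nan
    return split(row, split_size)

parts = df.apply(split_or_nan, axis=1, split_size=64).dropna()
result = mpd.concat(parts.tolist(), ignore_index=True)
print(result)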
Good to hear that you have been able to work around the issue. But we should definitely figure out its root cause. We hope to schedule this task for a future release.
The issue relates to the empty partitions that are left behind after df[df['len'] > 64]. After doing apply, we have partitions (empty and non-empty) with different columns: the correct columns live in the non-empty partitions, while outdated/wrong columns remain in the empty ones. Then, when we compute the columns at the end of apply for a Modin DataFrame, we retrieve them from the empty partitions, which is wrong. This is a known issue for us, and we have already planned to handle this case properly, but not in the upcoming months, unfortunately. I hope the workaround you have is sufficient for now.
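To make that concrete, here is a conceptual sketch in plain pandas rather than Modin's actual partition code (the frames and column names are made up for illustration):

# Conceptual illustration: an empty block can keep a stale schema after a
# masked apply, and reading column labels from it disagrees with the data.
import pandas as pd

non_empty = pd.DataFrame({'uid': ['test'], 'seq': ['abc'], 'len': [3]})
empty = pd.DataFrame(columns=['stale_a', 'stale_b'])  # outdated, pre-apply schema

print(list(non_empty.columns))  # ['uid', 'seq', 'len']  <- correct source
print(list(empty.columns))      # ['stale_a', 'stale_b'] <- wrong source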
Yes, I can live with it for now.