BUG: Internal and external indices on axis 1 do not match.
Modin version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest released version of Modin.
- [ ] I have confirmed this bug exists on the main branch of Modin. (In order to do this you can follow this guide.)
Reproducible Example
import modin.pandas as mpd
import numpy as np
import string

# Function to generate a random string of given length
def generate_random_string(length):
    return ''.join(np.random.choice(list(string.ascii_letters)) for _ in range(length))

# Generate a list of 1500 strings
list_of_strings = []
for _ in range(1500):
    string_length = np.random.randint(5, 63)
    random_string = generate_random_string(string_length)
    list_of_strings.append(random_string)

# Generate a small amount of data to be queried
for _ in range(5):
    string_length = np.random.randint(64, 100)
    random_string = generate_random_string(string_length)
    list_of_strings.append(random_string)

def split(df, split_size):
    df_ = mpd.DataFrame(columns=['uid', 'seq', 'len'])
    df_['seq'] = [
        df['seq'][x:x + split_size]
        for x in range(0, df['len'], split_size)]
    df_['uid'] = df['uid']
    df_['len'] = df_['seq'].apply(len)
    return df_

df = mpd.DataFrame(columns=['uid', 'seq', 'len'])
df['seq'] = list_of_strings
df['uid'] = 'test'
df['len'] = df['seq'].apply(len)
print(df)

splitted_df = df[df['len'] > 64].apply(split, axis=1, split_size=64).reset_index(drop=True)
print(splitted_df)
Issue Description
When doing apply() on a queried dataframe, there is a chance of hitting an internal error when the queried dataframe is small. It does sometimes work when the queried dataframe has many rows.
Expected Behavior
Get the expected result without any error.
Error Logs
No backtrace was attached; the reported failure is the internal assertion from the title: "Internal and external indices on axis 1 do not match."
Installed Versions
INSTALLED VERSIONS
commit      : d54dcfd8e4cceecdbf818a48bbc712854dda906e
python      : 3.9.15.final.0
python-bits : 64
OS          : Linux
OS-release  : 5.15.0-92-generic
Version     : #102~20.04.1-Ubuntu SMP Mon Jan 15 13:09:14 UTC 2024
machine     : x86_64
processor   : x86_64
byteorder   : little
LC_ALL      : None
LANG        : en_US.UTF-8
LOCALE      : en_US.UTF-8
Modin dependencies
modin       : 0.27.0
ray         : 2.9.2
dask        : 2024.2.0
distributed : 2024.2.0
hdk         : None
pandas dependencies
pandas                 : 2.2.1
numpy                  : 1.24.4
pytz                   : 2024.1
dateutil               : 2.8.2
setuptools             : 58.1.0
pip                    : 24.0
Cython                 : None
pytest                 : 7.4.4
hypothesis             : None
sphinx                 : None
blosc                  : None
feather                : None
xlsxwriter             : None
lxml.etree             : None
html5lib               : 1.1
pymysql                : None
psycopg2               : None
jinja2                 : 3.1.3
IPython                : 8.18.1
pandas_datareader      : None
adbc-driver-postgresql : None
adbc-driver-sqlite     : None
bs4                    : 4.12.3
bottleneck             : None
dataframe-api-compat   : None
fastparquet            : None
fsspec                 : 2024.2.0
gcsfs                  : None
matplotlib             : 3.7.5
numba                  : None
numexpr                : None
odfpy                  : None
openpyxl               : None
pandas_gbq             : None
pyarrow                : 15.0.0
pyreadstat             : None
python-calamine        : None
pyxlsb                 : None
s3fs                   : None
scipy                  : 1.10.1
sqlalchemy             : None
tables                 : None
tabulate               : None
xarray                 : None
xlrd                   : None
zstandard              : None
tzdata                 : 2024.1
qtpy                   : None
pyqt5                  : None
@SiRumCz, I can confirm that the issue happens on master. I would like to clarify: is the resultant series (splitted_df) containing multiple DataFrames the expected result for you? I see the following output with pandas.
print(splitted_df)
0 uid ...
1 uid ...
2 uid ...
3 uid ...
4 uid ...
dtype: object
print(splitted_df[0])
    uid                                                seq  len
0  test  fuMmtQebxFZBfeCANHayKnbKmkNeKdEjczKsLyQrfSsuaR...   64
1  test                                            QYCMNtm    7
@YarShev Yes, it is expected. My goal is to split a bigger row with large data into many rows with smaller data. Currently I have an apply() followed by something like pd.concat(splitted_df.tolist(), ignore_index=True); perhaps there's a better way to code it?
There might be another way to do what you want to achieve, but I am afraid we would run into a similar indices-mismatch issue. We will take a look at this issue.
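For instance, one alternative shape is to build the chunk lists column-wise and use explode() instead of returning DataFrames from apply(). This is a minimal sketch only: it reuses df from the reproducible example, CHUNK is an illustrative name, and since it still filters the frame first, it could hit the same kind of mismatch.

# A sketch only, not a tested fix; CHUNK and the overall shape are illustrative.
import modin.pandas as mpd

CHUNK = 64

long_rows = df[df['len'] > CHUNK].copy()
# Turn each long string into a list of <= CHUNK-sized slices.
long_rows['seq'] = long_rows['seq'].apply(
    lambda s: [s[i:i + CHUNK] for i in range(0, len(s), CHUNK)])
# explode() emits one row per list element, repeating 'uid' for each chunk.
splitted = long_rows.explode('seq', ignore_index=True)
splitted['len'] = splitted['seq'].apply(len)
print(splitted)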
@YarShev What I have done to work around this issue is to not query the dataframe. Instead, I apply the function to every row and check the length: if it is not greater than the split size, I return np.nan, and then I do a dropna() before the concat.
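Roughly, a sketch of that workaround, reusing the split() helper from the reproducible example (split_or_nan is just an illustrative wrapper name):

# A rough sketch of the workaround described above.
import numpy as np
import modin.pandas as mpd

def split_or_nan(row, split_size):
    # No pre-filtering: every row is processed; rows at or below the
    # split size become NaN and are dropped afterwards.
    if row['len'] <= split_size:
        return np.nan
    return split(row, split_size)

parts = df.apply(split_or_nan, axis=1, split_size=64).dropna()
result = mpd.concat(parts.tolist(), ignore_index=True)
print(result)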
Good to hear that you have been able to work around the issue. But we should definitely figure out its root cause. We hope to schedule this task for a future release.
The issue relates to the empty partitions that are left behind after df[df['len'] > 64]. After doing apply, we have partitions (empty and non-empty) with different columns: the correct columns live in the non-empty partitions, while outdated/wrong columns remain in the empty ones. Then, when we compute the columns at the end of apply for a Modin DataFrame, we retrieve them from the empty partitions, which is wrong. This is a known issue for us, and we have already planned to handle this case properly, but not in the upcoming months, unfortunately. I hope the workaround you have is sufficient for now.
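To make that concrete, here is a conceptual sketch in plain pandas rather than Modin's actual partition code (the frames and column names are made up for illustration):

# Conceptual illustration: an empty block can keep a stale schema after a
# masked apply, and reading column labels from it disagrees with the data.
import pandas as pd

non_empty = pd.DataFrame({'uid': ['test'], 'seq': ['abc'], 'len': [3]})
empty = pd.DataFrame(columns=['stale_a', 'stale_b'])  # outdated, pre-apply schema

print(list(non_empty.columns))  # ['uid', 'seq', 'len']  <- correct source
print(list(empty.columns))      # ['stale_a', 'stale_b'] <- wrong source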
Yes, I can live with it for now.