modin icon indicating copy to clipboard operation
modin copied to clipboard

BUG: ArrowInvalid Error when Converting Concatenated Modin DataFrame to PyArrow Table

Open DimitarSirakov opened this issue 1 year ago • 5 comments
trafficstars

Modin version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest released version of Modin.

  • [ ] I have confirmed this bug exists on the main branch of Modin. (In order to do this you can follow this guide.)

Reproducible Example

import modin.pandas as mpd
import pyarrow as pa

# Creating Modin DataFrames
mpd_df_1 = mpd.DataFrame({'a': ['1', '2', '3'], 'b': ['4', '5', '6']})
mpd_df_2 = mpd.DataFrame({'a': ['7', '8', '9'], 'b': ['10', '11', '12']})
mpd_df_3 = mpd.DataFrame({'a': ['13', '14', '15'], 'b': ['16', '17', '18']})

# Concatenating DataFrames
test_df = mpd.concat([mpd_df_1, mpd_df_2, mpd_df_3])

# Converting to PyArrow Table (Error Occurs Here)
table_from_modin = pa.Table.from_pandas(df=test_df)

Issue Description

I encountered an ArrowInvalid error when attempting to convert a concatenated Modin DataFrame into a PyArrow table. This issue arises specifically when concatenating multiple Modin DataFrames and then converting the resulting DataFrame to a PyArrow table. The same process works correctly with Pandas DataFrames. The same process works correctly with non concatenated modin DataFrame.

Expected Behavior

The concatenated Modin DataFrame should be converted to a PyArrow table without any errors, similar to how a concatenated Pandas DataFrame or not concat modin Dataframe behaves

Error Logs


ArrowInvalid: ('Could not convert 0     1\n0     7\n0    13\nName: a, dtype: object with type Series: did not recognize Python value type when inferring an Arrow data type', 'Conversion failed for column a with type object')

Installed Versions

INSTALLED VERSIONS

commit : b99cf06a0d3bf27cb5d2d03d52a285cc91cfb6f6 python : 3.10.5.final.0 python-bits : 64 OS : Darwin OS-release : 23.0.0 Version : Darwin Kernel Version 23.0.0: Fri Sep 15 14:42:57 PDT 2023; root:xnu-10002.1.13~1/RELEASE_ARM64_T8112 machine : arm64 processor : arm byteorder : little LC_ALL : en_US.UTF-8 LANG : en_US.UTF-8 LOCALE : en_US.UTF-8

Modin dependencies

modin : 0.25.1 ray : 2.8.1 dask : None distributed : None hdk : None

pandas dependencies

pandas : 2.1.3 numpy : 1.26.2 pytz : 2023.3.post1 dateutil : 2.8.2 setuptools : 65.6.3 pip : 22.3.1 Cython : None pytest : 7.4.3 hypothesis : None sphinx : None blosc : None feather : None xlsxwriter : None lxml.etree : None html5lib : None pymysql : None psycopg2 : 2.9.9 jinja2 : 3.1.2 IPython : 8.18.1 pandas_datareader : None bs4 : 4.12.2 bottleneck : None dataframe-api-compat: None fastparquet : None fsspec : 2023.12.0 gcsfs : None matplotlib : None numba : None numexpr : None odfpy : None openpyxl : None pandas_gbq : None pyarrow : 12.0.0 pyreadstat : None pyxlsb : None s3fs : None scipy : None sqlalchemy : 2.0.23 tables : None tabulate : None xarray : None xlrd : None zstandard : None tzdata : 2023.3 qtpy : None pyqt5 : None

DimitarSirakov avatar Dec 07 '23 15:12 DimitarSirakov

Hi @DimitarSirakov, thanks for the contribution!

I can reproduce the problem. Take a look.

anmyachev avatar Dec 07 '23 15:12 anmyachev

I found out that ignore_index=True fixes the problem but I can't figure out why and why this is not the case in pandas.

DimitarSirakov avatar Dec 07 '23 15:12 DimitarSirakov

@DimitarSirakov iteration over the index occurs when converting; if the index is not unique, then it may return not a scalar, but a Modin Series, with which pyarrow cannot work.

An interchange protocol was developed for interaction between libraries. An implementation of this protocol can be used through https://arrow.apache.org/docs/13.0/python/generated/pyarrow.interchange.from_dataframe.html#pyarrow.interchange.from_dataframe. However, as far as I know, it appears in pyarrow starting from version 13.0.0. Do you have the opportunity to try this option?

anmyachev avatar Dec 07 '23 17:12 anmyachev

@DimitarSirakov if this option is not suitable for you, then we may consider implementing def __arrow_array__(self, type=None): for Series object, which will fix the problems with your current reproducer, but will most likely be slower.

anmyachev avatar Dec 07 '23 18:12 anmyachev

The implementation is awesome. However, the DataFrame exchange protocol doesn't fully support certain types, such as datetime64[ms]. Fortunately, I've found some workable solutions, so it's manageable.

Given my goal of maximizing efficiency with the resources at hand, I'll have to pass on this one :). I thought Modin would be an almost perfect drop-in replacement for Pandas, and perhaps that led me to expect a similar outcome here.

Thanks for the support – it's been a real time-saver.

DimitarSirakov avatar Dec 07 '23 19:12 DimitarSirakov