BUG: ArrowInvalid Error when Converting Concatenated Modin DataFrame to PyArrow Table
Modin version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest released version of Modin.
- [ ] I have confirmed this bug exists on the main branch of Modin. (In order to do this you can follow this guide.)
Reproducible Example
import modin.pandas as mpd
import pyarrow as pa
# Creating Modin DataFrames
mpd_df_1 = mpd.DataFrame({'a': ['1', '2', '3'], 'b': ['4', '5', '6']})
mpd_df_2 = mpd.DataFrame({'a': ['7', '8', '9'], 'b': ['10', '11', '12']})
mpd_df_3 = mpd.DataFrame({'a': ['13', '14', '15'], 'b': ['16', '17', '18']})
# Concatenating DataFrames
test_df = mpd.concat([mpd_df_1, mpd_df_2, mpd_df_3])
# Converting to PyArrow Table (Error Occurs Here)
table_from_modin = pa.Table.from_pandas(df=test_df)
Issue Description
I encountered an ArrowInvalid error when converting a concatenated Modin DataFrame to a PyArrow table. The error occurs only when several Modin DataFrames are concatenated and the result is then converted. The same steps work with pandas DataFrames, and they also work with a Modin DataFrame that has not been concatenated.
Expected Behavior
The concatenated Modin DataFrame should convert to a PyArrow table without errors, just as a concatenated pandas DataFrame or a non-concatenated Modin DataFrame does.
Error Logs
ArrowInvalid: ('Could not convert 0 1\n0 7\n0 13\nName: a, dtype: object with type Series: did not recognize Python value type when inferring an Arrow data type', 'Conversion failed for column a with type object')
Installed Versions
INSTALLED VERSIONS
commit : b99cf06a0d3bf27cb5d2d03d52a285cc91cfb6f6
python : 3.10.5.final.0
python-bits : 64
OS : Darwin
OS-release : 23.0.0
Version : Darwin Kernel Version 23.0.0: Fri Sep 15 14:42:57 PDT 2023; root:xnu-10002.1.13~1/RELEASE_ARM64_T8112
machine : arm64
processor : arm
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
Modin dependencies
modin : 0.25.1
ray : 2.8.1
dask : None
distributed : None
hdk : None
pandas dependencies
pandas : 2.1.3
numpy : 1.26.2
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 65.6.3
pip : 22.3.1
Cython : None
pytest : 7.4.3
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.9.9
jinja2 : 3.1.2
IPython : 8.18.1
pandas_datareader : None
bs4 : 4.12.2
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : 2023.12.0
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 12.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : 2.0.23
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None
Hi @DimitarSirakov, thanks for the contribution!
I can reproduce the problem and will take a look.
I found that passing ignore_index=True to concat avoids the error, but I can't figure out why it helps, or why pandas doesn't need it.
@DimitarSirakov the conversion iterates over the index. If the index is not unique, a lookup may return a Modin Series rather than a scalar, and pyarrow cannot work with that.
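The effect of a non-unique index on label lookup can be illustrated with plain pandas. This is a minimal sketch of the general behaviour only; it assumes Modin's Series follows the same lookup semantics and does not reproduce the actual code path inside Modin or pyarrow.

```python
import pandas as pd

# Concatenating without ignore_index keeps each piece's own 0, 1, 2 labels.
s = pd.concat([
    pd.Series(['1', '2', '3'], name='a'),
    pd.Series(['7', '8', '9'], name='a'),
    pd.Series(['13', '14', '15'], name='a'),
])

# Label 0 now occurs three times, so the lookup yields a Series of three
# values instead of a single scalar.
picked = s.loc[0]

# With a fresh unique index the same lookup yields one scalar.
scalar = s.reset_index(drop=True).loc[0]
```

The three values under label 0 are the ones that appear in the error message from the report.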
An interchange protocol was developed for interaction between DataFrame libraries. An implementation of this protocol can be used through https://arrow.apache.org/docs/13.0/python/generated/pyarrow.interchange.from_dataframe.html#pyarrow.interchange.from_dataframe. However, as far as I know, it is only available in pyarrow starting from version 13.0.0. Would you be able to try this option?
@DimitarSirakov if this option doesn't suit you, we could consider implementing def __arrow_array__(self, type=None): for the Series object. That would fix the problem in your current reproducer, but would most likely be slower.
The implementation is awesome. However, the DataFrame exchange protocol doesn't fully support certain types, such as datetime64[ms]. Fortunately, I've found some workable solutions, so it's manageable.
Given my goal of maximizing efficiency with the resources at hand, I'll have to pass on this one :). I thought Modin would be an almost perfect drop-in replacement for Pandas, and perhaps that led me to expect a similar outcome here.
Thanks for the support – it's been a real time-saver.