
Modin does not cast 'int' to 'str' at `read_csv` when Pandas does

Open dchigarev opened this issue 5 years ago • 2 comments

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Any
  • Modin version (modin.__version__): 0.8.0
  • Python version: 3.7.5
  • Code we can use to reproduce:
if __name__ == "__main__":
    import pandas
    import modin.pandas as pd
    from modin.pandas.test.test_io import (
        TEST_CSV_FILENAME,
        df_equals,
    )

    pandas.DataFrame({"col": ["str", 0, 1, 2]}).to_csv(TEST_CSV_FILENAME)

    md_df = pd.read_csv(TEST_CSV_FILENAME)
    pd_df = pandas.read_csv(TEST_CSV_FILENAME)

    try:
        df_equals(md_df, pd_df)
    except Exception as e:
        print(e)  # numpy array values are different (75.0 %)

    print(type(md_df["col"].iloc[1]), md_df["col"].iloc[1])  # <class 'numpy.int64'> 0
    print(type(pd_df["col"].iloc[1]), pd_df["col"].iloc[1])  # <class 'str'> 0

Describe the problem

When a string value appears anywhere in a column, Pandas casts all values in that column to strings. Modin does not, so its result does not match Pandas.

dchigarev avatar Aug 18 '20 19:08 dchigarev

This still seems to be a problem on master. See the output of the reproducer (with slight changes):

DataFrame.iloc[:, 1] (column name="col") are different

DataFrame.iloc[:, 1] (column name="col") values are different (75.0 %)
[index]: [0, 1, 2, 3]
[left]:  [str, 0, 1, 2]
[right]: [str, 0, 1, 2]
<class 'numpy.int64'> 0
<class 'str'> 0

pyrito avatar Aug 18 '22 22:08 pyrito

The problem seems to be associated with dataframe partitioning.

For example, take the dataframe pandas.DataFrame({"col": ["str", 0, 1, 2, 3, 4, 5]}), save it as a CSV, and load it with md_df = pd.read_csv(TEST_CSV_FILENAME). If Modin splits the dataframe into 3 partitions, they would look like this:

A = index    col        B = index    col        C = index    col
        0  'str'                0      2                0      4
        1    '0'                1      3                1      5
        2    '1'

Pandas usually casts all values in a column to strings if at least one value is a string. Using modin, all values in a column are cast to strings if at least one value in the same partition is a string. Otherwise values are left as is.
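The per-partition inference described above can be imitated with plain pandas by reading the file in chunks. This is only a rough sketch: it uses `chunksize=3` as a stand-in for Modin's partitioning (so the split is 3/3/1 rows rather than the 3/2/2 shown above), and it builds the CSV in memory without the index column for brevity.

```python
import io

import pandas

# Same values as the example above, written without the index column.
csv_text = "col\nstr\n0\n1\n2\n3\n4\n5\n"

# Whole-file read: one string in the column keeps every value as text.
whole = pandas.read_csv(io.StringIO(csv_text))

# Chunked read: each 3-row chunk infers its dtype independently,
# which is used here as an analogy for one Modin partition.
chunks = list(pandas.read_csv(io.StringIO(csv_text), chunksize=3))
mixed = pandas.concat(chunks, ignore_index=True)

# Compare the element types of the two results.
print([type(v).__name__ for v in whole["col"]])
print([type(v).__name__ for v in mixed["col"]])
```

In the chunked result only the rows that shared a chunk with "str" come back as strings, which is the same kind of mismatch the reproducer shows.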

I attempted a naive solution: iterate through all columns of all partitions and cast a column with astype() whenever the partition dtype differs from the modin dataframe dtype. However, this made read_csv() significantly slower:

  • read_csv on modin with no changes: 231 seconds
  • read_csv on modin with changes: 282 seconds
  • read_csv on pandas: 331 seconds

A more efficient solution should be achievable by running the column type casting in parallel.
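As a rough illustration of that idea, the sketch below casts each partition independently so the work can be spread across workers. It is not Modin code: `unify_partition_dtypes` is a hypothetical helper, the partitions are plain pandas DataFrames, and a thread pool stands in for whatever execution engine would really run the casts. It also assumes that any non-numeric partition dtype means "strings", which is a simplification.

```python
from concurrent.futures import ThreadPoolExecutor

import pandas
from pandas.api.types import is_numeric_dtype


def unify_partition_dtypes(partitions):
    """Cast numeric columns to str where another partition holds non-numeric data.

    Hypothetical helper: `partitions` is a list of pandas DataFrames
    that share the same columns.
    """
    columns = partitions[0].columns
    # A column must become text if any partition failed to infer a numeric dtype.
    needs_str = [
        col
        for col in columns
        if any(not is_numeric_dtype(part[col]) for part in partitions)
    ]

    def cast(part):
        # Only touch columns that are still numeric in this partition.
        to_cast = {col: str for col in needs_str if is_numeric_dtype(part[col])}
        return part.astype(to_cast) if to_cast else part

    # Each partition is cast independently, so the casts can run concurrently.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(cast, partitions))


parts = [
    pandas.DataFrame({"col": ["str", "0", "1"]}),
    pandas.DataFrame({"col": [2, 3]}),
    pandas.DataFrame({"col": [4, 5]}),
]
fixed = pandas.concat(unify_partition_dtypes(parts), ignore_index=True)
print(fixed["col"].tolist())
```

One caveat with any after-the-fact cast: it cannot always recover the original text of a value. A float partition cast with astype(str) may not reproduce the exact characters from the file, so results could still differ from pandas in some cases.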

billiam-wang avatar Sep 12 '22 07:09 billiam-wang

Works on latest master for me so closing.

YarShev avatar Jan 10 '24 14:01 YarShev