Modin does not cast 'int' to 'str' in `read_csv` when pandas does
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Any
- Modin version (modin.__version__): 0.8.0
- Python version: 3.7.5
- Code we can use to reproduce:
if __name__ == "__main__":
    import pandas
    import modin.pandas as pd
    from modin.pandas.test.test_io import (
        TEST_CSV_FILENAME,
        df_equals,
    )

    pandas.DataFrame({"col": ["str", 0, 1, 2]}).to_csv(TEST_CSV_FILENAME)
    md_df = pd.read_csv(TEST_CSV_FILENAME)
    pd_df = pandas.read_csv(TEST_CSV_FILENAME)
    try:
        df_equals(md_df, pd_df)
    except Exception as e:
        print(e)  # numpy array values are different (75.0 %)

    print(type(md_df["col"].iloc[1]), md_df["col"].iloc[1])  # <class 'numpy.int64'> 0
    print(type(pd_df["col"].iloc[1]), pd_df["col"].iloc[1])  # <class 'str'> '0'
Describe the problem
When any string value appears in a column, pandas reads every value in that column as a string. Modin does not, so its result does not match pandas.
This still seems to be a problem on master. See the output of the reproducer (with slight changes):
DataFrame.iloc[:, 1] (column name="col") are different
DataFrame.iloc[:, 1] (column name="col") values are different (75.0 %)
[index]: [0, 1, 2, 3]
[left]: [str, 0, 1, 2]
[right]: [str, 0, 1, 2]
<class 'numpy.int64'> 0
<class 'str'> 0
The problem seems to be associated with dataframe partitioning.
For example, suppose we save the dataframe pandas.DataFrame({"col": ["str", 0, 1, 2, 3, 4, 5]}) as a csv and load it with md_df = pd.read_csv(TEST_CSV_FILENAME). Given that the dataframe is split into 3 partitions in modin, it would look like this:
A = index  col       B = index  col       C = index  col
    0      'str'         0      2             0      4
    1      '0'           1      3             1      5
    2      '1'
Pandas usually casts all values in a column to strings if at least one value is a string. Using modin, all values in a column are cast to strings if at least one value in the same partition is a string. Otherwise values are left as is.
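The per-partition behaviour described above can be imitated with plain pandas. This is only a rough sketch: it assumes that reading with `chunksize` is a fair stand-in for independently parsed partitions, which is not how Modin actually splits a file. The CSV text is an in-memory copy of the example column.

```python
import io

import pandas

# In-memory copy of the example column (assumption: no index column written).
CSV_TEXT = "col\nstr\n0\n1\n2\n3\n4\n5\n"

# Whole-file read: type inference sees the string, so every value stays a str.
whole = pandas.read_csv(io.StringIO(CSV_TEXT))

# Chunked read: each 3-row chunk is parsed and type-inferred on its own,
# standing in here for separately parsed partitions.
chunks = list(pandas.read_csv(io.StringIO(CSV_TEXT), chunksize=3))
chunked = pandas.concat(chunks, ignore_index=True)

# Row 1 sits in the chunk that contains "str"; row 3 sits in a purely numeric chunk.
for i in (1, 3):
    print(i, type(whole["col"].iloc[i]).__name__, type(chunked["col"].iloc[i]).__name__)
```

Values in the chunk that contains the string remain strings, while values in the purely numeric chunks come back as integers, which mirrors the mismatch reported in this issue.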
I attempted to implement a naive solution by iterating through all columns of all partitions and casting the column with astype() if the partition dtype and modin dataframe dtype were different. However, this made read_csv() significantly slower:
- read_csv on modin with no changes: 231 seconds
- read_csv on modin with changes: 282 seconds
- read_csv on pandas: 331 seconds
A more efficient solution should be achievable by running the column type casting in parallel.
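A minimal sketch of that idea, using plain pandas frames as stand-ins for partitions and a thread pool from the standard library as the parallel executor. The helper names (`columns_needing_str`, `cast_partition`, `unify_dtypes`) are hypothetical and are not Modin APIs. The sketch assumes every partition has the same columns, and that a column should become `str` whenever at least one partition holds it as non-numeric.

```python
from concurrent.futures import ThreadPoolExecutor

import pandas
from pandas.api.types import is_numeric_dtype


def columns_needing_str(partitions):
    """Return columns that are non-numeric in at least one partition."""
    return {
        col
        for part in partitions
        for col in part.columns
        if not is_numeric_dtype(part[col])
    }


def cast_partition(part, str_cols):
    """Cast the numeric columns listed in str_cols to str; leave the rest alone."""
    to_cast = {col: str for col in str_cols if is_numeric_dtype(part[col])}
    return part.astype(to_cast) if to_cast else part


def unify_dtypes(partitions, max_workers=4):
    """Cast each partition independently, one task per partition."""
    str_cols = columns_needing_str(partitions)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda p: cast_partition(p, str_cols), partitions))


# Stand-in partitions matching the example in this thread.
parts = [
    pandas.DataFrame({"col": ["str", "0", "1"]}),
    pandas.DataFrame({"col": [2, 3]}),
    pandas.DataFrame({"col": [4, 5]}),
]
result = pandas.concat(unify_dtypes(parts), ignore_index=True)
print(result["col"].tolist())
```

Only partitions whose dtype differs are touched, so partitions that already hold strings are returned unchanged. Whether threads give a real speedup here is not certain. Inside Modin the cast would more plausibly run as one task per partition on whichever execution engine is in use.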
Works on latest master for me so closing.