`parse_dates` parameter in OmniSci's `read_csv` doesn't work since Arrow 3.0
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Any
- Modin version (modin.__version__): 0.11
- Python version: 3.7.5
- Code we can use to reproduce:
import modin.pandas as pd
import pandas
pandas.DataFrame({"a": ["2020-01-01", "2020-02-02"]}).to_csv("test.csv")
md_df = pd.read_csv("test.csv", parse_dates=False)
pd_df = pandas.read_csv("test.csv", parse_dates=False)
print(f"modin dtypes:\n{md_df.dtypes}\n") # a: datetime64[ns]
print(f"pandas dtypes:\n{pd_df.dtypes}\n") # a: object
Output:
modin dtypes:
int64
a datetime64[ns]
dtype: object
pandas dtypes:
Unnamed: 0 int64
a object
dtype: object
Describe the problem
Arrow 3.0 introduced automatic date32 type inference in its read_csv, and the only way to turn it off is to provide an explicit type scheme. Computing a type scheme that properly handles pandas' parse_dates appears to be very expensive in some cases (mostly for wide frames), so it was decided to do this only in a strict compatibility mode. Until that mode is implemented, the parse_dates parameter is considered unsupported in the OmniSci backend.
Commit that brings parse_dates support considering that strict mode is implemented.
What is the suggested way to read a CSV file with datetime columns now? AFAIK converting columns to datetime after reading is impossible too: https://docs-new.omnisci.com/sql/data-manipulation-dml/sql-capabilities#type-cast-support
@gshimansky since Arrow 3.0, all datetime-like columns are converted to datetime automatically. This deviates from pandas, which doesn't convert any column to datetime unless it is explicitly listed in the parse_dates parameter. To get the pandas behaviour of leaving certain columns unconverted, you can specify an explicit type scheme via the dtype parameter of read_csv:
df = pd.read_csv(filepath, dtype={"datetime_column_that_shouldn't_be_converted_to_datetime_type": "string"})
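For files with many columns, writing that dtype mapping by hand gets tedious. A possible way to build it from the CSV header, using only the standard library, is sketched below. The helper name string_dtype_scheme and the sample data are hypothetical, and passing the result to the OmniSci backend's read_csv has not been verified here:

```python
import csv
import io


def string_dtype_scheme(csv_source, keep_as_text):
    """Return a {column: "string"} mapping for the requested header columns."""
    # Only the first row (the header) is read from the source.
    header = next(csv.reader(csv_source))
    missing = [name for name in keep_as_text if name not in header]
    if missing:
        raise KeyError(f"columns not found in CSV header: {missing}")
    # Preserve header order so the mapping is easy to compare with the file.
    return {name: "string" for name in header if name in keep_as_text}


# Hypothetical sample: only "created" should stay as text.
sample = io.StringIO("id,created,label\n1,2020-01-01,x\n")
scheme = string_dtype_scheme(sample, keep_as_text={"created"})
print(scheme)
```

The resulting mapping would then be passed as pd.read_csv(filepath, dtype=scheme), in the same way as the one-column example above.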
@dchigarev Thank you for the clarification. I missed the part that Arrow now detects datetime automatically.
The error still reproduces on master.
HDK engine is deprecated and will be removed in a future version.