`parse_dates` parameter in OmniSci's `read_csv` doesn't work since Arrow 3.0

Open dchigarev opened this issue 4 years ago • 4 comments

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Any
  • Modin version (modin.__version__): 0.11
  • Python version: 3.7.5
  • Code we can use to reproduce:
import modin.pandas as pd
import pandas

pandas.DataFrame({"a": ["2020-01-01", "2020-02-02"]}).to_csv("test.csv")

md_df = pd.read_csv("test.csv", parse_dates=False)
pd_df = pandas.read_csv("test.csv", parse_dates=False)

print(f"modin dtypes:\n{md_df.dtypes}\n")  # a: datetime64[ns]
print(f"pandas dtypes:\n{pd_df.dtypes}\n") # a: object

Output:

modin dtypes:
              int64
a    datetime64[ns]
dtype: object

pandas dtypes:
Unnamed: 0     int64
a             object
dtype: object

Describe the problem

Arrow 3.0 introduced automatic date32 type inference in its read_csv, and the only way to turn that inference off is to provide an explicit type schema. Computing a type schema that properly handles pandas' parse_dates turns out to be very expensive in some cases (mostly for wide frames), so it was decided to do this only in a strict compatibility mode. Until that mode is implemented, the parse_dates parameter is considered unsupported in the OmniSci backend.

Commit that brings parse_dates support, assuming that strict mode is implemented.

dchigarev avatar Sep 27 '21 11:09 dchigarev

What is the suggested way to read a CSV file with datetime columns now? AFAIK converting columns to datetime after reading is impossible too: https://docs-new.omnisci.com/sql/data-manipulation-dml/sql-capabilities#type-cast-support

gshimansky avatar Sep 27 '21 19:09 gshimansky

@gshimansky all columns that look like dates are converted to datetime automatically since Arrow 3.0. This deviates from pandas, which doesn't convert any column to datetime unless it is explicitly listed in the parse_dates parameter. To get the pandas behaviour of leaving certain columns unconverted, you can specify an explicit type schema via the dtype parameter of read_csv:

df = pd.read_csv(filepath, dtype={"datetime_column_that_shouldn't_be_converted_to_datetime_type": "string"})
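For a file with many columns, writing that dtype mapping by hand gets tedious. The sketch below is one possible way to build it, using only the standard library. `string_overrides` and `looks_like_date` are hypothetical helpers, not part of Modin or Arrow. They sample the first rows and pin every column whose sampled values all parse as `YYYY-MM-DD` to "string", which is only a rough approximation of Arrow's own inference rules.

```python
import csv
from datetime import datetime


def looks_like_date(value):
    """Return True if `value` parses as a YYYY-MM-DD calendar date."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    return True


def string_overrides(path, parse_dates=(), sample_rows=100):
    """Build a dtype mapping that pins date-like columns to "string".

    Hypothetical helper (an assumption for illustration, not a Modin API).
    Columns named in `parse_dates` are left out of the mapping so that
    they can still be inferred as dates by the reader.
    """
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        # Every column not requested via parse_dates starts as a candidate.
        candidates = set(reader.fieldnames or ()) - set(parse_dates)
        date_like = set()
        for i, row in enumerate(reader):
            if i >= sample_rows:
                break
            for name in list(candidates):
                value = row.get(name)
                if not value:
                    # Empty or missing cells give no evidence either way.
                    continue
                if looks_like_date(value):
                    date_like.add(name)
                else:
                    # One non-date value rules the column out.
                    candidates.discard(name)
    # Keep only columns that had at least one date and no counterexample.
    return {name: "string" for name in sorted(candidates & date_like)}
```

The result can be passed straight through, e.g. `pd.read_csv("test.csv", dtype=string_overrides("test.csv"))`. Because it only inspects a sample, a column whose first non-date value appears after `sample_rows` would still be pinned to "string"; raising `sample_rows` trades speed for accuracy.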

dchigarev avatar Sep 27 '21 22:09 dchigarev

@dchigarev Thank you for the clarification. I missed the part about Arrow now detecting datetime automatically.

gshimansky avatar Sep 28 '21 00:09 gshimansky

The error still reproduces on master.

anmyachev avatar Jul 02 '23 11:07 anmyachev

HDK engine is deprecated and will be removed in a future version.

YarShev avatar May 15 '24 18:05 YarShev