
dataframe.read_sql_query does not work with datetime index

Open · ljmc-github opened this issue 3 years ago · 2 comments

What happened:

Dask's read_sql_query fails when the index column is a datetime column.

See the SO question here (not mine, I am only reporting from it), and the MCVE below.

What you expected to happen:

Dask reads from the database and returns a DataFrame.

Minimal Complete Verifiable Example:

from datetime import datetime

import dask.dataframe as dd
import pandas as pd
from sqlalchemy import Column, DateTime, Integer, create_engine, select
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class Loan(Base):
    __tablename__ = "test_loans"
    id = Column(Integer, primary_key=True)
    date = Column(DateTime)
    balance = Column(Integer)


engine = create_engine("sqlite:////tmp/so.db", future=True, echo=True)

Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add_all(
        [
            Loan(date=datetime.min, balance=10),
            Loan(date=datetime.min, balance=11),
            Loan(date=datetime.min, balance=12),
            Loan(date=datetime.min, balance=13),
        ]
    )
    session.commit()

dd.read_sql_query(
    select([Loan.date, Loan.balance]),
    con="sqlite:////tmp/so.db",
    index_col="date",
    npartitions=3,
    engine_kwargs={"echo": True},
)

Anything else we need to know?:

The issue appears to come from this block, which calculates the index limits before dividing them into partitions:

https://github.com/dask/dask/blob/8b95f983c232c1bd628e9cba0695d3ef229d290b/dask/dataframe/io/sql.py#L130-L137

This essentially runs the following (note that func is sqlalchemy.func, which is not imported in the MCVE above):

from sqlalchemy import func

pd.read_sql(select([func.min(Loan.date), func.max(Loan.date)]), con=engine.connect())

The output is:

                 min_1                max_1
0  0001-01-01 00:00:00  0001-01-01 00:00:00

and the dtypes are object, since the parse_dates kwarg defaults to None and the datetimes never get parsed:

min_1    object
max_1    object
dtype: object
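
To make this concrete, here is a minimal sketch (using hypothetical in-bounds dates, not the datetime.min values from the MCVE) of how the limit values stay object-typed unless they are explicitly parsed:

import pandas as pd

# hypothetical limits as they come back from the min/max query: plain strings
limits = pd.DataFrame({"min_1": ["2022-08-01 00:00:00"], "max_1": ["2022-08-31 00:00:00"]})
print(limits.dtypes)           # min_1: object, max_1: object

# parsing them explicitly is what parse_dates would normally take care of
parsed = limits.apply(pd.to_datetime)
print(parsed.dtypes)           # min_1: datetime64[ns], max_1: datetime64[ns]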

With the dtype being object, dask then falls through to

https://github.com/dask/dask/blob/8b95f983c232c1bd628e9cba0695d3ef229d290b/dask/dataframe/io/sql.py#L160-L164

and raises the exception.
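
Roughly speaking (this is a paraphrase of the check, not the actual dask source), the index dtype determines whether divisions can be derived from the min/max at all:

import numpy as np

# paraphrase of the dispatch, not the actual dask source: only numeric or
# datetime index columns let dask compute divisions from the min/max query
dtype = np.dtype("O")  # what the un-parsed min/max query yields here
if np.issubdtype(dtype, np.datetime64) or np.issubdtype(dtype, np.number):
    print("divisions can be computed from the min/max values")
else:
    print("dask raises: the index column must be numeric or datetime, "
          "or explicit divisions must be passed")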

Environment:

  • Dask version: dask-2022.8.0
  • Python version: 3.10.5
  • Operating System: N/A
  • Install method (conda, pip, source): pip

ljmc-github · Aug 13 '22 13:08

I am happy to send a PR if you can guide me toward the solution, as I am brand new to dask.

ljmc-github · Aug 13 '22 13:08

I have looked into this a bit and found that the problem only exists with read_sql_query and not with read_sql_table. As far as I can tell, this is because read_sql_table autoloads the table schema, e.g., the data types of the SQL columns. Thus, parse_dates isn't even necessary, as the dates are automatically parsed by SQLAlchemy.
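
For comparison, a minimal sketch of the read_sql_table path described above (assuming the test_loans table and SQLite database from the MCVE; this is the call as I understand it, not taken verbatim from the thread):

import dask.dataframe as dd

# read_sql_table reflects the table schema, so the DateTime type of the
# "date" column is known and SQLAlchemy returns parsed datetimes
ddf = dd.read_sql_table(
    "test_loans",
    con="sqlite:////tmp/so.db",
    index_col="date",
    npartitions=3,
)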

I experimented with passing parse_dates to every pd.read_sql... call, but the results were quite bad:

  • parse_dates could not parse datetime.min (out-of-bounds error; see the sketch after this list)
  • SQLAlchemy then cannot bind pd.Timestamp values into the query, only datetime.datetime values work
  • when converted to datetimes, SELECT datetime FROM test WHERE datetime >= '2022-08-01 00:00:00' will not select rows where datetime == '2022-08-01 00:00:00'
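
A minimal sketch of the first point; the out-of-bounds limitation is a pandas Timestamp restriction, not something specific to dask:

from datetime import datetime

import pandas as pd

# pandas Timestamps only cover roughly 1677-09-21 to 2262-04-11, so the
# datetime.min values inserted in the MCVE cannot be parsed
try:
    pd.to_datetime(datetime.min)
except pd.errors.OutOfBoundsDatetime as exc:
    print(f"parse failed: {exc}")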

While most of these problems could probably be resolved, I think the best solution would be to update the documentation to point out that

The problem is solved by casting the column as DateTime in the SQLAlchemy select statement.

This is the cleanest way of handling the problem. Plus, it would be nice to raise a helpful error in case someone does not read the docs too carefully.
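
For reference, a minimal sketch of that workaround against the MCVE above (assuming the Loan model and SQLite database defined there; the cast is the only change to the original query):

import dask.dataframe as dd
from sqlalchemy import DateTime, cast, select

# casting the column in the select statement gives SQLAlchemy the type
# information it needs to return real datetimes instead of strings
query = select([cast(Loan.date, DateTime).label("date"), Loan.balance])

ddf = dd.read_sql_query(
    query,
    con="sqlite:////tmp/so.db",
    index_col="date",
    npartitions=3,
)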

McToel · Aug 16 '22 02:08