dataframe.read_sql_query does not work with datetime index
What happened:
Dask fails in read_sql_query when a datetime column is used as the index.
See the SO question here (not mine, only reporting from it), and the MCVE below.
What you expected to happen:
Dask reads the database and returns a df.
Minimal Complete Verifiable Example:
from datetime import datetime

import dask.dataframe as dd
import pandas as pd
from sqlalchemy import Column, DateTime, Integer, create_engine, select
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class Loan(Base):
    __tablename__ = "test_loans"

    id = Column(Integer, primary_key=True)
    date = Column(DateTime)
    balance = Column(Integer)


engine = create_engine("sqlite:////tmp/so.db", future=True, echo=True)
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add_all(
        [
            Loan(date=datetime.min, balance=10),
            Loan(date=datetime.min, balance=11),
            Loan(date=datetime.min, balance=12),
            Loan(date=datetime.min, balance=13),
        ]
    )
    session.commit()

dd.read_sql_query(
    select([Loan.date, Loan.balance]),
    con="sqlite:////tmp/so.db",
    index_col="date",
    npartitions=3,
    engine_kwargs={"echo": True},
)
Anything else we need to know?:
The issue appears to come from this block which calculates the limits before dividing them into partitions:
https://github.com/dask/dask/blob/8b95f983c232c1bd628e9cba0695d3ef229d290b/dask/dataframe/io/sql.py#L130-L137
That block essentially runs the following:
pd.read_sql(select([func.min(Loan.date), func.max(Loan.date)]), con=engine.connect())
The output of that query is
                 min_1                max_1
0  0001-01-01 00:00:00  0001-01-01 00:00:00
and the dtypes are object, since by default the parse_dates kwarg is None and the datetimes do not get parsed:
min_1    object
max_1    object
dtype: object
With the dtype being object, dask then falls through to
https://github.com/dask/dask/blob/8b95f983c232c1bd628e9cba0695d3ef229d290b/dask/dataframe/io/sql.py#L160-L164
and raises the exception.
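For reference, here is a self-contained sketch of that limits query, assuming the MCVE above has already been run in the same session (it reuses the Loan model and the /tmp/so.db file), which reproduces the object dtypes:

import pandas as pd
from sqlalchemy import create_engine, func, select

# Reuses the Loan model and the SQLite file created by the MCVE above.
engine = create_engine("sqlite:////tmp/so.db", future=True)

# This mirrors the min/max limits query that dask issues internally.
limits = pd.read_sql(
    select([func.min(Loan.date), func.max(Loan.date)]),
    con=engine.connect(),
)
print(limits)
print(limits.dtypes)  # both columns come back as object, not datetime64[ns]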
Environment:
- Dask version: dask-2022.8.0
- Python version: 3.10.5
- Operating System: N/A
- Install method (conda, pip, source): pip
I am happy to send a PR if you can guide me toward the solution, as I am brand new to Dask.
I have looked into this a little bit and I found that the problem only exists with read_sql_query and not with read_sql_table. As far as I can tell, this is because read_sql_table autoloads the whole table schema, e.g. the data types, from the SQL table. Thus, parse_dates isn't even necessary, as the dates are automatically parsed by SQLAlchemy.
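For comparison, a sketch of a read_sql_table call over the same table (table and column names taken from the MCVE above):

import dask.dataframe as dd

# read_sql_table reflects the table schema, so the DateTime type of the
# "date" column is known and the index limits come back as datetimes.
ddf = dd.read_sql_table(
    "test_loans",
    con="sqlite:////tmp/so.db",
    index_col="date",
    npartitions=3,
)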
I tried passing parse_dates to every pd.read_sql call, but the results were quite bad:
- parse_dates could not parse datetime.min (out-of-bounds error)
- SQLAlchemy then cannot bind pd.Timestamp values into the query; only datetime.datetime works
- when converted to datetimes, SELECT datetime FROM test WHERE datetime >= '2022-08-01 00:00:00' will not select rows where datetime == '2022-08-01 00:00:00'
While I think most of these problems could be resolved, the best solution would probably be to update the documentation to point out that
The problem is solved by casting the column as DateTime in the SQLAlchemy select statement.
which is the cleanest way of handling it. Plus, it would be nice to raise a helpful error in case someone does not read the docs too carefully.
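For concreteness, a sketch of that cast-based workaround against the MCVE above (the .label("date") is just an assumption on my side so that index_col="date" still matches the selected column):

import dask.dataframe as dd
from sqlalchemy import DateTime, cast, select

# Casting the column to DateTime in the select statement makes SQLAlchemy
# return real datetimes for the min/max limits, so dask can build partitions.
ddf = dd.read_sql_query(
    select([cast(Loan.date, DateTime).label("date"), Loan.balance]),
    con="sqlite:////tmp/so.db",
    index_col="date",
    npartitions=3,
)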