
Use SQL data types when reading into pandas

Open · alkamid opened this issue 2 years ago · 6 comments

This is a feature request. Currently, read_pandas and read_pandas_sql go through an SQL -> CSV -> pandas conversion behind the scenes. It would be great if the resulting DataFrame could inherit some information from the SQL table schema, such as column data types.

Minimal example:

dolt_database> CREATE TABLE mytable (
            -> id int NOT NULL,
            -> is_human bool(1),
            -> PRIMARY KEY (id)
            -> );
dolt_database> INSERT INTO mytable
            -> VALUES (1, 1);
Query OK, 1 row affected
dolt_database> INSERT INTO mytable VALUES (2, 0);
Query OK, 1 row affected
dolt_database> SELECT * FROM mytable;
+----+----------+
| id | is_human |
+----+----------+
| 1  | 1        |
| 2  | 0        |
+----+----------+
In [5]: df = read_pandas_sql(dolt, "SELECT * from mytable")

In [6]: df
Out[6]:
  id is_human
0  1        1
1  2        0

In [7]: df.dtypes
Out[7]:
id          object
is_human    object
dtype: object

If this feature is implemented, the result would be:

In [11]: df.dtypes
Out[11]:
id          int64
is_human     bool
dtype: object
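
In the meantime, a rough workaround is possible in user code: fetch the table schema with a DESCRIBE query and cast the DataFrame after reading. This is only a minimal sketch, not doltpy functionality; the import path, the assumption that read_pandas_sql returns the DESCRIBE output with MySQL-style Field/Type columns, and the type map are all assumptions you would need to adjust for your setup.

from doltpy.cli.read import read_pandas_sql  # assumed import path; adjust to your doltpy version

# Illustrative mapping from MySQL-style type names to pandas dtypes (not exhaustive).
SQL_TO_PANDAS = {
    "int": "int64",
    "bigint": "int64",
    "tinyint(1)": "bool",
    "bool": "bool",
    "float": "float64",
    "double": "float64",
}

def read_typed(dolt, table):
    """Read a Dolt table and cast columns according to its SQL schema."""
    schema = read_pandas_sql(dolt, f"DESCRIBE {table}")
    df = read_pandas_sql(dolt, f"SELECT * FROM {table}")
    for _, row in schema.iterrows():
        col, target = row["Field"], SQL_TO_PANDAS.get(row["Type"])
        if target is None or col not in df.columns:
            continue
        if target == "bool":
            # Cast through int first: the strings '0' and '1' are both truthy as raw strings.
            df[col] = df[col].astype("int64").astype("bool")
        else:
            df[col] = df[col].astype(target)
    return df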

I appreciate that this would be a time-consuming change, but I believe many use cases would benefit. In my case, I often follow a "read the table via Dolt, update the DataFrame in pandas, write the updated table back via Dolt" workflow. Smarter data type resolution would help me avoid mixed dtypes within a column. It would also remove the need for filters like:

df[df["my_bool_column"] == "1"]

Instead, I'd use

df[df["my_bool_column"]]

alkamid · May 31 '22 10:05