doltpy
doltpy copied to clipboard
Use SQL data types when reading into pandas
This is a feature request. Currently, read_pandas
and read_pandas_sql
use SQL->CSV->Pandas conversion behind the scenes. It would be great if the resulting DataFrame could inherit some info from the SQL table schema, such as data types.
Minimal example:
dolt_database> CREATE TABLE mytable (
-> id int NOT NULL,
-> is_human bool(1),
-> PRIMARY KEY (id)
-> );
dolt_database> INSERT INTO mytable
-> VALUES (1, 1);
Query OK, 1 row affected
dolt_database> INSERT INTO mytable VALUES (2, 0);
Query OK, 1 row affected
dolt_database> SELECT * FROM mytable;
+----+----------+
| id | is_human |
+----+----------+
| 1 | 1 |
| 2 | 0 |
+----+----------+
In [5]: read_pandas_sql(dolt, "SELECT * from mytable")
Out[5]:
id is_human
0 1 1
1 2 0
In [7]: df.dtypes
Out[7]:
id object
is_human object
dtype: object
If this feature is implemented, the result would be:
In [11]: df.dtypes
Out[11]:
id int64
is_human bool
dtype: object
I appreciate that this is a time consuming change. I believe though that many use cases would benefit. In my case, I often use "read table via dolt, update dataframe in pandas, write updated table via dolt" workflow. Having more intelligent data type resolution would help me avoid different dtypes within a column. It would also reduce queries like:
df[df["my_bool_column"] == "1"]
Instead, I'd use
df[df["my_bool_column"]]