civis-python
Pandas Column Typing
Overview: Add functionality to get column types, including date, and sortkeys from Redshift meta info for Pandas IO
I modified the client to allow me to pass arguments to the pd.read_csv call on line 227 in civis/io/_tables.py. I gather the column types using client.tables.get, after which I map the SQL types to their corresponding Python types. Doing this eliminates the automatic str-to-int conversion (e.g. for zipcodes) and automatically identifies the date columns. Is there any interest in adding this in a more thoughtful way to the code?
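For context, a minimal pandas example of the problem (the column names here are made up for illustration):

import io
import pandas as pd

csv = io.StringIO("zipcode,signup_date\n02139,2017-01-15\n")

# Default type inference turns the zipcode into an integer, dropping the
# leading zero, and leaves the date column as a plain string.
df = pd.read_csv(csv)
print(df.zipcode[0])   # 2139

csv.seek(0)
# With explicit dtypes and parse_dates, both columns come through correctly.
df = pd.read_csv(csv, dtype={"zipcode": str}, parse_dates=["signup_date"])
print(df.zipcode[0])   # '02139'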
Yes this is a good idea! I am not sure who has time to get to it, but if you want to make a PR feel free!
+1 ! You can tag me for review. I've dealt with similar issues in my own code.
I have the time right now, but I'm not super familiar with the code. Basically, what's the fastest way to get the table metadata given the database and table names, since those seem to be readily available in the existing code? I currently do the following in my code:
- Get the table in the database by searching over the tables:
client.tables.list(database_id=DATABASE_META['id'], limit=1000)
- Use the table id to get column data
table_cols = client.tables.get(table['id'])['columns']
Hmmm. Those calls might require the table scanner. You could query the database for the info directly.
The types will be populated without a full table scanner run. You can use something like
db_id = client.get_database_id(database)
tables = client.tables.list(database_id=db_id, schema=schema, name=table)
columns = client.tables.get(tables[0].id).columns
dtypes = {col.name: redshift_to_py_type(col.sql_type)
          for col in columns}
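For reference, the redshift_to_py_type helper above isn't part of the client; one possible sketch (the type mapping is an assumption and would need checking against what the API actually returns):

# Hypothetical mapping from Redshift type names to Python types.
_REDSHIFT_TO_PY = {
    'smallint': int,
    'integer': int,
    'bigint': int,
    'decimal': float,
    'numeric': float,
    'real': float,
    'double precision': float,
    'character': str,
    'character varying': str,
}

def redshift_to_py_type(sql_type):
    base = sql_type.split('(')[0].strip().lower()  # 'decimal(17,9)' -> 'decimal'
    return _REDSHIFT_TO_PY.get(base, str)          # fall back to str rather than guess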
Here is the function I plan on using:
# table_id is obtained via client.get_table_id; initially it would be
# passed from read_civis into read_civis_sql.
if use_pandas:
    table_meta = client.tables.get(table_id)
    # The split is to handle e.g. DECIMAL(17,9)
    redshift_types = {col['name']: col['sql_type'].split('(')[0]
                      for col in table_meta['columns']}
    py_types = {name: SQL_PANDAS_MAP[type_]
                for (name, type_) in redshift_types.items()
                if type_ not in DATE_TYPES}
    date_cols = [name for (name, type_) in redshift_types.items()
                 if type_ in DATE_TYPES]
    sortkeys = table_meta['sortkeys'].split(',')
    # Not sure about the best way to handle null fields
    # or how they may appear
    data = pd.read_csv(url, parse_dates=date_cols,
                       index_col=sortkeys,
                       converters=py_types)
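SQL_PANDAS_MAP and DATE_TYPES aren't defined above; here's one illustrative shape for them, assuming the mapped values are callables so they work with read_csv's converters argument (the exact Redshift type names would need to be confirmed against the column metadata the API returns):

DATE_TYPES = {'date', 'timestamp', 'timestamp without time zone',
              'timestamp with time zone'}

# Values are callables so they can be passed via converters=; booleans are
# left as str here since bool('false') would evaluate to True.
SQL_PANDAS_MAP = {
    'smallint': int,
    'integer': int,
    'bigint': int,
    'decimal': float,
    'numeric': float,
    'real': float,
    'double precision': float,
    'boolean': str,
    'character': str,
    'character varying': str,
}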
Cool. Go ahead and make a branch and start the PR. We can do actual code review there.
This would be a major improvement. The difficulty I've had with this before is that there doesn't seem to be a good way to get the type information for a general query passed into read_civis_sql. If someone has a suggestion, I would love to hear it, but I think we could only provide types for read_civis.
That said, it would still be useful to implement this for just read_civis. I could imagine somewhere around here adding the code @stephen-hoover wrote above. Then we could pass dtypes to read_civis_sql here, since the kwargs in read_civis_sql ultimately get passed to pandas.read_csv.
Oof, we might have another problem. read_civis accepts columns, which could be arbitrary SQL. So if someone were to call
columns = ['my_var = 1']
read_civis('table', 'database', columns=columns)
we simply wouldn't have the type information.
Hm. Good point. pandas.read_csv accepts a dtype argument, which is how we'd implement this. The dtype doesn't have to include all columns, and it can include columns which aren't present. I think the only time we'd run into trouble would be if users tried to export columns with the same name as a column in the actual table, but with a different type. I suppose that could happen, couldn't it?
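A quick check of that behavior (column names made up):

import io
import pandas as pd

csv = io.StringIO("zipcode,total\n02139,3\n")

# dtype only needs to cover some columns, and keys for columns that are
# not in the file are ignored rather than raising.
df = pd.read_csv(csv, dtype={'zipcode': str, 'not_a_column': float})
print(df.dtypes)   # zipcode: object, total: int64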
I'm not sure what the right answer is. We could give a boolean switch to enable type inference when reading with pandas. That would let users at least turn it off if they're transforming types. Or we could provide examples of how to do this yourself, or helper functions.
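To make the switch idea concrete, a hypothetical call might look like this (the infer_types keyword is only an illustration, not an existing parameter):

import civis

# Hypothetical: opt out of Redshift-based type inference when reading with
# pandas, e.g. because the columns argument transforms types.
df = civis.io.read_civis('schema.tablename', 'database',
                         use_pandas=True,
                         infer_types=False)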