
Pandas Column Typing

Open gdj0nes opened this issue 7 years ago • 11 comments

Overview: Add functionality to get column types, including date, and sortkeys from Redshift meta info for Pandas IO

I modified the client to allow me to pass arguments to the pd.read_csv on line 227 in civis/io/_tables.py. I gather the column types using client.tables.get after which I map the SQL types to their corresponding Python types. Doing this eliminates the automatic str to int conversion, e.g. zipcodes, and automatically identifies the date columns. Is there any interest in adding this in a more thoughtful way to the code?

gdj0nes avatar Sep 18 '17 14:09 gdj0nes
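The automatic str-to-int conversion mentioned above is easy to demonstrate with a made-up zipcode column (this is an illustration, not code from the library):

```python
import io
import pandas as pd

# Without explicit types, read_csv parses '02139' as the integer 2139
# and silently drops the leading zero; passing dtype=str preserves it.
csv = "zipcode\n02139\n60601\n"
naive = pd.read_csv(io.StringIO(csv))
typed = pd.read_csv(io.StringIO(csv), dtype={"zipcode": str})

print(naive["zipcode"].tolist())  # integers: leading zero lost
print(typed["zipcode"].tolist())  # strings: '02139' preserved
```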

Yes this is a good idea! I am not sure who has time to get to it, but if you want to make a PR feel free!

beckermr avatar Sep 18 '17 14:09 beckermr

+1 ! You can tag me for review. I've dealt with similar issues in my own code.

stephen-hoover avatar Sep 18 '17 14:09 stephen-hoover

## Found Answer

I have the time right now, but I'm not super familiar with the code. What's the fastest way to get the table metadata given the database and table names? That information seems to be readily available in the existing code. I currently do the following in my code:

  1. Get the table in the database by searching over the tables: `client.tables.list(database_id=DATABASE_META['id'], limit=1000)`
  2. Use the table id to get the column data: `table_cols = client.tables.get(table['id'])['columns']`

gdj0nes avatar Sep 18 '17 16:09 gdj0nes

Hmmm. Those calls might require the table scanner. You could query the database for the info directly.

beckermr avatar Sep 18 '17 16:09 beckermr
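Querying the database directly could look something like this sketch, which builds an `information_schema` query. Note the assumptions: `column_type_query` is a hypothetical helper, the schema/table names are trusted (no SQL quoting or escaping is done), and the connected role can read `information_schema`.

```python
def column_type_query(schema, table):
    """Build a query for column names and types of one table.

    Hypothetical helper; assumes schema/table are trusted strings
    (no identifier quoting or escaping is performed here).
    """
    return (
        "SELECT column_name, data_type "
        "FROM information_schema.columns "
        f"WHERE table_schema = '{schema}' AND table_name = '{table}'"
    )

# The result could then be fetched with, e.g.,
# civis.io.read_civis_sql(column_type_query('public', 'my_table'),
#                         database='my-database', use_pandas=True)
```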

The types will be populated without a full table scanner run. You can use something like

db_id = client.get_database_id(database)
tables = client.tables.list(database_id=db_id, schema=schema, name=table)
columns = client.tables.get(tables[0].id).columns
dtypes = {col.name: redshift_to_py_type(col.sql_type)
          for col in columns}

stephen-hoover avatar Sep 18 '17 16:09 stephen-hoover
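The `redshift_to_py_type` helper in the snippet above isn't defined in the library; a minimal sketch of what it might look like (the mapping below is illustrative, not an exhaustive or official list of Redshift types):

```python
def redshift_to_py_type(sql_type):
    """Map a Redshift SQL type name to a Python type for pandas.

    Hypothetical helper: the mapping is an illustration only.
    """
    base = sql_type.split("(")[0].strip().lower()  # 'DECIMAL(17,9)' -> 'decimal'
    mapping = {
        "varchar": str, "char": str, "bpchar": str, "text": str,
        "smallint": int, "integer": int, "bigint": int,
        "decimal": float, "numeric": float,
        "double precision": float, "real": float,
        "boolean": bool,
    }
    return mapping.get(base, str)  # fall back to str for unknown types
```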

Here is the function I plan on using:

import civis

client = civis.APIClient()
table_id = client.get_table_id(table, database)

gdj0nes avatar Sep 18 '17 16:09 gdj0nes

Initial code, where `table_id` is passed from `read_civis` into `read_civis_sql`:

    if use_pandas:
        table_meta = client.tables.get(table_id)
        # The split is to handle e.g. DECIMAL(17,9)
        redshift_types = {col['name']: col['sql_type'].split('(')[0]
                          for col in table_meta['columns']}
        py_types = {name: SQL_PANDAS_MAP[type_]
                    for name, type_ in redshift_types.items()
                    if type_ not in DATE_TYPES}
        date_cols = [name for name, type_ in redshift_types.items()
                     if type_ in DATE_TYPES]

        sortkeys = table_meta['sortkeys'].split(',')
        # Not sure about the best way to handle null fields
        # or how they may appear
        data = pd.read_csv(url, parse_dates=date_cols,
                           index_col=sortkeys,
                           converters=py_types)

gdj0nes avatar Sep 18 '17 17:09 gdj0nes
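On the null-field question flagged in the snippet above: one option in current pandas (the nullable `"Int64"` dtype, added in pandas 0.24, so newer than this thread) keeps missing integers as `<NA>` instead of failing or silently converting the whole column to float. A sketch with made-up data:

```python
import io
import pandas as pd

# An empty cell in an integer column: with plain int this would error or
# become float; the nullable "Int64" dtype keeps it as <NA>.
csv = io.StringIO("zipcode,count\n02139,7\n60601,\n")
df = pd.read_csv(csv, dtype={"zipcode": str, "count": "Int64"})

print(df["zipcode"].tolist())  # leading zero preserved
print(df["count"].dtype)       # Int64 (nullable)
```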

Cool. Go ahead and make a branch and start the PR. We can do actual code review there.

beckermr avatar Sep 18 '17 17:09 beckermr

This would be a major improvement. The difficulty I've had with this before is that there doesn't seem to be a good way to get the type information for a general query passed into `read_civis_sql`. If someone has a suggestion, I would love to hear it, but I think we could only provide types for `read_civis`.

That said, it would still be useful to implement this for just `read_civis`. I could imagine somewhere around here adding the code @stephen-hoover wrote above. Then we could pass `dtypes` to `read_civis_sql` here, since the kwargs in `read_civis_sql` ultimately get passed to `pandas.read_csv`.

keithing avatar Sep 18 '17 22:09 keithing

Oof, we might have another problem. `read_civis` accepts a `columns` argument, which could contain arbitrary SQL. So if someone were to call

columns = ['my_var = 1']
read_civis('table', 'database', columns=columns)

We simply wouldn't have the type information.

keithing avatar Sep 18 '17 22:09 keithing

Hm. Good point. `pandas.read_csv` accepts a `dtype` argument, which is how we'd implement this. The `dtype` doesn't have to include all columns, and it can include columns which aren't present. I think the only time we'd run into trouble would be if users tried to export columns with the same name as a column in the actual table, but with a different type. I suppose that could happen, couldn't it?
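A quick check of the claim above, with made-up data: the `dtype` mapping passed to `read_csv` need not match the file's columns exactly.

```python
import io
import pandas as pd

# dtype omits column 'b' and includes 'not_present', which isn't in the
# file; read_csv applies what matches and ignores the rest.
csv = io.StringIO("a,b\n01,x\n02,y\n")
df = pd.read_csv(csv, dtype={"a": str, "not_present": float})

print(df["a"].tolist())  # strings, leading zeros kept
```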

I'm not sure what the right answer is. We could add a boolean switch to enable type inference when reading with pandas. That would at least let users turn it off when they're transforming types. Or we could provide examples of how to do this yourself, or helper functions.

stephen-hoover avatar Sep 18 '17 22:09 stephen-hoover
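The boolean-switch idea could be sketched like this, where `merge_type_kwargs` is a hypothetical helper (not part of the library): inferred dtypes are merged into the kwargs bound for `pandas.read_csv`, user-supplied `dtype` entries win, and the switch disables inference entirely.

```python
def merge_type_kwargs(user_kwargs, inferred_dtypes, infer_types=True):
    """Merge inferred column dtypes into read_csv kwargs.

    Hypothetical sketch: user-supplied dtype entries take precedence
    over inferred ones; infer_types=False leaves the kwargs untouched.
    """
    kwargs = dict(user_kwargs)
    if infer_types:
        dtype = dict(inferred_dtypes)
        dtype.update(kwargs.get("dtype", {}))  # user overrides win
        kwargs["dtype"] = dtype
    return kwargs
```

A user transforming types (say, casting a column the table stores as varchar) could then either pass their own `dtype` entry or set `infer_types=False`.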