
Do something more sensible with data from pandas

Open ngoldbaum opened this issue 6 years ago • 2 comments

  • unyt version: v2.2.0+7.g5d3ace5
  • Python version: 3.6.8
  • Operating System: Ubuntu 18.04

Description

If you multiply a column of a pandas DataFrame (a Series) by a unit, you get back a plain Series that doesn't actually have any units:

In [1]: import unyt as u

In [2]: import pandas as pd

In [3]: data = pd.read_csv('/home/goldbaum/Documents/rc-co2monitor/co2data.csv')

In [4]: t = data['Temperature']*u.degC

In [5]: t.units
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-5-7e2982815421> in <module>
----> 1 t.units

~/.pyenv/versions/3.6.8/lib/python3.6/site-packages/pandas/core/generic.py in __getattr__(self, name)
   5065             if self._info_axis._can_hold_identifiers_and_holds_name(name):
   5066                 return self[name]
-> 5067             return object.__getattribute__(self, name)
   5068
   5069     def __setattr__(self, name, value):

AttributeError: 'Series' object has no attribute 'units'

In [6]: type(t)
Out[6]: pandas.core.series.Series

Adding full support for pandas data types may be a lot to ask for. In that case we should at least detect when we're handed a pandas Series or DataFrame (preferably without needing to actually import pandas) and raise an error telling the user to convert the data to NumPy arrays first.
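
A minimal sketch of that detection idea, assuming it is acceptable to inspect the class's module name instead of importing pandas. The helper names `is_pandas_container` and `reject_pandas` are hypothetical and not part of unyt:

```python
def is_pandas_container(obj):
    """Return True if obj is (a subclass of) a pandas Series or DataFrame.

    Only class metadata is inspected, so pandas itself is never imported.
    """
    for cls in type(obj).__mro__:
        # Top-level package that defines this class, e.g. "pandas"
        top_level = (getattr(cls, "__module__", "") or "").split(".")[0]
        if top_level == "pandas" and cls.__name__ in ("Series", "DataFrame"):
            return True
    return False


def reject_pandas(obj):
    """Hypothetical guard: raise a helpful error for pandas input."""
    if is_pandas_container(obj):
        raise TypeError(
            f"Cannot attach units to a pandas {type(obj).__name__}; "
            "convert it to a NumPy array first, e.g. with .to_numpy()."
        )
    return obj
```

Walking the MRO means subclasses of Series or DataFrame defined in other packages would be caught as well. Where exactly such a guard would be called inside unyt is left open here.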

ngoldbaum avatar Jun 20 '19 13:06 ngoldbaum

Another option, and a very light touch to unyt, is to register an accessor with pandas. I have prototyped this and usage looks like:

>>> import pandas as pd
>>> import unyt
>>> data = pd.DataFrame({"Temperature":[0.0, 23.0, 55.0]})
>>> data.Temperature.unyt.set_units("degC")
unyt_array([ 0., 23., 55.], 'degC')

Is this approach of interest?

l-johnston avatar Jul 17 '20 16:07 l-johnston

I’d probably need to see more details on how this would work inside a pandas workflow. Feel free to open a PR but please do include some usage examples that demonstrate how this would be useful.

I’d also like it if we could avoid importing pandas (or at least delay importing pandas until it’s needed) as that would increase the import time cost for the whole library.
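
One common way to delay an import is to do it inside the function that needs it. A minimal sketch, where `optional_import` and `to_series` are hypothetical names used only for illustration and not part of unyt:

```python
import importlib


def optional_import(name):
    """Import a module on first use and explain what is missing if it fails."""
    try:
        return importlib.import_module(name)
    except ImportError as exc:
        raise ImportError(
            f"{name} is required for this feature but could not be imported"
        ) from exc


def to_series(values, name=None):
    """Hypothetical helper: pandas is imported here, only when this is called."""
    pd = optional_import("pandas")
    return pd.Series(values, name=name)
```

With this pattern, importing the library itself never pays the pandas import cost. Only the first call to a pandas-specific function does, and later calls reuse the module already cached in `sys.modules`.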

ngoldbaum avatar Jul 17 '20 16:07 ngoldbaum