datashader

Use DataFrame.index for x-axis

Open simonkamronn opened this issue 8 years ago • 15 comments

It is typical to use a DatetimeIndex on a Pandas DataFrame, and it works well for plotting and indexing, so it would be nice to be able to specify e.g.

cvs = ds.Canvas(x_range=x_range, y_range=y_range, plot_width=w, plot_height=h)
cvs.line(df, 'index', 'y', agg=ds.any())

to draw a line.

Furthermore, when using a Dask DataFrame that lazily loads out-of-memory data, it is necessary to specify the range using the index; otherwise it will go through the entire DataFrame every time, which can be very slow.
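
As a rough illustration of selecting a range through the index, here is a minimal pandas-only sketch. The data and date strings are made up for the example; Dask DataFrames offer a similar `.loc` on a sorted index, but that is not shown or tested here.

```python
import pandas as pd

# Illustrative data: ten daily samples on a sorted DatetimeIndex.
idx = pd.date_range("2016-01-01", periods=10, freq="D")
df = pd.DataFrame({"y": range(10)}, index=idx)

# Label-based slicing on the index keeps only the visible range
# (both endpoints are included), so later steps see fewer rows.
visible = df.loc["2016-01-03":"2016-01-05"]
```

The idea is that only `visible`, not the full frame, would then be handed to the plotting step.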

simonkamronn avatar Aug 17 '16 11:08 simonkamronn

What's the problem if you try the code above? Looks ok at first glance, but presumably doesn't work?

Sounds like we should add the hint about Dask to our documentation?

jbednar avatar Aug 17 '16 16:08 jbednar

Well, two issues.

  1. You can't access the index of a DataFrame by using it as a key, e.g. df['index'], which seems to be the current way of accessing the data in a DataFrame.
  2. If the data is datetime64, it throws a ValueError:
/home/sdka/anaconda3/lib/python3.5/site-packages/datashader/glyphs.py in validate(self, in_dshape)
     25     def validate(self, in_dshape):
     26         if not isreal(in_dshape.measure[self.x]):
---> 27             raise ValueError('x must be real')
     28         elif not isreal(in_dshape.measure[self.y]):
     29             raise ValueError('y must be real')

ValueError: x must be real

simonkamronn avatar Aug 17 '16 18:08 simonkamronn

I see; you want to avoid having to do df['index_col'] = df.index to turn it into a proper column. That makes sense, though it would be a bit awkward to support that, because we are indeed accessing the columns using df[...], which is currently a nice, readable syntax internally. But I only see six places in datashader/glyphs.py that access the df in that way, so conceivably we could replace all of them with a call to a helper function that recognizes some special keyword for the column name (None, perhaps?) to use .index instead when needed. If you'd like to sketch that out as a PR I'd be happy to consider merging it.
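
A minimal sketch of what such a helper could look like. The name `get_column` and the choice of `None` as the special keyword are hypothetical; nothing like this exists in datashader as far as this thread shows.

```python
import pandas as pd

def get_column(df, name):
    """Return the values for column `name`, or the index values if `name` is None.

    Hypothetical helper for illustration only; not part of datashader.
    """
    if name is None:
        return df.index.values
    return df[name].values

df = pd.DataFrame({"y": [1.0, 2.0, 3.0]}, index=[10, 20, 30])
xs = get_column(df, None)  # values taken from df.index
ys = get_column(df, "y")   # values taken from a regular column
```

Each place that currently does df[...] would call the helper instead, so the special case lives in one function.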

datetime64 support seems more difficult, because we have various code that creates a reduction as np.zeros, np.full, etc. with a specified type, and so it would take some effort to make sure that all such cases are expressed in a way that works for different types. Could be done, but seems messy. Again, happy to consider a (separate) PR for that, if it's not too awkward!
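
Until something like that exists, one possible workaround (an assumption on my part, not verified against this datashader version) is to store the timestamps as plain integers in a regular column, so that the x values are numeric. The column name `t` is made up for the example.

```python
import pandas as pd

idx = pd.to_datetime(["2016-08-17", "2016-08-18", "2016-08-19"])
df = pd.DataFrame({"y": [0.0, 1.0, 0.5]}, index=idx)

# Nanoseconds since the Unix epoch as int64; the explicit cast to
# datetime64[ns] fixes the unit regardless of how pandas stored the index.
plot_df = pd.DataFrame({
    "t": df.index.values.astype("datetime64[ns]").astype("int64"),
    "y": df["y"].values,
})

# The integers can be turned back into timestamps, e.g. for axis labels.
restored = pd.to_datetime(plot_df["t"])
```

int64 is used rather than float64 because nanosecond timestamps for recent dates exceed what float64 can represent exactly.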

jbednar avatar Aug 17 '16 20:08 jbednar

Ok, I'll look into it when time permits. I think both issues need to be addressed for Datashader to be of use with large timeseries.

simonkamronn avatar Aug 18 '16 08:08 simonkamronn

I agree. Do you have any publicly available datasets with large timeseries to use for testing, if we do this? It would be great to have a concrete example with real-world data.

jbednar avatar Aug 19 '16 04:08 jbednar

I don't think it is necessary to use a large dataset for demonstrative purposes. The important parts would rather be the lazy loading from multiple/chunked files and then only loading a subset of the data when not viewing the full range.

simonkamronn avatar Aug 20 '16 12:08 simonkamronn

Loading only a subset of the data is a feature that would be useful for all data types, not just time series, and will definitely need to be implemented at some point.

jbednar avatar Aug 22 '16 12:08 jbednar

+1 for supporting datetime as x-axis here.

In some cases, what I do to avoid creating another column is to pass in df.reset_index() instead of df, which then allows you to refer to the original index by name.
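
For example, a small pandas-only sketch (the index name `time` is just for illustration):

```python
import pandas as pd

idx = pd.to_datetime(["2017-06-20", "2017-06-21", "2017-06-22"])
df = pd.DataFrame({"y": [1.0, 2.0, 3.0]}, index=idx)
df.index.name = "time"

# reset_index() returns a new frame in which the old index is a regular
# column named after the index ("index" if the index has no name).
flat = df.reset_index()
```

df itself is left unchanged; only the frame passed to the plotting call has the extra column.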

esvhd avatar Jun 22 '17 20:06 esvhd

Hello! I'm trying to view a DataFrame with points defined by coordinates, using agg = ds.Canvas().points(df, 'lon', 'lat'). The type of each lat/lon value is 'numpy.float64'. It throws the same error as the one shown above for the 'datetime64' type:

ValueError: x must be real

Could you please help me here?

thanks!

georgyEgor avatar Jan 05 '18 09:01 georgyEgor

Does df.info() report all values as float? You may have strings or NaN, etc.


apiszcz avatar Jan 05 '18 10:01 apiszcz

Thanks for the answer; I removed all null values. But let me add more details about the previous errors:

  1. Initially, I got the error 'AttributeError: module 'pandas.api.types' has no attribute 'CategoricalDtype'' for the following DataFrame (df.info() output):
<class 'pandas.core.frame.DataFrame'>
Int64Index: 21076 entries, 0 to 24541
Data columns (total 2 columns):
lon    21076 non-null float64
lat    21076 non-null float64
dtypes: float64(2)
memory usage: 494.0 KB

  2. After that I resolved the CategoricalDtype issue with: df['lon'] = df['lon'].astype('category'); df['lat'] = df['lat'].astype('category')

  3. And finally, with df.info() reporting:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21076 entries, 0 to 21075
Data columns (total 2 columns):
lon    21076 non-null category
lat    21076 non-null category
dtypes: category(2)
memory usage: 371.5 KB

I got the error 'ValueError: x must be real'

georgyEgor avatar Jan 05 '18 10:01 georgyEgor

I'm not sure category is what you want for floats. Can you post df.describe()?


apiszcz avatar Jan 05 '18 11:01 apiszcz

Sure, here it is:

     | lon | lat
count | 21076.000000 | 21076.000000
unique | 18669.000000 | 18329.000000
top | -114.547784 | 32.612881
freq | 29.000000 | 29.000000

Looks like it's not float, as there are no mean, std, min, max, etc.
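
A small pandas-only sketch of what seems to be happening here, with made-up coordinates: once a column is cast to category, describe() reports count/unique/top/freq, and casting back to float64 restores the numeric summary. Whether this also resolves the original CategoricalDtype error is not something this sketch shows.

```python
import pandas as pd

df = pd.DataFrame({"lon": [-114.547784, -114.5, -113.9],
                   "lat": [32.612881, 32.7, 33.1]})

# Casting to category gives the count/unique/top/freq summary seen above.
as_cat = df.astype("category")
cat_summary = as_cat["lon"].describe()

# Casting back to float64 makes the columns numeric again.
fixed = as_cat.astype("float64")
num_summary = fixed["lon"].describe()
```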

georgyEgor avatar Jan 05 '18 11:01 georgyEgor

So is there now a good way to plot the df index without creating a proper column for it?

fogx avatar Nov 01 '19 14:11 fogx

I don't think a PR for that has ever appeared (to replace the "six places in datashader/glyphs.py that access the df in that way", a count that may be higher now), so it's still open as a proposed extension that you or someone else could contribute. For our own purposes, we just create a proper column when needed.

jbednar avatar Nov 01 '19 15:11 jbednar