datashader
Use DataFrame.index for x-axis
It is typical to use a DatetimeIndex on a Pandas DataFrame, and it works well when plotting/indexing, so it would be nice to be able to specify e.g.
cvs = ds.Canvas(x_range=x_range, y_range=y_range, plot_width=w, plot_height=h)
cvs.line(df, 'index', 'y', agg=ds.any())
to draw a line.
Furthermore, if using a Dask DataFrame that loads out-of-memory data lazily, it is necessary to specify the range using the index, or it will go through the entire DataFrame every time, which can be very slow.
What's the problem if you try the code above? Looks ok at first glance, but presumably doesn't work?
Sounds like we should add the hint about Dask to our documentation?
Well, two issues.
- You can't access the index of a DataFrame by using it as a key, e.g. df['index'], which seems to be the current way of accessing the data in a DataFrame.
- If the data is datetime64, it throws a ValueError:
/home/sdka/anaconda3/lib/python3.5/site-packages/datashader/glyphs.py in validate(self, in_dshape)
25 def validate(self, in_dshape):
26 if not isreal(in_dshape.measure[self.x]):
---> 27 raise ValueError('x must be real')
28 elif not isreal(in_dshape.measure[self.y]):
29 raise ValueError('y must be real')
ValueError: x must be real
I see; you want to avoid having to do df['index_col'] = df.index to turn it into a proper column. That makes sense, though it would be a bit awkward to support that, because we are indeed accessing the columns using df[...], which is currently a nice, readable syntax internally. But I only see six places in datashader/glyphs.py that access the df in that way, so conceivably we could replace all of them with a call to a helper function that recognizes some special keyword for the column name (None, perhaps?) to use .index instead when needed. If you'd like to sketch that out as a PR I'd be happy to consider merging it.
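The helper idea described above could look roughly like the following. This is only a sketch: the function name get_values is hypothetical and is not part of datashader, and it only illustrates the "None means use the index" convention suggested in this comment.

```python
import pandas as pd


def get_values(df, name):
    """Hypothetical helper: return the values for column `name`,
    or the index values when `name` is None."""
    if name is None:
        return df.index.values
    return df[name].values


# Small made-up frame with a numeric index.
df = pd.DataFrame({"y": [1.0, 4.0, 9.0]}, index=[10, 20, 30])

x = get_values(df, None)  # falls back to df.index
y = get_values(df, "y")   # ordinary column lookup
```

Each place that currently does df[...] would call such a helper instead, so the special case lives in one function.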
datetime64 support seems more difficult, because we have various code that creates a reduction as np.zeros, np.full, etc. with a specified type, and so it would take some effort to make sure that all such cases are expressed in a way that works for different types. Could be done, but seems messy. Again, happy to consider a (separate) PR for that, if it's not too awkward!
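Until datetime64 is supported, one possible workaround (an assumption on my part, not something confirmed in this thread) is to convert the DatetimeIndex to integer nanoseconds and plot that as a real-valued column. The column name t_ns and the data are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up time series with a DatetimeIndex.
df = pd.DataFrame(
    {"y": [0.5, 1.5, 2.5]},
    index=pd.date_range("2018-01-01", periods=3, freq="D"),
)

# Nanoseconds since the epoch as int64, stored in a regular column.
plot_df = pd.DataFrame({
    "t_ns": df.index.values.astype("datetime64[ns]").astype("int64"),
    "y": df["y"].values,
})

# The integers can be mapped back to timestamps, e.g. for axis labels.
restored = pd.to_datetime(plot_df["t_ns"].values)
```

The numeric column can then be passed as x, and x_range would need to be given in the same nanosecond units.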
Ok, I'll look into it when time permits. I think both issues need to be addressed for Datashader to be of use with large timeseries.
I agree. Do you have any publicly available datasets with large timeseries to use for testing, if we do this? It would be great to have a concrete example with real-world data.
I don't think it is necessary to use a large dataset for demonstrative purposes. The important parts would rather be the lazy loading from multiple/chunked files and then only loading a subset of the data when not viewing the full range.
Loading only a subset of the data is a feature that would be useful for all data types, not just time series, and will definitely need to be implemented at some point.
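As a rough illustration of selecting only the visible range, here is a plain pandas sketch with made-up data. It assumes a sorted DatetimeIndex; I believe Dask DataFrames offer a similar .loc on an index with known divisions, but that is an assumption and is not shown here.

```python
import numpy as np
import pandas as pd

# Made-up series: one sample per minute, sorted DatetimeIndex.
idx = pd.date_range("2018-01-01", periods=1000, freq="min")
df = pd.DataFrame({"y": np.arange(1000, dtype="float64")}, index=idx)

# The range currently in view (hypothetical values).
x_range = (pd.Timestamp("2018-01-01 01:00"), pd.Timestamp("2018-01-01 02:00"))

# Label-based slicing on the index keeps only rows inside the range
# (both endpoints are included).
visible = df.loc[x_range[0]:x_range[1]]
```

Only the visible rows would then be handed to the aggregation step instead of the whole frame.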
+1 for supporting datetime as x-axis here.
In some cases, what I do to avoid creating another column is to pass in df.reset_index() instead of df, which then allows you to refer to the original index by name.
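A minimal sketch of the reset_index() approach, using made-up data. With an unnamed index, pandas names the new column "index"; with a named index, it uses that name (here "t", chosen only for illustration).

```python
import pandas as pd

# Made-up frame with a numeric, unnamed index.
df = pd.DataFrame({"y": [1.0, 2.0, 3.0]}, index=[0.0, 0.5, 1.0])

# Unnamed index becomes a column called "index".
flat = df.reset_index()

# A named index keeps its name as the column label.
named = df.rename_axis("t").reset_index()

# flat could then be passed in place of df, e.g.
# cvs.line(flat, 'index', 'y', agg=ds.any())
```

The original df is left unchanged, since reset_index() returns a new DataFrame.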
Hello!
I'm trying to view a DataFrame with points defined by coordinates using
agg = ds.Canvas().points(df, 'lon', 'lat')
The type of each lat/lon value is 'numpy.float64'.
It throws the same error as shown above for the 'datetime64' type:
ValueError: x must be real
Could you please help me here?
thanks!
Does df.info() report all values as float? You may have strings, NaN, etc.
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/bokeh/datashader/issues/218#issuecomment-355505574, or mute the thread https://github.com/notifications/unsubscribe-auth/ABXVTdYQnbVQuWr9vQV9zB-DpW3YCQOhks5tHeXsgaJpZM4JmWOb .
Thanks for the answer, I removed all null values. But let me add more details about the previous errors:
- Initially, I got the error 'AttributeError: module 'pandas.api.types' has no attribute 'CategoricalDtype'' for the following DataFrame (df.info() output):
<class 'pandas.core.frame.DataFrame'>
Int64Index: 21076 entries, 0 to 24541
Data columns (total 2 columns):
lon 21076 non-null float64
lat 21076 non-null float64
dtypes: float64(2)
memory usage: 494.0 KB
- After that, I resolved the CategoricalDtype issue with:
df['lon'] = df['lon'].astype('category')
df['lat'] = df['lat'].astype('category')
- And finally, for this df.info() output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21076 entries, 0 to 21075
Data columns (total 2 columns):
lon 21076 non-null category
lat 21076 non-null category
dtypes: category(2)
memory usage: 371.5 KB
I got the error 'ValueError: x must be real'.
I'm not sure category is what you want for floats. Can you post df.describe()?
Sure, here it is:
| lon | lat
count | 21076.000000 | 21076.000000
unique | 18669.000000 | 18329.000000
top | -114.547784 | 32.612881
freq | 29.000000 | 29.000000
Looks like it's not float, as there are no mean, std, min, max, etc.
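For what it's worth, a quick way to check this is to look at the column dtypes and, if they are category, cast them back to float64. The sketch below uses made-up coordinates and reproduces the category conversion from the earlier comment before undoing it.

```python
import pandas as pd

# Made-up coordinates, for illustration only.
df = pd.DataFrame({
    "lon": [-114.547784, -114.5, -113.9],
    "lat": [32.612881, 32.7, 33.1],
})

# Reproduce the state from the comment above: both columns as category.
as_cat = df.astype("category")
print(as_cat.dtypes)

# Cast back to float64 so the columns are real-valued again.
fixed = as_cat.copy()
fixed["lon"] = fixed["lon"].astype("float64")
fixed["lat"] = fixed["lat"].astype("float64")

is_real = all(pd.api.types.is_float_dtype(t) for t in fixed.dtypes)
```

With float64 columns, df.describe() reports mean, std, min and max again instead of unique, top and freq.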
So is there a good solution now for plotting the df index without creating a proper column for it?
I don't think a PR for that has ever appeared (for the "see six places in datashader/glyphs.py that access the df in that way", which may be higher now), so it's still open as a proposed extension that you or someone else could contribute. For our own purposes, we just do create a proper column when needed.