Adding a NumPy backend
Data often starts out as NumPy arrays rather than Dask or Pandas dataframes, and it could be useful to work directly with such arrays (see #283). I briefly investigated how much work it would be to support an OrderedDict of equal-length NumPy arrays, by searching the code for "df" (maybe not a reliable method, but I think it covers everywhere the dataframe is accessed).
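To make that concrete, here's a minimal sketch of the kind of input I have in mind (column names and values invented for illustration):

```python
from collections import OrderedDict
import numpy as np

# A "dataframe" as an OrderedDict of equal-length NumPy arrays,
# one entry per column.
df = OrderedDict([
    ('x', np.array([0.0, 1.5, 3.0, 4.5])),
    ('y', np.array([10.0, 20.0, 15.0, 5.0])),
])
```

Findings: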
- Some usages of `df` that would just work already for the OrderedDict case: `df[self.x].min()`, `df[self.x].max()`
- Some accesses are already just skipping to the underlying NumPy array anyway, so we could add a conditional or a simple wrapper function to do the right thing in both cases: `df[y_name].values` (a hypothetical helper is sketched after this list)
- Categorical data would require some extra work (we currently access `df[self.column].cat.codes.values`); those codes would need to be computed and stored somewhere (one option is sketched below)
- We already have multiple dispatch to handle the actual computational bits, and we'd need an equivalent of `datashader/pandas.py`, which isn't much code but is fairly mysterious (the dispatch pattern is illustrated below)
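For the `.values` accesses, a small helper could paper over the difference between the two container types. A sketch, where the name `column_values` is mine and not anything existing in datashader:

```python
import numpy as np
import pandas as pd

def column_values(df, name):
    """Return column `name` as a plain NumPy array, whether `df`
    is a pandas DataFrame or an OrderedDict of arrays."""
    if isinstance(df, pd.DataFrame):
        return df[name].values
    return np.asarray(df[name])
```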
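For categoricals, one option (an assumption about how we might store the codes, not existing code) is to compute integer codes up front with `np.unique` and keep them alongside the category labels:

```python
import numpy as np

colors = np.array(['red', 'blue', 'red', 'green', 'blue'])

# return_inverse=True yields integer codes into the sorted uniques,
# playing the role of df[col].cat.codes.values for a plain array.
categories, codes = np.unique(colors, return_inverse=True)
print(categories)  # ['blue' 'green' 'red']
print(codes)       # [2 0 2 1 0]
```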
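The dispatch pattern itself is simple. Here is a self-contained illustration using `functools.singledispatch` (datashader's own dispatcher and the actual pipeline signature differ; the parameter names here are invented), just to show the shape an OrderedDict equivalent of `datashader/pandas.py` would take:

```python
from collections import OrderedDict
from functools import singledispatch

import pandas as pd

@singledispatch
def pipeline(df, glyph, summary):
    raise TypeError("don't know how to aggregate %r" % type(df))

@pipeline.register(pd.DataFrame)
def _(df, glyph, summary):
    # the existing pandas path (what datashader/pandas.py provides)
    ...

@pipeline.register(OrderedDict)
def _(df, glyph, summary):
    # the hypothetical NumPy path: the same aggregation steps,
    # but indexing plain arrays instead of pandas Series
    ...
```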
So it would be a bit of a project, but I think it would be feasible to support NumPy using named arrays in an OrderedDict. Something for the wishlist, or a good project for a motivated contributor who wants to really understand how everything fits together in datashader...
Dask may also be useful for this purpose: http://matthewrocklin.com/blog/work/2017/01/17/dask-images