
Adding a NumPy backend

Open jbednar opened this issue 8 years ago • 1 comment

Data often starts out as NumPy arrays rather than Dask or Pandas dataframes, and it could be useful to work directly with such arrays (see #283). I briefly investigated how much work it would be to use an OrderedDict of NumPy arrays of the same size, by searching the code for "df" (maybe not reliable, but I think it covers everywhere the dataframe is accessed). Findings:

  1. Some usages of df that would just work already for the OrderedDict case:
       df[self.x].min()
       df[self.x].max()
  2. Some accesses that are already just skipping to the underlying NumPy array anyway, so we could add a conditional or make a simple wrapper function to do the right thing in both cases:
       df[y_name].values
  3. Categorical data would require some extra work (we currently access df[self.column].cat.codes.values); those values would need to be stored somewhere.
  4. We already have multi-dispatch to handle the actual computational bits, and we'd need an equivalent to datashader/pandas.py, which isn't much code but is fairly mysterious.
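
A rough sketch of what the first three findings might look like in practice, assuming an OrderedDict of equal-length 1-D arrays. The helper names `column_values` and `category_codes` are hypothetical, not existing datashader functions. The `np.unique` codes only approximate what pandas `.cat.codes` returns (sorted categories, no special handling of missing values).

```python
from collections import OrderedDict

import numpy as np


def column_values(source, name):
    """Return the named column as a NumPy array (hypothetical helper)."""
    column = source[name]
    # A pandas Series exposes its underlying array as `.values`;
    # a plain NumPy array has no such attribute and is returned unchanged.
    return getattr(column, "values", column)


def category_codes(values):
    """Return (categories, codes) for a 1-D array of labels (hypothetical helper)."""
    # `return_inverse=True` yields, for each element, its index into the
    # sorted array of unique labels, which plays the role of integer codes.
    categories, codes = np.unique(values, return_inverse=True)
    return categories, codes


# Example data standing in for a dataframe (made up for illustration).
arrays = OrderedDict(
    x=np.array([0.0, 1.5, 3.0]),
    y=np.array([2.0, 4.0, 6.0]),
    kind=np.array(["b", "a", "b"]),
)

# Finding 1: min/max work directly on the arrays.
x_range = (arrays["x"].min(), arrays["x"].max())

# Finding 2: one wrapper covers both the Series and the ndarray case.
y = column_values(arrays, "y")

# Finding 3: the codes have to be computed and stored by the caller.
categories, codes = category_codes(arrays["kind"])
```

The codes would still need to live somewhere alongside the arrays, for example as an extra integer column in the same OrderedDict.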

So it would be a bit of a project, but I think it would be feasible to support NumPy using arrays named in OrderedDicts. Something for the wishlist, or a good project for a motivated contributor who wants to really understand how everything fits together in datashader...
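
For concreteness, here is a minimal sketch of the container itself, assuming the only requirements are that columns are named, ordered, one-dimensional, and all the same length. `as_column_dict` is a hypothetical name used for illustration; nothing like it exists in datashader as far as this issue describes.

```python
from collections import OrderedDict

import numpy as np


def as_column_dict(columns):
    """Validate (name, values) pairs and return them as an OrderedDict of arrays.

    Hypothetical helper: checks that every column is 1-D and that all
    columns share the same length, as a dataframe would guarantee.
    """
    result = OrderedDict()
    length = None
    for name, values in columns:
        array = np.asarray(values)
        if array.ndim != 1:
            raise ValueError(f"column {name!r} must be one-dimensional")
        if length is None:
            length = array.shape[0]
        elif array.shape[0] != length:
            raise ValueError(
                f"column {name!r} has length {array.shape[0]}, expected {length}"
            )
        result[name] = array
    return result


# Example usage with made-up data.
data = as_column_dict([("x", [0.0, 1.0, 2.0]), ("y", [5.0, 6.0, 7.0])])
```

Validating once up front would let the rest of the code assume the arrays line up row by row, the same way dataframe columns do.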

jbednar, Feb 23 '17 15:02

Dask may also be useful for this purpose: http://matthewrocklin.com/blog/work/2017/01/17/dask-images

jbednar, Apr 28 '17 01:04