
Improving Datashader's API

jbednar opened this issue Oct 07 '17 · 10 comments

Datashader's API was initially designed mainly to help exercise the underlying computations, not to make things straightforward for the user. As the library has matured, the main thing keeping the version number below 1.0 has been the desire to make a cleaner API for end users before settling on it. This issue outlines a proposed new API that would be used to help datashader reach 1.0.

Current API

The current API for making an image from a dataframe is a bit muddled:

cvs = ds.Canvas(plot_width=200, plot_height=200, x_range=(-8,8), y_range=(-8,8))
agg = cvs.points(df,'x','y',agg=ds.reductions.count()) 
img = tf.shade(agg)

(There is also another way to make an agg, using agg = ds.bypixel(df, cvs, ds.glyphs.Point('x','y'), ds.reductions.count()), but this syntax only works for lines and points and is even more confusing for the end user, so we'll ignore it and hopefully delete mentions of it until that too can be cleaned up. Also ignoring Pipeline, which is not widely used and can probably be deleted.)

The above formulation is at odds with the pipeline diagram laid out in the pipeline.ipynb documentation:

[pipeline diagram from pipeline.ipynb]

The advertised pipeline starts with a "Scene" object that sits somewhere in between the Canvas and aggregate objects here -- it includes everything in Canvas, plus the specification of a specific glyph that's currently not known until the points() (or line() or raster()) call. Introducing the glyph and the columns so late is odd, because those are the key bits of metadata that determine what the plot will be about. For a given pair of columns, the glyph type is arguably an implicit property of the dataframe itself, i.e. a declaration of what that data represents. Introducing such key metadata only at the moment of aggregation seems backwards; we'll never normally want to change it between aggregation calls over a particular set of columns. Conversely, the width and height are properties of the resulting aggregation, so it's odd for them not to be specifiable in the call that actually generates the aggregation. The information is thus divided up between stages in a confusing and not very helpful way that makes it difficult to describe just what Canvas is or does. For these reasons, in practice, people end up building a Canvas and making an aggregation as a single step, so having Canvas as a separate object is not currently achieving anything obvious.

Proposed User API 1.0:

Combine the Canvas object and the canvas glyph methods into three new classes ds.Points, ds.Lines, and ds.Raster, each inheriting from a new superclass ds.Scene. Each of the Scene classes will contain all data and specifications needed to create an aggregate, but these values can be overridden during an aggregate() call if desired:

scene = ds.Points(df, 'x', 'y', x_range=(-8,8), y_range=(-8,8), agg=ds.reductions.count())
agg = scene.aggregate()
img = tf.shade(agg)

Here the user is forced to choose a glyph type at the same time as supplying the dataframe and column names, which makes sense to me semantically and is certainly the set of information that must always be provided by the user. Any other information can be specified or overridden at whatever level the user prefers:

  • The Scene class defines common parameters that can be set as class attributes at this superclass level, applying to all Scene types
  • Specific Scene subclasses define additional parameters or set defaults specific to that type of Scene that can be set or changed as class attributes at that level
  • A specific Scene class can be instantiated with specific parameter values
  • An aggregate() call can override any of these values
  • shade() takes its own options applying only to that stage, with similar Parameterized support

In this way, a user can specify what is known from the start (that the data represents Points or Lines or a Raster), and can then change only what needs to change in a specific aggregation step (typically ranges, resolution, and the aggregator). A usage sketch follows; see also the partial implementation below.
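
For concreteness, here is a hypothetical usage sketch of those override levels, assuming the proposed ds.Scene/ds.Points names and the Parameterized behavior described above (none of this exists yet):

import datashader as ds

ds.Scene.width = 400                     # superclass-level default, applies to all Scene types
ds.Scene.height = 400
ds.Points.agg = ds.reductions.count()    # subclass-level default, applies only to Points

scene = ds.Points(df, 'x', 'y', x_range=(-8, 8), y_range=(-8, 8))        # instance-level values
agg = scene.aggregate(width=200, height=200, agg=ds.reductions.any())    # per-call overrides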

Here I've used the method name aggregate() for concreteness, but we could presumably use __call__, since aggregating is the one obvious operation to be done on a Scene.

Proposed User API extension: Rich display for Scenes

Once the above API has been implemented, the Scene object will contain all of the information necessary to render a default image, as you can see from the fact that the aggregate() and shade() calls don't require any arguments. Thus we could make the Scene object visualize itself in a Jupyter notebook by default, just as the Image objects returned by tf.shade do, reducing the minimal invocation for a datashader image to just:

ds.Points(df,'x','y')

(which would render using all the default options to a PNG visible via Jupyter's rich display support). This seems convenient, but it is difficult to know where to stop: we could then make this display configurable by making tf.shade a Parameterized object and instantiating it in the Scene class, allowing the user to change its parameters if desired. And then people would want an optional transformation step on the aggregate, dynamic spreading on the final result, and so on, ending up with another version of Pipeline.
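
As a concrete illustration, rich display could hook into Jupyter with something like the following sketch; aggregate() is the proposed method, tf.shade and Image.to_pil are existing datashader pieces, and everything else is an assumption:

from io import BytesIO
from param import Parameterized
import datashader.transfer_functions as tf

class Scene(Parameterized):
    ...
    def _repr_png_(self):
        # Jupyter calls _repr_png_ for rich display; render with all defaults
        img = tf.shade(self.aggregate())
        buf = BytesIO()
        img.to_pil().save(buf, format='PNG')
        return buf.getvalue()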

Perhaps it would be safer to provide somewhat similar levels of convenience by making shade be an operation like those in holoviews.operation.datashader:

shade(ds.Points(df,'x','y'))

In this way shade can accept its own options, while still conveniently displaying in a notebook.

For this to work, shade would have to check to see if it's been given a Scene rather than an xarray, and would call it to do the aggregation first if need be. Without that extra code, it's a bit uglier, but not too bad assuming we use __call__ syntax:

shade(ds.Points(df,'x','y')())
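
That check is only a couple of lines; a minimal sketch of the dispatch, where Scene is the proposed class and tf.shade is the existing function:

import datashader.transfer_functions as tf

def shade(obj, **kwargs):
    # Accept either a Scene (proposed) or an aggregate (current behavior)
    if isinstance(obj, Scene):
        obj = obj.aggregate()
    return tf.shade(obj, **kwargs)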

Current Implementation

The various glyph types (points, lines, and rasters) are defined as objects in datashader.glyph, but they are actually typically accessed as methods on a Canvas object that doesn't do much other than access them:

class Canvas(object):
    def __init__(self, plot_width=600, plot_height=600, x_range=None, y_range=None,
                 x_axis_type='linear', y_axis_type='linear'): ...
    def points(self, source, x, y, agg=None): ...
    def line(self, source, x, y, agg=None): ...
    def raster(self, source, layer=None, upsample_method='linear',
               downsample_method='mean', nan_value=None): ...

Proposed Implementation

class Scene(Parameterized):
    width   = param.Integer(default=600, doc="Width of aggregate array to create")
    height  = param.Integer(default=600, doc="Height of aggregate array to create")
    x_range = param.NumericTuple(default=None)
    y_range = param.NumericTuple(default=None)
    x_axis_type = param.ObjectSelector(default='linear', objects=['linear','log'])
    y_axis_type = param.ObjectSelector(default='linear', objects=['linear','log'])
    agg = param.ClassSelector(class_=Reduction, default=None)

    def __init__(self, source, x, y, **params):
        super(Scene, self).__init__(**params)
        self.source = source
        self.x, self.y = x, y

    def aggregate(self, **params):
        p = ParamOverrides(self, params)   # per-call overrides of any parameter
        return bypixel(self.source, self, self.glyph(self.x, self.y), p.agg)

class Points(Scene):
    agg = param.ClassSelector(class_=Reduction, default=count())
    glyph = glyph.Point

class Lines(Scene):
    agg = param.ClassSelector(class_=Reduction, default=any())
    glyph = glyph.Line

class Raster(Scene):
    agg = param.ClassSelector(class_=Reduction, default=mean())
    interpolator = param.ClassSelector(class_=Upsample, default=linear())
    nan_value = param.Number(default=None)
    layer = param.Integer(default=None)

    def aggregate(self, **params):
        ...

Migration Path

The new Scene objects would go into datashader/scene.py and would be imported into the top level; scene.py would initially import the required implementation from core.py. These objects shouldn't have any effect on the existing Canvas object, which can be retained while we remove all instances of it (and of Pipeline) from the examples. At that point Canvas can be deprecated and eventually removed.
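
A minimal sketch of how scene.py might be wired up, assuming the proposed class names; the imports refer to existing datashader modules, and the classes are just the proposal sketched above:

# datashader/scene.py (hypothetical)
from param import Parameterized

from .core import bypixel                              # reuse the existing implementation
from .reductions import Reduction, count, any, mean
from . import glyphs as glyph

class Scene(Parameterized): ...
class Points(Scene): ...
class Lines(Scene): ...
class Raster(Scene): ...

# and datashader/__init__.py would add:
# from .scene import Scene, Points, Lines, Raster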

jbednar avatar Oct 07 '17 12:10 jbednar

I do agree that the API can be streamlined but I'm not wholly convinced by this proposal. I think separating the canvas or scene from the glyph was a solid design decision and this breaks that design. Shoving the data, the glyph columns, the ranges and the aggregation into a single declaration makes things easy but is imo conceptually problematic.

Defining a canvas/scene to draw on independent of the glyph means you can define a gridding which you can aggregate multiple things onto, e.g. to perform statistical operations. My main problem with what you describe is that it's not clear to me what the point is of even creating a scene object in this approach. Once you've created the scene object, what, except for calling scene.aggregate(), can you do with it? In other words, is there any reason why ds.Points(...) wouldn't directly return the aggregate in your proposal?

If you're really wedded to the idea of combining the glyph and the scene, I would, at the very least, separate the aggregate, so that you can do:

scene = ds.Points(df,'x','y', x_range=(-8,8), y_range=(-8,8))
agg = scene.aggregate(ds.count())

The HoloViews approach is to declare the glyph-like object and then aggregate it by specifying the gridding and aggregation, so I don't think there's much point in replicating that API.

Personally I'd recommend making some kind of convenience API but leave the fundamental separation of canvas/scene, glyph and aggregates in place.

philippjfr avatar Oct 07 '17 13:10 philippjfr

Personally I'd recommend making some kind of convenience API but leave the fundamental separation of canvas/scene, glyph and aggregates in place.

That sounds reasonable, but unfortunately that separation doesn't currently exist. There is a separate set of Glyph classes, but (a) it is incomplete, with only Point and Line, not Raster, and (b) in practice all the glyph machinery is only accessed through Canvas, either via convenience methods (Canvas.points and Canvas.line) or as the only viable way to build it (Canvas.raster). So I'm not only trying to make something convenient, I'm trying to make it conceptually more straightforward. (The reason this is coming up is that I'm trying to work on #437 and finding it nearly impossible to explain things in any reasonable way, either for usage or for how things really work.)

As an aside, your code above would actually already work in my proposal above, if modified to scene.aggregate(agg=ds.count()), because the aggregate() call can accept and override any parameter as needed.

Still, taking your point that it would be nice to have a Canvas-like object that defines the scene independently of the dataframe and glyph specification, how about:

Proposed User API 2.0:

Same as 1.0, but with Scene being a separate object, not a superclass, owned by the Points, Lines, and Raster objects, so that (a) it can be shared and passed around, but (b) it can also be declared at the class level and re-used without having to supply it every time.

One implication of this approach is that whenever the user does need to change the Scene parameters, they have to instantiate a Scene object explicitly and pass it in to the Points or Line or Raster object, rather than being able to throw the changed parameters into that same object's instantiation. So it's a bit less convenient in the typical case of not wanting to pass the Scene object around. A rough usage sketch:
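
This is a hypothetical sketch of API 2.0 usage; ds.Scene as a standalone class and the scene parameter are assumptions, not existing API:

# Explicit, shareable Scene object:
scene = ds.Scene(width=400, height=400, x_range=(-8, 8), y_range=(-8, 8))
agg = ds.Points(df, 'x', 'y', scene=scene).aggregate(agg=ds.reductions.count())

# Or declared once at the class level and re-used without passing it around:
ds.Points.scene = scene
agg2 = ds.Points(df, 'x2', 'y2').aggregate()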

is there any reason why ds.Points(...) wouldn't directly return the aggregate in your proposal?

Sure -- by not returning the aggregate, it can be called repeatedly with any parameter changed -- different bounds, different resolution, different agg reduction, etc. So there's no reason to return the aggregate at first, except possibly to support rich display (as outlined in "Proposed User API extension" above).

jbednar avatar Oct 07 '17 14:10 jbednar

To me that is still more conceptually confusing, since what you're now calling a scene is really just the glyph with some default parameters tacked on. To put it another way, the absolute minimum declaration for ds.Points, based on what you've been saying, would be:

glyph = ds.Points(df,'x','y')

since all the other parameters can also be supplied to the aggregate call:

agg = glyph.aggregate(x_range=(-8,8), y_range=(-8,8), agg=ds.reductions.count())

That is a reasonable proposal and is fairly close to what HoloViews does, but I can't help feeling that combining everything into one object, effectively making the glyph and the scene synonymous, confuses things and actually makes it harder to explain. I think there is a canvas/scene-centric approach (basically what datashader does now) and a data/glyph-centric approach (what HoloViews does and what ds.Points(df,'x','y') would do), but blurring the distinction between them is confusing.

philippjfr avatar Oct 07 '17 17:10 philippjfr

How would a glyph-centric approach work in this case?

In API 2.0, there are separate objects for the scene and the glyph type, but you object to one being owned by the other?

jbednar avatar Oct 07 '17 18:10 jbednar

Sorry, I somehow completely missed your new proposal. That does sound reasonable although a concrete example of what you're thinking of would help.

philippjfr avatar Oct 07 '17 18:10 philippjfr

I guess the reason I am drawn to a single object type here (for API 1.0) is that I fail to see a clean way to separate things. They most definitely are not cleanly separated now, but if you can see a way out...

jbednar avatar Oct 07 '17 18:10 jbednar

In 2.0, Scene is clear, but I'm not sure what the others would be called. There are already glyphs (e.g. Line), so what are these (e.g. Lines)?

jbednar avatar Oct 07 '17 18:10 jbednar

The API 2.0 proposal makes more sense to me and sounds more flexible.

If you're really wedded to the idea of combining the glyph and the scene, I would, at the very least, separate the aggregate

I agree with Philipp's suggestion here. Using .aggregate(ds.count()) makes more sense than specifying ds.count() before the aggregate method call.

Here I've used the method name aggregate() for concreteness, but we could presumably use __call__, since aggregating is the one obvious operation to be done on a Scene.

In this case, I think calling it aggregate is nice and explicit as I don't find the use of __call__ here obvious. The example as stated:

scene = ds.Points(df, 'x', 'y', x_range=(-8,8), y_range=(-8,8), agg=ds.reductions.count())
agg = scene.aggregate()

seems a lot clearer to me than:

scene = ds.Points(df, 'x', 'y', x_range=(-8,8), y_range=(-8,8), agg=ds.reductions.count())
agg = scene()

or

Points(...)()

I'm slightly worried about having both ds.Points and hv.Points but hopefully there won't be many situations where you would need both. Anyway, those are my current thoughts on this proposal.

jlstevens avatar Oct 08 '17 12:10 jlstevens

Just to keep things interesting, how about the converse of API 2.0:

Proposed User API 3.0:

Let's tentatively call the Points, Lines, and Raster objects GlyphSets, just so I can have a way to talk about them. Better name suggestions would be greatly appreciated.

API 3.0 is the same as 2.0, with Scene as a separate object, but instead of the GlyphSets owning a Scene, a Scene would own some list of GlyphSets. That's actually the more typical way that a drawing or rendering program works: define a workspace of some sort, then populate it with objects that combine in some configurable way. In a GUI toolkit or 2D drawing program, those objects would typically overlay, with some transparency, while in the 3D toolkits after which datashader is named, those objects would typically occlude (again possibly with some transparency).

To make this work with datashader (a rough sketch follows the list):

  1. Scene would keep a list of GlyphSets.
  2. Each GlyphSet would have an associated aggregate reduction operator (mirroring the current calls like .points(self, source, x, y, agg=None)).
  3. Scene.aggregate() would aggregate each of the provided GlyphSets, perhaps into a list or dictionary.
  4. A list of aggregates isn't immediately visualizable, so each GlyphSet would need an associated shader object.
  5. Finally, we'd need a specification for how separate shaded aggregates would be combined, e.g. tf.Stack with a configurable image-reduction operator. With that, we could call Scene.shade() to shade and combine the current aggregates, and maybe Scene.datashade() to do both aggregation and shading.
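
A rough sketch of how those pieces could fit together; every name here (Scene owning GlyphSets, per-GlyphSet agg and shader) is an assumption, with the existing tf.stack standing in for the proposed configurable image-reduction operator:

import param
from param import Parameterized
import datashader.transfer_functions as tf

class Scene(Parameterized):
    width   = param.Integer(default=600)
    height  = param.Integer(default=600)
    x_range = param.NumericTuple(default=None)
    y_range = param.NumericTuple(default=None)

    def __init__(self, *glyphsets, **params):
        super(Scene, self).__init__(**params)
        self.glyphsets = list(glyphsets)   # each GlyphSet carries its own agg and shader (steps 1-2)

    def aggregate(self):
        # one aggregate per GlyphSet (step 3)
        return [g.aggregate(self) for g in self.glyphsets]

    def shade(self, aggs):
        # shade each aggregate with its GlyphSet's shader (step 4), then combine (step 5)
        return tf.stack(*[g.shader(agg) for g, agg in zip(self.glyphsets, aggs)])

    def datashade(self):
        # aggregation plus shading in one step
        return self.shade(self.aggregate())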

I might have missed some complexities here, but from what I can see there are pros and cons:

  • Pro: Seems easy to describe at a high level (first make a Scene, then populate it with one or more GlyphSets, then you can render the result).
  • Con: Seems a bit difficult to describe in detail (given my attempt above)
  • Pro: Allows encapsulating a workflow as an executable object, to be rendered as needed (e.g. in a callback)
  • Con: Requires encapsulating a workflow as an object, which can be difficult to think about compared to an imperative series of commands
  • Con: Not sure how to express arbitrary xarray manipulations, such as the ones needed for combining pickups and dropoffs for NYC taxi data, or selecting the top 1% of values. I guess our trusty concept of output_fn would do that?

Any other comments on option 3.0? To me it sounds like something that's best provided on top of some lower-level API if we did offer it, but I'm not sure how best to break that down. Maybe keep Canvas as the parts of Scene that don't know about GlyphSets (removing .points, .lines, etc.) and then have a Scene own both a Canvas and some GlyphSets?

jbednar avatar Oct 11 '17 14:10 jbednar

At a high level, I like API 3.0; it reminds me of adding elements to a scenegraph, which, as you mentioned, is a very common concept in graphics APIs.

I need to think more about how it might map to the components of datashader but I'll say that my initial reaction is positive because it is such a common way of working. I don't think you suggested this but I can also imagine associating the aggregators with the elements as you add them to the scenegraph.

jlstevens avatar Oct 11 '17 17:10 jlstevens