
Support facet/aggregate transforms

Open · arvind opened this issue 9 years ago • 5 comments

The facet and aggregate transforms require special consideration. Facets can be instantiated by dropzones on the edges of a group. This should link the group mark to the pipeline, and switch child marks to inherit their data. If the facet is manually instantiated, Lyra 1 would inject an additional group to prevent interference with any other children. In Lyra 1, facets were displayed as tabs in the data table, and their corresponding group marks could be laid out horizontally, vertically or layered. Aggregates are like facets but don't split the data up. Summary statistics can be calculated with either.

Some open questions:

  • Vega 1 used to allow aggregate statistics to be assigned to the input tuples, letting users reference an aggregate statistic in other transforms (e.g., a formula transform, as with Bertin's hotels). This is no longer available with Vega 2. One option is to use a lookup transform behind the scenes (see the sketch after this list). Alternatively, we can investigate reintroducing this functionality in Vega/Datalib.
  • Lyra 1 would display aggregate statistics as a secondary data table. However, the user may wish to actually visualize the aggregate tuples (e.g., a histogram) rather than the source tuples. One option would be to make the backing data source clearer in the mark inspector. Another would be to hide the source data table when a mark drawn over the aggregates is selected...
  • Lyra 1 had a "once per group" checkbox for marks and scales to circumvent the above. However, this is quite unintuitive. How can we improve on this?
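
On the lookup option for the first question: a minimal sketch of what that might look like in Vega 2, assuming a derived dataset named stats (the dataset and field names here are illustrative):

"data": [
  {"name": "source", "url": "/data/cars.json"},
  {
    "name": "stats",
    "source": "source",
    "transform": [{
      "type": "aggregate",
      "groupby": ["Origin"],
      "summarize": {"Horsepower": ["mean"]}
    }]
  },
  {
    "name": "annotated",
    "source": "source",
    "transform": [
      {"type": "lookup", "on": "stats", "onKey": "Origin",
       "keys": ["Origin"], "as": ["stats"]},
      {"type": "formula", "field": "hp_ratio",
       "expr": "datum.Horsepower / datum.stats.mean_Horsepower"}
    ]
  }
]

The lookup re-attaches each tuple's group statistics as a nested object, and downstream transforms (here, a formula) can then reference them much like Vega 1's inline aggregate statistics.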

— arvind, Feb 15 '16

Compiling some notes after brainstorming sessions with @AnjirHossain.

Hovering over a field in a Pipeline's data table will display an additional "aggregation" icon. Clicking the icon expands a list of possible aggregated versions of the field. E.g., if we hover over Horsepower, the list will contain sum_Horsepower, mean_Horsepower, etc. We should show common/frequently used fields first, and then show the rest behind a "Show/Hide More" toggle link. We can use Tableau and other tools to guide our definition of "common," and the Vega docs contain a full list of aggregation operators that we should cull for usefulness within a Lyra context (e.g., values is likely confusing?).
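
For reference, Datalib's aggregation operators include count, valid, missing, distinct, sum, mean, median, variance, stdev, q1, q3, min, max, and a few others. A sketch of one possible split (the grouping into "common" vs. the rest is our judgment call, not anything prescribed by Vega):

// Operators surfaced immediately on hover.
var COMMON_AGGREGATES = ['sum', 'mean', 'median', 'count', 'min', 'max'];

// Operators behind the "Show/Hide More" toggle; `values` is culled entirely.
var MORE_AGGREGATES = ['valid', 'missing', 'distinct', 'variance', 'stdev', 'q1', 'q3'];

// Hovering over Horsepower would then suggest:
// sum_Horsepower, mean_Horsepower, median_Horsepower, ...
var suggestions = COMMON_AGGREGATES.map(function(op) {
  return op + '_Horsepower';
});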

Users can then drag a field off this list to trigger the normal data binding process. Dropzones will appear on the canvas, and when the user drops the field over one of them, the bindChannel action creator gets called.
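
As a rough sketch of the moving parts (this bindChannel signature is hypothetical, included only to fix ideas):

// Fired when the user drops mean_Horsepower on a mark's y dropzone.
store.dispatch(bindChannel(
  dsId,      // ID of the dataset the dragged field belongs to
  fieldDef,  // e.g., {name: 'Horsepower', aggregate: 'mean', ...}
  markId,    // ID of the mark that owns the dropzone
  'y'        // the Vega-Lite channel the dropzone corresponds to
));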

Recall a couple of things:

  1. The Lyra store is essentially a Vega specification with some additional attributes to track Lyra-specific things. So, for example, if you toJS() a mark in the Lyra store, you'll see that it closely matches the specification for a mark in Vega.
  2. Vega specifications only contain definitions for datasets (which describe where the data is loaded from and an array of transformations). Lyra, on the other hand, has not only datasets but also pipelines. The idea behind Lyra pipelines is to group together related datasets. Each pipeline consists of a single _source dataset, and then a number of derived datasets. Currently, there is a 1-1 correspondence between pipelines and datasets, but this PR will begin to change that.

— arvind, Aug 22 '16

Back to bindChannel. When a user drops a field on a dropzone, we need to know how to update the Lyra store. While we could write our own production rules to determine the necessary changes, it's easier to leverage work done with Vega-Lite. Each mark in the Lyra store also defines a _vlUnit, a Vega-Lite "unit" specification. When a user drops a field over a mark's dropzone, a corresponding entry is added to the mark's _vlUnit, which is then compiled by Vega-Lite (parsed.input) to produce a Vega specification (parsed.output).
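
Concretely, for a bar chart of mean Horsepower by Origin, the mark's _vlUnit would contain something like this (a sketch; Lyra's extra bookkeeping fields are elided):

{
  "mark": "bar",
  "encoding": {
    "x": {"field": "Origin", "type": "nominal"},
    "y": {"field": "Horsepower", "aggregate": "mean", "type": "quantitative"}
  }
}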

The parse* modules under bindChannel then analyze the constituent parts of the output Vega specification and make any needed changes to the Lyra store. So, for example, parseScales analyzes the scales property in the output Vega specification. It makes use of some things we know about the structure of Vega specifications produced by Vega-Lite (sketched after this list), including:

  • With unit specifications, the Vega specification contains a group mark within which all other marks, scales, axes, and legends are defined.
  • Scales are named for their Vega-Lite channel (so we get scales named x, y, color, and so forth).
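
In skeletal form, the output Vega specification we walk over looks roughly like this:

{
  "data": [...],
  "marks": [{
    "type": "group",
    "scales": [{"name": "x", ...}, {"name": "y", ...}, ...],
    "axes": [...],
    "legends": [...],
    "marks": [{"type": "rect", ...}]
  }]
}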

While not every step is necessary for every parser, the overall parsing logic can be broken down into the following steps:

  1. Find the specific object in the output Vega specification corresponding to the most recent drag-and-drop operation. E.g., find the scale associated with the dropzone the user just interacted with.
  2. Normalize the result of step (1) such that it reflects the ways properties are set in Lyra. For example, see the comments associated with parseScales::parse and parseMarks::rectSpatial.
  3. Determine whether the Lyra store already contains something similar to the result of step (2). If it does, either leave it alone or update it as appropriate. If it does not, create it.
  4. Finally, each _vlUnit also contains a _map (exposed as parsed.map). The purpose of this map is to translate from Vega's identifiers (names) to Lyra's identifiers (numbers). Thus, if parseScale finds a scale named color in the Vega specification and creates a Lyra scale with ID 7, then parsed.map.scales.color = 7. Building this map is critical for ensuring that any other references to parsed objects get wired up correctly. For example, a mark in the Vega specification may say "fill": {"scale": "color", "field": "Origin"}, and we use the _map to ensure the Lyra mark uses scale ID 7.
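
To make step (4) concrete, a downstream parser rewrites Vega's name-based references through the map along these lines (a sketch, not the literal parseMarks code):

// A sketch: rewrite one Vega scale reference into a Lyra scale ID.
function mapFillScale(vegaMark, lyraMark, parsed) {
  // In the Vega spec: "fill": {"scale": "color", "field": "Origin"}
  var fill = vegaMark.properties.update.fill;

  // parsed.map.scales.color === 7, so the Lyra mark stores the Lyra ID:
  lyraMark.properties.update.fill = {
    scale: parsed.map.scales[fill.scale],  // 7
    field: fill.field                      // "Origin"
  };
}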

— arvind, Aug 22 '16

So, with aggregation, our first task is to ensure the aggregate property is correctly set in the Vega-Lite unit specification. When a channel is set to aggregate, Vega-Lite produces an additional Vega dataset named summary which inherits data from the source dataset and then transforms it using the aggregate transformation. Thus, with this PR, our attention will be focused on fleshing out parseData and the other necessary infrastructure (action creators, reducer logic, etc.) to change the Lyra store in response.

Let's use a running example with the cars dataset: aggregating mean_Horsepower on the Y axis, grouped by Origin along the X. Users can specify this chart in two ways: (1) dragging Origin first, and then mean_Horsepower; (2) dragging mean_Horsepower first, and then Origin. In the first case, a summary dataset is only produced after the second drag-and-drop, and its aggregate transform definition will be complete (i.e., it will contain non-empty groupby and summarize properties). In the second case, a summary dataset is produced right away; its aggregate transform will contain a summarize but an empty groupby.

We should be able to support both workflows, but the first provides a good starting point. When we encounter a summary dataset, we need to mimic the flow highlighted above: check if a matching dataset already exists in Lyra, update/create it as necessary, and then populate the parsed.map.
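
For concreteness, in workflow (1), once both drops have occurred, the data array of the output Vega specification will contain something shaped like this (a trimmed-down sketch of Vega-Lite's output):

"data": [
  {"name": "source", "url": "/data/cars.json", "format": {"type": "json"}},
  {
    "name": "summary",
    "source": "source",
    "transform": [{
      "type": "aggregate",
      "groupby": ["Origin"],
      "summarize": {"Horsepower": ["mean"]}
    }]
  }
]

In workflow (2), the same summary dataset appears after the first drop, but with "groupby": [].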

How do we check if Lyra already has a matching dataset? One option would be similar to the existing parsers -- iterate through all the datasets and perform a comparison. However, we can do one better and check whether the associated pipeline has a matching aggregated dataset. The key identifier for an aggregated dataset is the set of fields it groups by. If we find an aggregated dataset with the necessary groupby fields, we can simply add to its summarize property. Thus, our pipelines need to track aggregated datasets, building a map from a string of groupby fields to the ID of the corresponding Lyra dataset (pipeline._aggregates = {}).
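
A sketch of that lookup, assuming we key _aggregates on the sorted, joined groupby fields (the exact key format is up to us; findOrCreateAggregate and summarizeAggregate are hypothetical names):

function findOrCreateAggregate(dispatch, pipeline, aggregateDef) {
  // Derive a stable key from the aggregate transform's groupby fields.
  var key = aggregateDef.groupby.slice().sort().join('|');  // e.g., 'Origin'
  var dsId = pipeline._aggregates[key];

  if (dsId !== undefined) {
    // A matching aggregated dataset exists: just extend its summarize.
    dispatch(summarizeAggregate(dsId, aggregateDef.summarize));
  } else {
    // No match: create a new aggregated dataset (see below).
    dispatch(aggregatePipeline(pipeline._id, aggregateDef));
  }
}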

If we cannot find a matching dataset, we need to create one. The flow should look similar to the addPipeline flow -- the action should be on the pipeline (e.g., aggregatePipeline), which should in turn dispatch the necessary dataset actions (i.e., addDataset). As the aggregated dataset is derived from the source, we do not need to provide any raw values to the action creator. We do, however, have to manually construct a schema. The groupby fields will have the same type definition as found in the source dataset, and we will get a new field for each summary operator that will be a quantitative number. So, going back to our running example, we might see this aggregate transform:

"transform": [{
  "type": "aggregate",
  "groupby": ["Origin"],
  "summarize": {"Horsepower": ["mean"]}
}]

Our schema would include Origin (taken from the source's schema), and then mean_Horsepower, which we manually construct (notice that aggregated fields are named in the form aggop_sourcefieldname):

schema = {
  Origin: {
    name: "Origin",    
    type: "string",
    mtype: "nominal"
  },
  mean_Horsepower: {
    name: "mean_Horsepower",
    type: "number",
    mtype: "quantitative"
  }
}
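
A sketch of constructing that schema programmatically (aggregateSchema is a hypothetical helper; the real shape of Lyra's schema records may differ slightly):

function aggregateSchema(sourceSchema, aggregateDef) {
  var schema = {};

  // groupby fields keep their type definition from the source's schema.
  aggregateDef.groupby.forEach(function(field) {
    schema[field] = sourceSchema[field];
  });

  // Each summary operator yields a new quantitative number field,
  // named in the form aggop_sourcefieldname.
  Object.keys(aggregateDef.summarize).forEach(function(field) {
    aggregateDef.summarize[field].forEach(function(op) {
      var name = op + '_' + field;
      schema[name] = {name: name, type: 'number', mtype: 'quantitative'};
    });
  });

  return schema;
}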

Since aggregatePipeline will call addDataset directly, we can pass the transform definitions in via props rather than adding separate transform actions for now.
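
Putting it together, a thunk-style sketch of aggregatePipeline (assuming redux-thunk, an Immutable.js store, and that the addDataset action creator assigns the new dataset's ID on the action it returns; the AGGREGATE_PIPELINE action shape is likewise illustrative):

function aggregatePipeline(pipelineId, aggregateDef) {
  return function(dispatch, getState) {
    var pipeline = getState().getIn(['pipelines', pipelineId]);

    // Derived dataset: no raw values needed, just a source reference plus
    // the transform definition passed in via props.
    var dsAction = addDataset({
      name: pipeline.get('name') + '_groupby_' + aggregateDef.groupby.join('_'),
      source: pipeline.get('_source'),
      transform: [aggregateDef],
      _parent: pipelineId
    });
    dispatch(dsAction);

    // Register the new dataset in the pipeline's _aggregates map.
    dispatch({
      type: 'AGGREGATE_PIPELINE',
      id: pipelineId,
      key: aggregateDef.groupby.slice().sort().join('|'),
      dsId: dsAction.id  // ID assigned by the addDataset action creator
    });
  };
}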

Once the dataset has been updated/created, do not forget to add an entry to parsed.map to track the ID of the summary dataset. Downstream parsers (i.e., scales and marks) that reference "data": "summary" will use this map to look up the correct Lyra ID. If you forget this step, your output Vega specification will have a lot of "data": undefined in it.

— arvind, Aug 22 '16

The above flow sketches out the changes we need to make to bindChannel and related infrastructure to begin supporting aggregated fields. We will also want to make the necessary changes to our interface. The first such change will be to display all of a pipeline's aggregated datasets as DataTables underneath the source. Fields should be coloured yellow rather than green (i.e., using the .derived class rather than .source). We can demarcate each aggregated dataset with a grey bar (and perhaps relabel the existing one as Source).

Once this is done, we should also provide a simple inline property inspector to allow users to inspect/update aggregates manually (i.e., it should list the groupby and aggregated fields, and users can drag additional fields from the source data table into the property inspector). This latter drag-and-drop functionality should be implemented in a generalizable way to allow dropping any field on any <Property /> element.
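
A bare-bones sketch of that generalizable behaviour, using plain HTML5 drag events and era-appropriate React.createClass (Lyra's actual drag-and-drop plumbing may well differ):

// A <Property /> that accepts any dropped field and reports it upward.
var Property = React.createClass({
  handleDragOver: function(evt) {
    evt.preventDefault();  // required so the browser treats this as a drop target
  },
  handleDrop: function(evt) {
    // Assumes the drag source serialized the field definition on dragstart.
    var field = JSON.parse(evt.dataTransfer.getData('text/plain'));
    this.props.onDropField(field);  // e.g., add to groupby or summarize
  },
  render: function() {
    return React.createElement('div',
      {onDragOver: this.handleDragOver, onDrop: this.handleDrop},
      this.props.children);
  }
});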

— arvind, Aug 22 '16

Based on your most recent commits, @AnjirHossain, here's some additional clarification. If I import the Cars dataset, I'll see a dataset with ID 4 in the store that looks like the following:

{
  "_id": 4,
  "_parent": 3,
  "name": "cars",
  "url": "/data/cars.json",
  "format": {
    "parse": "auto",
    "type": "json"
  }
}

Its pipeline has ID 3 and the following in the store:

{
  "_id": 3,
  "_source": 4,
  "name": "cars"
}

Now, if I drag out Origin and mean_Horsepower, my store will look like so (IDs may vary):

"pipelines": {
  3: {
    "_id": 3,
    "_source": 4,
    "name": "cars",
    "_aggregates": {
      "Origin": 5
    }
  }
},
"datasets": {
  4: {...},
  5: {
    "_id": 5,
    "_parent": 3,
    "name": "cars_groupby_Origin",
    "source": 4,
    "transform": [{
      "type": "aggregate",
      "groupby": ["Origin"],
      "summarize": {"Horsepower": ["mean"]}
    }]
  }
}

And parsed.map.data.summary should equal 5. Hope this helps!

— arvind, Aug 22 '16