Support facet/aggregate transforms
The facet and aggregate transforms require special consideration. Facets can be instantiated by dropzones on the edges of a group. This should link the group mark to the pipeline and switch child marks to inherit their data. If the facet is manually instantiated, Lyra 1 would inject an additional group to prevent interference with any other children. In Lyra 1, facets were displayed as tabs in the data table, and their corresponding group marks could be laid out horizontally, vertically, or as layers. Aggregates are like facets but don't split the data up. Summary statistics can be calculated with either.
Some open questions:
- Vega 1 used to allow aggregate statistics to be assigned to the input tuples. This let users reference an aggregate statistic in other transforms (e.g., a formula, as with Bertin's hotels). This is no longer available in Vega 2. One option is to use a lookup transform behind the scenes. Alternatively, we can investigate reintroducing this functionality in Vega/Datalib.
- Lyra 1 would display aggregate statistics as a secondary data table. However, the user may wish to actually visualize the aggregate tuples (e.g., a histogram) rather than the source tuples. One option would be to make the backing data source clearer in the mark inspector. Another would be to hide the source data table when a mark drawn over the aggregates is selected...
- Lyra 1 had a "once per group" checkbox for marks and scales to circumvent the above. However, this is quite unintuitive. How can we improve on this?
Compiling some notes after brainstorming sessions with @AnjirHossain.
Hovering over a field in a Pipeline's data table will display an additional "aggregation" icon. Clicking the icon expands a list of possible aggregated versions of the field. E.g., if we hover over `Horsepower`, the list will contain `sum_Horsepower`, `mean_Horsepower`, etc. We should show common/frequently used fields first, and then show the rest behind a "Show/Hide More" toggle link. We can use Tableau and other tools to guide our definition of "common," and the Vega docs contain a full list of aggregation operators that we should cull for usefulness within a Lyra context (e.g., `values` is likely confusing?).
Users can then drag a field off this list to trigger the normal data binding process. Dropzones will appear on the canvas, and when the user drops the field over one of them, the `bindChannel` action creator gets called.
Recall a couple of things:
- The Lyra store is essentially a Vega specification with some additional attributes to track Lyra-specific things. So, for example, if you `toJS()` a mark in the Lyra store, you'll see that it closely matches the specification for a mark in Vega.
- Vega specifications only contain definitions for datasets (which describe where the data is loaded from and an array of transformations). Lyra, on the other hand, has not only datasets but also pipelines. The idea behind Lyra pipelines is to group together related datasets. Each pipeline consists of a single `_source` dataset and then a number of derived datasets. Currently, there is a 1-1 correspondence between pipelines and datasets, but this PR will begin to change that.
Back to `bindChannel`. When a user drops a field on a dropzone, we need to know how to update the Lyra store. While we could write our own production rules to determine the necessary changes, it's easier to leverage work done with Vega-Lite. Each mark in the Lyra store also defines a `_vlUnit`, a Vega-Lite "unit" specification. When a user drops a field over a mark's dropzone, a corresponding entry is added to the mark's `_vlUnit`, which is then compiled by Vega-Lite (`parsed.input`) to produce a Vega specification (`parsed.output`).
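To make that relationship concrete, here is a rough sketch (the variable names are illustrative; only `_vlUnit`, `parsed.input`, and `parsed.output` come from the notes above, and the exact Vega-Lite compile API may differ by version):

```js
// A mark's _vlUnit: a Vega-Lite unit specification built up as the user
// drops fields on the mark's dropzones.
var unit = {
  mark: 'rect',
  encoding: {
    x: {field: 'Origin', type: 'nominal'},
    y: {field: 'Horsepower', type: 'quantitative'}
  }
};

// Compiling the unit spec yields the Vega specification that the parse*
// modules analyze.
var vl = require('vega-lite');
var parsed = {
  input: unit,                   // parsed.input: the Vega-Lite spec
  output: vl.compile(unit).spec, // parsed.output: the compiled Vega spec
  map: {scales: {}, marks: {}, data: {}} // Vega names -> Lyra IDs (see below)
};
```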
The `parse*` modules under `bindChannel` then analyze the constituent parts of the output Vega specification and make any needed changes to the Lyra store. So, for example, `parseScales` analyzes the `scales` property in the output Vega specification. It makes use of some things we know about the structure of Vega specifications produced by Vega-Lite, including:
- With unit specifications, the Vega specification contains a group mark within which all other marks, scales, axes, and legends are defined.
- Scales are named for their Vega-Lite channel (so we get scales named `x`, `y`, `color`, and so forth).
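For instance, the compiled output for a unit spec is shaped roughly like this (heavily abbreviated, illustrative sketch only):

```js
var output = {
  "data": [{"name": "source"}],
  "marks": [{
    "type": "group",   // the single wrapping group mark
    "scales": [{"name": "x"}, {"name": "y"}, {"name": "color"}],
    "axes": [],        // axes and legends also live inside the group
    "legends": [],
    "marks": [{"type": "rect"}]  // the mark the user is actually editing
  }]
};
```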
While not every step is necessary for every parser, the overall parsing logic can be broken down into the following steps:
- Find the specific object in the output Vega specification corresponding to the most recent drag-and-drop operation. E.g., find the scale associated with the dropzone the user just interacted with.
- Normalize the result of step (1) such that it reflects the way properties are set in Lyra. For example, see the comments associated with `parseScales::parse` and `parseMarks::rectSpatial`.
- Determine whether the Lyra store already contains something similar to the result of step (2). If it does, either leave it alone or update it as appropriate. If it does not, create it.
- Finally, each `_vlUnit` also contains a `_map` (exposed as `parsed.map`). The purpose of this map is to translate from Vega's identifiers (names) to Lyra's identifiers (numbers). Thus, if `parseScales` finds a scale named `color` in the Vega specification and creates a Lyra scale with ID `7`, then `parsed.map.scales.color = 7`. Building this map is critical for ensuring that any other references made to parsed objects get wired up correctly. For example, a mark in the Vega specification may say `"fill": {"scale": "color", "field": "Origin"}`, and we use the `_map` to ensure the Lyra mark uses scale ID `7`.
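As a concrete illustration of that last step (the surrounding variable names are hypothetical; the `fill` example and `parsed.map` come from the notes above):

```js
// A mark in the output Vega spec references its scale by name:
var vgFill = {scale: 'color', field: 'Origin'};

// parseScales created Lyra scale ID 7 for the Vega scale named 'color',
// and recorded the translation:
parsed.map.scales.color = 7;

// Downstream, the Lyra mark swaps the Vega name for the Lyra ID:
var lyraFill = {
  scale: parsed.map.scales[vgFill.scale], // 7
  field: vgFill.field
};
```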
So, with aggregation, our first task is to ensure the `aggregate` property is correctly set in the Vega-Lite unit specification. When a channel is set to aggregate, Vega-Lite produces an additional Vega dataset named `summary`, which inherits data from the `source` dataset and then transforms it using the `aggregate` transformation. Thus, with this PR, our attention will be focused on fleshing out `parseData` and the other necessary infrastructure (action creators, reducer logic, etc.) to change the Lyra store in response.
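Sketched out (the exact output shape may vary by Vega-Lite version):

```js
// If the y channel of a mark's _vlUnit is set to an aggregate...
var y = {field: 'Horsepower', type: 'quantitative', aggregate: 'mean'};

// ...the compiled Vega spec's data array gains a summary dataset like:
var summary = {
  name: 'summary',
  source: 'source', // inherits from the source dataset
  transform: [{
    type: 'aggregate',
    groupby: ['Origin'],
    summarize: {Horsepower: ['mean']}
  }]
};
```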
Let's use a running example with the cars dataset: aggregating `mean_Horsepower` on the Y axis, grouped by `Origin` along the X. Users can specify this chart in two ways: (1) dragging `Origin` first, and then `mean_Horsepower`; (2) dragging `mean_Horsepower` first, and then `Origin`. In the first case, a `summary` dataset is only produced after the second drag-and-drop, and its aggregate transform definition will be complete (i.e., it will contain non-empty `groupby` and `summarize` properties). In the second case, a `summary` dataset is produced right away; its aggregate transform will contain a `summarize` but an empty `groupby`.
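In workflow (2), for example, the intermediate `summary` dataset's aggregate transform would look something like this until `Origin` is dropped:

```
"transform": [{
  "type": "aggregate",
  "groupby": [],
  "summarize": {"Horsepower": ["mean"]}
}]
```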
We should be able to support both workflows, but the first provides us a good starting point. When we encounter a `summary` dataset, we need to mimic the flow highlighted above, including checking whether a matching dataset already exists in Lyra, updating/creating it as necessary, and then populating the `parsed.map`.
How do we check if Lyra already has a matching dataset? One option would be similar to the existing parsers -- iterate through all the datasets and perform a comparison. However, we can do one better and check whether the associated pipeline has a matching aggregated dataset. The key identifier for aggregated datasets is the fields they `groupby`. If we find an aggregated dataset with the necessary `groupby` fields, we can just add to its `summarize` property. Thus, our pipelines need to track aggregated datasets, building a map from a string of `groupby` fields to the ID of the corresponding Lyra dataset (`pipeline._aggregates = {}`).
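A minimal sketch of that bookkeeping (the key-construction helper and surrounding variables are hypothetical; only `pipeline._aggregates` comes from the notes above):

```js
// Derive a stable key from the groupby fields, e.g. ['Origin'] -> 'Origin'.
function aggregateKey(groupby) {
  return groupby.slice().sort().join('|');
}

var key = aggregateKey(transform.groupby);
var datasetId = pipeline._aggregates[key];

if (datasetId !== undefined) {
  // A matching aggregated dataset exists: extend its summarize property.
} else {
  // No match: create a new aggregated dataset (see aggregatePipeline below)
  // and record it: pipeline._aggregates[key] = newDatasetId;
}
```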
If we cannot find a matching dataset, we need to create one. The flow for this should look similar to the `addPipeline` flow -- the action should be on the pipeline (e.g., `aggregatePipeline`), which should in turn dispatch the necessary dataset actions (i.e., `addDataset`). As the aggregated dataset is derived from the source, we do not need to provide any raw `values` to the action creator. We do, however, have to manually construct a `schema`. The `groupby` fields will have the same type definition as found in the source dataset, and then we will get a new field for each summary operator that will be a quantitative number. So, going back to our running example, we might see this aggregate transform:
"transform": [{
"type": "aggregate",
"groupby": ["Origin"],
"summarize": {"Horsepower": ["mean"]}
}]
Our schema would include `Origin` (taken from the source's schema), and then `mean_Horsepower`, which we manually construct (notice that aggregated fields are named in the form `aggop_sourcefieldname`):
```
schema = {
  Origin: {
    name: "Origin",
    type: "string",
    mtype: "nominal"
  },
  mean_Horsepower: {
    name: "mean_Horsepower",
    type: "number",
    mtype: "quantitative"
  }
}
```
Since `aggregatePipeline` will call `addDataset` directly, we can pass the `transform` definitions in via `props` rather than adding additional transform actions for now.
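Putting the last few paragraphs together, `aggregatePipeline` might be sketched like so (a rough redux-thunk-style sketch under many assumptions; only `aggregatePipeline`, `addDataset`, the schema construction, and passing the transform via props come from the notes above, and the schema helper is invented):

```js
function aggregatePipeline(pipelineId, aggregate) {
  return function(dispatch, getState) {
    var pipeline = getState().getIn(['pipelines', String(pipelineId)]);

    // Manually construct the schema: groupby fields copy their type
    // definitions from the source dataset; each summary operator adds a
    // quantitative "number" field named aggop_sourcefieldname.
    var schema = buildAggregateSchema(pipeline, aggregate); // hypothetical helper

    // Pass the transform definition in directly via props, rather than
    // dispatching separate transform actions.
    dispatch(addDataset({
      name: pipeline.get('name') + '_groupby_' + aggregate.groupby.join('_'),
      source: pipeline.get('_source'),
      transform: [aggregate],
      _schema: schema
    }));
  };
}
```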
Once the dataset has been updated/created, do not forget to add an entry to `parsed.map` to track the ID of the `summary` dataset. Downstream parsers (i.e., scales and marks) that reference `"data": "summary"` will use this map to look up the correct Lyra ID. If you forget this step, your output Vega specification will have a lot of `"data": undefined` in it.
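In other words (illustrative snippet, using the IDs from the running example below):

```js
// After parseData handles the summary dataset, record the translation:
parsed.map.data.summary = 5;

// Downstream parsers can then rewrite references like
//   {"data": "summary", "field": "mean_Horsepower"}
// to point at Lyra dataset 5 instead of leaving "data": undefined.
```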
The above flow sketches out the changes we need to make to `bindChannel` and related infrastructure to begin to support aggregated fields. We will also want to make the necessary changes to our interface. The first such change will be to display all of a pipeline's aggregated datasets as DataTables underneath the source. Fields should be coloured yellow (using the `.derived` class rather than `.source`) rather than green. We can demarcate each aggregated dataset with a grey bar (and perhaps relabel the existing one as `Source`).
Once this is done, we should also provide a simple inline property inspector to allow users to inspect/update aggregates manually (i.e., it should list the `groupby` fields and aggregated fields, and users can drag additional fields from the source data table into the property inspector). This latter drag-and-drop functionality should be implemented in a generalizable way to allow dropping any field on any `<Property />` element.
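One possible shape for that generalized drop handling (purely illustrative; the component, props, and event wiring are all assumptions, not existing Lyra code):

```js
var React = require('react');

// Hypothetical <Property /> that accepts any dragged field via the HTML5
// drag-and-drop events; the parent decides what the dropped field means
// (e.g., appending to an aggregate's groupby or summarize).
var Property = React.createClass({
  handleDrop: function(evt) {
    evt.preventDefault();
    var field = JSON.parse(evt.dataTransfer.getData('text/plain'));
    this.props.onDropField(field);
  },

  render: function() {
    return (
      <div className="property"
        onDragOver={function(evt) { evt.preventDefault(); }}
        onDrop={this.handleDrop}>
        {this.props.children}
      </div>
    );
  }
});
```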
Based on your most recent commits, @AnjirHossain, here's some additional clarification. If I import the Cars dataset, I'll see a dataset with ID 4 in the store that looks like the following:
```
{
  "_id": 4,
  "_parent": 3,
  "name": "cars",
  "url": "/data/cars.json",
  "format": {
    "parse": "auto",
    "type": "json"
  }
}
```
Its pipeline has ID 3 and looks like the following in the store:
```
{
  "_id": 3,
  "_source": 4,
  "name": "cars"
}
```
Now, if I drag out `Origin` and `mean_Horsepower`, my store will look like so (IDs may vary):
"pipelines": {
3: {
"_id": 3,
"_source": 4,
"name": "cars",
"_aggregates": {
"Origin": 5
}
}
},
"datasets": {
4: {...},
5: {
"_id": 5,
"_parent": 3,
"name": "cars_groupby_Origin",
"source": 4,
"transform": [{
"type": "aggregate",
"groupby": ["Origin"],
"summarize": {"Horsepower": ["mean"]}
}]
}
}
And `parsed.map.data.summary` should equal `5`. Hope this helps!