vega-lite
Support violin plot and probability density plots
From https://vega.github.io/vega/examples/violin-plot/
A violin plot visualizes a distribution of quantitative values as a continuous approximation of the probability density function, computed using kernel density estimation (KDE). The densities are additionally annotated with the median value and interquartile range, shown as black lines. Violin plots can be more informative than classical box plots.
https://vega.github.io/vega/examples/probability-density/ is another related example
- [ ] Understand the https://vega.github.io/vega/examples/violin-plot/ and https://vega.github.io/vega/examples/probability-density/ examples thoroughly, search online to understand other violin and density plot variants, and define the scope that we want to support.
- [ ] Understand thoroughly how we implement composite marks by looking at the [box-plot codebase](https://github.com/vega/vega-lite/blob/master/src/compositemark/boxplot.ts). (By summer, we should have a reasonable `error-bar` example as well.)
- [ ] Design a `density` transform in Vega-Lite and see if we can already use the area mark to reproduce the density area for violin.
- [ ] Design composite mark syntax for `violin` (and `density` plot?)
  - [ ] First we can focus on just the violin area part: design a MarkDefinition block for Violin so that we can define properties of the underlying `density` transform and other related properties
  - [ ] Decide if we need a composite mark for density plot (probably yes), and make sure that the syntax for violin and density are consistent. (Also think about whether there is a better name for `density`.)
  - [ ] For violin plot, we need to decide if we want to include interquartile range and median as a part of the violin composite mark (which is sort of like the "box" overlay on top of violin plot). The syntax here should be very consistent with box-plot.
- [ ] Implement the code. Note that there is probably a good way to share at least some part of the implementation between the violin and density plot.
The tricky part about this is that Vega's violin plot depends on the Vega `facet` operator to split data into subgroups before passing it to the density transform. (Density happens inside a nested facet.)
- Consider the solution above that suggests implementing the density transform first. Given that VL's facet also always applies layout, we can't reproduce the violin example with an axis by implementing density as a transform unless we do one of the following:
  a) Make Vega density support `groupby` (which is basically in-place faceting)
  b) Support a variant of `facet` without layout (pure facet in the data transformation sense)
  Note: we can reproduce the violin plot using the VL `facet` operator, but we will then rely on row instead of y position for each violin.
- Alternatively, we could consider implementing violin as its own special mark that produces the underlying density transform. However, this approach will be less composable. (For example, density plots such as https://vega.github.io/vega/examples/probability-density/ shouldn't be their own mark but should rather use an area mark to plot the output of density transforms.)
We met today to talk about this and concluded that we should make the Vega density transform support groupby.
> We met today to talk about this and concluded that we should make the Vega density transform support groupby.
@kanitw Any progress on implementation?
No update yet
Thanks for Vega-lite.
I often use violin plots and I am looking forward to using them in Vega-Lite/Altair.
In addition, I use a lot of ridge plots (half violin) like this one:
Would you consider adding an option to the violin plot to allow similar figures to be made?
I made an implementation in python, with mark area and a custom kde function, but it is rather tedious.
Also, would similar figures in histogram be possible (for discrete variable)?
I'm sure anyone using Bayesian statistics would be grateful.
Yes, once we have a kde transform in Vega, we can also support ridge plots.
> Yes, once we have a kde transform in Vega, we can also support ridge plots.
Has it already landed in Vega 5.0 (https://vega.github.io/vega/docs/transforms/density/)?
We've had this transform for a while but it does not support faceting and that's a deal breaker. We've come to the conclusion that we need a kde transform that has a group by key.
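To make the "group by key" idea concrete, here is a minimal sketch of a density estimate that partitions the data in place instead of relying on facet. It is only an illustration under stated assumptions: `groupedDensity`, `kdeAt`, `DensityRow`, and the fixed Gaussian bandwidth are hypothetical names and choices, not Vega's implementation.

```typescript
// Hypothetical sketch of "density with groupby"; this is not Vega's code.
interface DensityRow { group: string; value: number; density: number; }

// Gaussian kernel density estimate at x, using a fixed bandwidth (an assumption).
function kdeAt(samples: number[], bandwidth: number, x: number): number {
  let sum = 0;
  for (let i = 0; i < samples.length; i++) {
    const u = (x - samples[i]) / bandwidth;
    sum += Math.exp(-0.5 * u * u);
  }
  return sum / (samples.length * bandwidth * Math.sqrt(2 * Math.PI));
}

function groupedDensity(
  rows: { group: string; value: number }[],
  extent: [number, number],
  steps: number,
  bandwidth: number
): DensityRow[] {
  // Partition the values by key: the "in-place faceting" step.
  const groups: Record<string, number[]> = {};
  for (let i = 0; i < rows.length; i++) {
    const r = rows[i];
    if (!groups[r.group]) groups[r.group] = [];
    groups[r.group].push(r.value);
  }
  // Sample every group's density over the same shared extent.
  const out: DensityRow[] = [];
  const keys = Object.keys(groups);
  for (let k = 0; k < keys.length; k++) {
    for (let i = 0; i <= steps; i++) {
      const x = extent[0] + ((extent[1] - extent[0]) * i) / steps;
      out.push({ group: keys[k], value: x, density: kdeAt(groups[keys[k]], bandwidth, x) });
    }
  }
  return out;
}
```

Because the output is one flat table with a group column, a single top-level scale can be fit to all densities at once, which nested facets make awkward.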
Depends on https://github.com/vega/vega/pull/1783
Once the new Vega KDE support lands, I think the first step here is probably to add a new `density` transform to Vega-Lite that maps to the Vega `kde` transform, with syntax such as:
{
density: string; // value field to estimate density for
groupby?: string[];
method?: 'pdf' | 'cdf';
extent?: [number, number];
bandwidth?: number;
steps?: number;
as?: [string, string]
}
I think it should be called `density` rather than `kde`, as (1) density is a proper word, not an abbreviation, and (2) I can imagine extending the implementation in the future to fit a normal density (or log-normal, or Poisson, etc.) to the input data, not just a kernel density estimate.
Maybe `method?: 'pdf' | 'cdf';` -> `cumulative?: boolean`. `as` should not be optional in Vega-Lite.
@domoritz I definitely prefer your suggestion of `cumulative?: boolean`.
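Putting the two suggestions together (`cumulative` instead of `method`, and a required `as`), the proposal could be sketched as below. The `DensityTransform` interface and the `toVegaKde` helper are hypothetical names for illustration; the Vega-side parameter names (`field`, `groupby`, `cumulative`, `extent`, `bandwidth`, `steps`, `as`) are assumed to match the Vega `kde` transform documentation.

```typescript
// Proposed Vega-Lite transform shape from this thread, with the suggested tweaks.
interface DensityTransform {
  density: string;            // value field to estimate density for
  groupby?: string[];
  cumulative?: boolean;       // replaces method?: 'pdf' | 'cdf'
  extent?: [number, number];
  bandwidth?: number;
  steps?: number;
  as: [string, string];       // required rather than optional
}

// Hypothetical helper: lower the Vega-Lite transform to a Vega `kde` transform object.
function toVegaKde(t: DensityTransform): Record<string, unknown> {
  const kde: Record<string, unknown> = { type: "kde", field: t.density, as: t.as };
  // Copy optional parameters only when they were specified.
  if (t.groupby !== undefined) kde["groupby"] = t.groupby;
  if (t.cumulative !== undefined) kde["cumulative"] = t.cumulative;
  if (t.extent !== undefined) kde["extent"] = t.extent;
  if (t.bandwidth !== undefined) kde["bandwidth"] = t.bandwidth;
  if (t.steps !== undefined) kde["steps"] = t.steps;
  return kde;
}
```

For example, `toVegaKde({density: "tip", groupby: ["time"], as: ["value", "density"]})` yields an object with `type: "kde"` and `field: "tip"` and leaves the unspecified options out.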
Also, when adding violin plots we may want to support multiple scaling options. The default (at present) is that all violins share the same scale based on the sampled density estimates, which of course was a primary motivation for adding the `kde` transform with groupby support in Vega. We may still also want to support other forms of scaling or normalization.
The reason I'm thinking about this is that, if an explicit bandwidth parameter is not applied, each group will have its bandwidth independently set using an estimation heuristic. This means that each plot has different kernel width, which in turn means that one could have potentially large disparities in how much of the probability mass gets "clipped" when drawing violins only over the domain of observed data values. The tails of the KDE distribution get cut off, such that the total amount of probability mass shown in each violin is unequal. (This issue can still arise with a shared bandwidth parameter, it's just not as extreme.) It may be that the "right" thing to do is add a normalization pass in the KDE transform whenever we have more than one group.
So, I think we might need to do some additional research into the "proper" scaling and trimming of violins. I don't know how carefully other tools have looked at this!
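The clipping concern can be checked numerically. The sketch below is an illustration only, assuming a Gaussian kernel and trapezoidal integration; `gaussianKde` and `massWithinExtent` are hypothetical helpers. It measures how much probability mass of a KDE falls inside the range of the observed data for a given bandwidth.

```typescript
// Gaussian KDE as a function of x (assumed kernel; not Vega's implementation).
function gaussianKde(data: number[], bandwidth: number): (x: number) => number {
  const norm = 1 / (data.length * bandwidth * Math.sqrt(2 * Math.PI));
  return (x: number): number => {
    let sum = 0;
    for (let i = 0; i < data.length; i++) {
      const u = (x - data[i]) / bandwidth;
      sum += Math.exp(-0.5 * u * u);
    }
    return norm * sum;
  };
}

// Probability mass that survives when the density is drawn only over [min, max]
// of the observed data, estimated with the trapezoidal rule.
function massWithinExtent(data: number[], bandwidth: number, steps: number = 2000): number {
  const lo = Math.min(...data);
  const hi = Math.max(...data);
  const f = gaussianKde(data, bandwidth);
  const dx = (hi - lo) / steps;
  let area = 0;
  for (let i = 0; i < steps; i++) {
    const a = lo + i * dx;
    area += 0.5 * (f(a) + f(a + dx)) * dx;
  }
  return area;
}

const sample = [1, 2, 3, 4, 5];
const narrowMass = massWithinExtent(sample, 0.2); // small bandwidth
const wideMass = massWithinExtent(sample, 1.0);   // large bandwidth keeps less mass
console.log(narrowMass, wideMass);
```

Both values are below 1, and the wider kernel loses more mass to trimming, so two groups with different heuristic bandwidths would show unequal total areas.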
The ggplot violin options page shows that these questions are largely left to end users, with the default being the same as proposed above (without normalization of trimmed density areas):
From https://ggplot2.tidyverse.org/reference/geom_violin.html:
- trim | If TRUE (default), trim the tails of the violins to the range of the data. If FALSE, don't trim the tails.
- scale | if "area" (default), all violins have the same area (before trimming the tails). If "count", areas are scaled proportionally to the number of observations. If "width", all violins have the same maximum width.
Note that Vega currently supports options corresponding to ggplot's `area` and `width` values for the `scale` parameter, based on how we configure the scale domain. Our KDE implementation normalizes (divides by the number of data points) to form a proper PDF, so we could support a `count` option (if desired) by multiplying the estimated density by the count of points within a group. If that is of interest we could update the `kde` transform accordingly.
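The three ggplot-style options can be expressed as a post-processing step on per-group density samples. This is a sketch under assumptions: `ViolinScaleMode`, `GroupDensity`, and `rescaleDensity` are hypothetical names, and rescaling the values directly is just one way to illustrate the idea (the comment above describes achieving `area` and `width` through the scale domain instead).

```typescript
type ViolinScaleMode = "area" | "count" | "width";

interface GroupDensity {
  key: string;
  count: number;      // number of observations in the group
  density: number[];  // sampled PDF values (already normalized by count)
}

function rescaleDensity(g: GroupDensity, mode: ViolinScaleMode): number[] {
  if (mode === "count") {
    // Undo the per-group normalization: area becomes proportional to the count.
    return g.density.map((d: number) => d * g.count);
  }
  if (mode === "width") {
    // Every violin reaches the same maximum width.
    const peak = Math.max(...g.density);
    return g.density.map((d: number) => (peak > 0 ? d / peak : 0));
  }
  // "area": keep the PDF as is, so all violins have equal area before trimming.
  return g.density.slice();
}
```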
@jheer said about implementing violin plots with the new KDE transform in Vega:
The issue is not one of performance or extra transforms, but of correctness. (FWIW, I'd want to avoid a "density-center" option, as that strikes me as confusing and an abstraction-level violation.) The previous Vega violin plot example used stack, and it worked because all densities were scaled independently and so used the full width/height of the scale band. But this independent scaling is misleading and hampers accurate comparison.
The new KDE transform supports groupby, so we can use the output to define the domain of a scale at the top-level, which then scales all the densities in a proper fashion. The result is that different densities have different max width/height. Yet, the stack transform `center` option only centers the mark relative to the observed height (not the max height among all densities), causing inappropriate, non-uniform center-line offsets for the different densities.
My solution in Vega is to instead use xc/width or yc/height for the violin densities (as well as using xc or yc for the median and IQR annotations). This is simple and correct. A top-level linear scale is used to provide the width / height values.
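The xc/width approach can be sketched as plain geometry. This is an illustration, not the Vega spec itself: `violinSlices`, `ViolinSlice`, and the band centers passed in are hypothetical. A single linear scale whose domain is the maximum density across all groups provides the widths, and each slice is centered on its band's center line.

```typescript
interface ViolinSlice { xc: number; width: number; x0: number; x1: number; }

function violinSlices(
  densities: Record<string, number[]>,   // sampled densities per group
  bandCenters: Record<string, number>,   // pixel center of each group's band
  bandWidth: number                      // pixel width available to each violin
): Record<string, ViolinSlice[]> {
  // Shared domain: the maximum density among all groups, not per group.
  let globalMax = 0;
  const keys = Object.keys(densities);
  for (let k = 0; k < keys.length; k++) {
    const values = densities[keys[k]];
    for (let i = 0; i < values.length; i++) globalMax = Math.max(globalMax, values[i]);
  }
  const result: Record<string, ViolinSlice[]> = {};
  for (let k = 0; k < keys.length; k++) {
    const xc = bandCenters[keys[k]];
    result[keys[k]] = densities[keys[k]].map((d: number): ViolinSlice => {
      const width = globalMax > 0 ? (d / globalMax) * bandWidth : 0;
      // Centering on xc keeps every violin symmetric about the same center line.
      return { xc: xc, width: width, x0: xc - width / 2, x1: xc + width / 2 };
    });
  }
  return result;
}
```

Only the group holding the global maximum fills its band; the others are narrower but stay centered, which is the behavior the stack `center` option could not provide.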
Btw, I ran into a "split violin plot" in seaborn. It's definitely worth considering how this fits into our grammar.
Interesting! An alternative that might be a bit better perceptually could be to directly layer (overlay) the conditional violins (or zero-baseline distribution areas) with some opacity. That would make the value and shape comparisons even more apparent. I hope new VL extensions can also support that, which should hopefully be simpler to specify (or, at least, require less new surface area).
Ridge plots are another alternative for this kind of thing and often work well.
There's a good package for ggplot for generating them.
Looks like ridge plots are supported now (can groupby in density transform), haven't figured out how to pull off violin plots yet though
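For reference, a faceted density using the `density` transform with `groupby` could look roughly like the spec below, written as a TypeScript object. This is a sketch: it borrows the tips dataset URL and the `tip` and `time` fields that appear in a later comment in this thread, the explicit CSV `parse` entry is an added assumption to make sure `tip` is numeric, and the name `densitySpec` is hypothetical.

```typescript
// A Vega-Lite spec as a plain object: one density per `time` group, laid out in rows.
const densitySpec = {
  data: {
    url: "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv",
    // Assumption: parse `tip` explicitly, since it is only used inside the transform.
    format: { type: "csv", parse: { tip: "number" } }
  },
  transform: [
    // Estimate the density of `tip` separately for each value of `time`.
    { density: "tip", groupby: ["time"], as: ["value", "density"] }
  ],
  mark: "area",
  encoding: {
    x: { field: "value", type: "quantitative", title: "tip" },
    y: { field: "density", type: "quantitative" },
    row: { field: "time", type: "nominal" }
  }
};

console.log(JSON.stringify(densitySpec, null, 2));
```

Note that this still uses `row` to separate the groups, so it does not place the densities on a shared categorical y axis the way a violin plot would.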
@domoritz said my comments were welcome so here you go. Do tell me if this is off topic :)
Basically my feeling about a lot of uncertainty vis these days is you break it into (1) a representation of a distribution (be it analytical or empirical) as a PDF (f(x)), CDF (F(x)), and inverse CDF (F^-1(x)); and (2) mappings of those functions onto visual channels.
Then the question is, is there a mark/geom (probably closest is area in vega-lite, though it might not be quite the right one---can you map a continuous variable onto color in an area?) that lets you use those mappings to create densities, violins, gradient plots, CDF barplots, etc. FWIW, I made a "slab" geom for doing this in tidybayes on top of ggplot (and a composite "slabinterval", which is a slab combined with an interval). All of the geoms below (except the dotplots) are just shortcuts for different variants of the underlying slab+interval geom:
It's a bit different from how area works in either ggplot or vega-lite in that, because it is not intended for stacking, it does not use the "y" aesthetic/channel for the height of the slab; rather it uses "thickness" (or I suppose you could call it "width" but that already has another meaning in ggplot). This allows you to map a different variable to the y axis to easily create ridge plots / half-eye densities / etc where you would normally use intervals, without having to screw around with creating facets (this is incredibly useful for visualizing coefficients and the like, because creating facets just for coefficients is a pain --- you have to mess with header text angle usually --- plus often you want to facet over something else). It also allows color and opacity to vary within the geom, which is useful for creating gradient plots and for creating densities with highlighted regions.
Anyway the upshot is, if you think in terms of abstract grammar-of-graphics mappings from data onto channels (so, not about the particular syntax of a given package, but a formal description of the visualization: "z -> x position" being the equivalent of `aes(x = z)` in ggplot or an encoding of `{"x": "z"}` in vega-lite), you might have a density plot for a variable z described as something like this:
z -> x position
f(z) -> thickness
or a gradient plot described as:
z -> x position
f(z) -> opacity
or a CCDF barplot described as:
z -> x position
1 - F(z) -> thickness
If you then add in the ability to do densities / CDFs / etc of analytical distributions (which is what the stat_dist_slab geom does), you can do the equivalent of:
z -> x position
f_Normal(z|mu, sigma) -> thickness
Which is how you'd do a density plot for a normal distribution. Given an implementation of the Normal and the scaled-and-shifted t distribution you'd be able to do confidence distributions for a lot of common ways of summarizing uncertainty from frequentist models (so that gets you, basically, halfeyes / gradient plots / whatever else for visualizing uncertainty).
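As a small illustration of the analytical case, the mapping "z -> x position, f_Normal(z|mu, sigma) -> thickness" can be computed directly. The names `normalPdf`, `SlabPoint`, and `normalSlab` are hypothetical and this is not how tidybayes implements its geoms; it only shows the data that such a mapping would produce.

```typescript
// Density of a Normal(mu, sigma) distribution at z.
function normalPdf(z: number, mu: number, sigma: number): number {
  const u = (z - mu) / sigma;
  return Math.exp(-0.5 * u * u) / (sigma * Math.sqrt(2 * Math.PI));
}

interface SlabPoint { x: number; thickness: number; }

// Sample the analytical density on a grid: z -> x position, f(z) -> thickness.
function normalSlab(
  mu: number,
  sigma: number,
  extent: [number, number],
  steps: number
): SlabPoint[] {
  const points: SlabPoint[] = [];
  for (let i = 0; i <= steps; i++) {
    const z = extent[0] + ((extent[1] - extent[0]) * i) / steps;
    points.push({ x: z, thickness: normalPdf(z, mu, sigma) });
  }
  return points;
}
```

Swapping `thickness` for an opacity field would give the gradient plot variant, and replacing the PDF with 1 - F(z) would give the CCDF barplot.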
Last bit is being able to map color within slabs means given a data table roughly like this:
| dist | theta |
|---|---|
| normal | [0,1] |
| student_t | [3,0,1] |
You can do stuff like:
x -> x position
dist -> y position
f_{dist}(x|theta) -> thickness
|x| < 1.5 -> fill color
Which yields something like this:
Anyway, I don't have specific suggestions for how these abstract specifications turn into syntax necessarily. What I did with slabinterval doesn't look exactly like the above abstract syntax, but I have found it helpful for thinking more formally about these visualization types.
@mjskay -- Your comment is definitely very useful.
When we work more on this, we'll have to see how this interplays with the `offset` channel that we plan to add (#4703).
That's a good point --- having a different channel for thickness (rather than x/y) was partly motivated by how dodging works in ggplot (which is what `offset` is for in vega-lite?) because it makes it easy to do stuff like this:
which is pretty common when visualizing estimates from groups/subgroups
Although there is no dedicated mark for this yet, I noticed that #5066 has been implemented, so is it possible to manually map the area width/height to the density value instead of dedicating one of the axes to this? I would like to make a plot where the y-axis is categorical with one density per y-value and then also facet this plot, so I can't use the trick in the Altair gallery where the facets essentially replace the y-axis. Like the boxplot below, but with violins/ridges/densities:
For now I am using a binned point mark with the size set to count to approximate a stepwise distribution, which looks pretty cool but is not very formal =) At least it captures multimodality better than a box plot.
I am planning to use VL/Altair for a course I will be teaching several months from now where we will need to create violin plots. Since it was mentioned in #4384 that density visualization shortcuts might see some development after the interactions were revamped, I just wanted to check in on whether there has been any internal discussion around where on the roadmap adding violin plots might fit in. I am really looking forward to having this together with the new offset channel, which is already going to be super helpful on its own. Thanks for continuously working on improving VL!
You're very welcome. I'm excited to hear that you are planning a course with Vega-Lite/Altair. Are you using https://github.com/uwdata/visualization-curriculum?
Density visualizations were the next big thing I wanted to work on for Vega-Lite but I didn't get to it so there is no planned release date.
Thanks for the update! Yes I will be mixing from that and a few other courses I have developed previously. This one is going to have more emphasis on comparing distributions for many categories and I am hoping to include options that address the shortcomings of boxplots. Maybe I will try to create something via density plots via faceting, or compute KDEs via Python and use that together with the new offset channel to lay out points as violins, but there will likely be a fair bit of starter code that makes it less intuitive than what `mark_violin` would.
Edit: Added an example in https://github.com/vega/vega-lite/issues/8067 of how this can be achieved for density clouds in Altair and Vega (but not yet Vega-Lite)
Totally agree. Great to hear that you have ideas for workarounds for now, though.
FWIW, we have a violin plot example in https://observablehq.com/@vega/vega-lite-distribution-plots
Thanks for adding to this issue, I can add some more thoughts myself too. I think one of the great advantages with a dedicated `violin` mark would be the ability to use it to easily compare and dissect multiple distributions within the same chart with a relatively simple spec without any transforms, and that is compatible with categorical axes, facets, offsets, and coloring. Something like this:
{
"config": {"view": {"continuousWidth": 300}},
"data": {
"url": "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv"
},
"mark": "violin",
"encoding": {
"x": {"field": "tip", "type": "quantitative"},
"y": {"field": "time"},
"color": {"field": "smoker"},
"yOffset": {"field": "smoker"},
"row": {"field": "sex"}
}
}
Which works for the `boxplot` mark and creates this useful visualization:
If I understand correctly, one of the main issues is that the area of the density currently needs a dedicated axis for its height. Is it possible that violinplots could piggyback on the same mechanism/graphical channel that boxplots are using to define the height of the box and use it for the height of the density/violin area?
> Is it possible that violinplots could piggyback on the same mechanism/graphical channel that boxplots are using to define the height of the box and use it for the height of the density/violin area?
I guess they could. I hadn't thought of this idea since I thought we should have an axis to tell us what the height of the violin means, but I realize now that's not the case.
Hi, just a small bump to check if there has been any progress on this? Thanks,