Support violin plot and probability density plots

Open kanitw opened this issue 6 years ago • 33 comments

From https://vega.github.io/vega/examples/violin-plot/

A violin plot visualizes a distribution of quantitative values as a continuous approximation of the probability density function, computed using kernel density estimation (KDE). The densities are additionally annotated with the median value and interquartile range, shown as black lines. Violin plots can be more informative than classical box plots.
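
As a rough illustration of the kernel density estimation step mentioned above, here is a minimal sketch of a Gaussian KDE sampled over a fixed extent. The names `gaussianKernel` and `kde` are made up for this example, and the bandwidth is passed in explicitly rather than estimated, so this is not a description of how Vega computes it.

```typescript
// Standard normal probability density, used as the smoothing kernel.
function gaussianKernel(u: number): number {
  return Math.exp(-0.5 * u * u) / Math.sqrt(2 * Math.PI);
}

// Evaluate a Gaussian KDE at `steps` evenly spaced points across `extent`.
// Assumption: `bandwidth` is supplied by the caller (no automatic heuristic).
function kde(
  values: number[],
  bandwidth: number,
  extent: [number, number],
  steps: number
): Array<{value: number; density: number}> {
  const lo = extent[0];
  const hi = extent[1];
  const out: Array<{value: number; density: number}> = [];
  for (let i = 0; i < steps; i++) {
    // Sample position; a single step collapses to the lower bound.
    const x = steps === 1 ? lo : lo + ((hi - lo) * i) / (steps - 1);
    let sum = 0;
    for (const v of values) {
      sum += gaussianKernel((x - v) / bandwidth);
    }
    // Dividing by n * bandwidth makes the curve integrate to 1 over the real line.
    out.push({value: x, density: sum / (values.length * bandwidth)});
  }
  return out;
}
```

Mirroring the sampled densities around a center line gives the violin outline; plotting them against a zero baseline gives a plain density plot.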

https://vega.github.io/vega/examples/probability-density/ is another related example

  • [ ] Understand https://vega.github.io/vega/examples/violin-plot/ and https://vega.github.io/vega/examples/probability-density/ examples thoroughly, search online to understand other violin and density plot variants, and define the scope that we want to support.

  • [ ] Understand how we implement composite marks thoroughly by looking at the [box-plot codebase](https://github.com/vega/vega-lite/blob/master/src/compositemark/boxplot.ts). (By summer, we should have a reasonable error-bar example as well.)

  • [ ] Design density transform in Vega-Lite and see if we can already use area mark to reproduce the density area for violin.

  • [ ] Design composite mark syntax for violin (and density plot?)

    • [ ] First, we can focus on just the violin area part: design a MarkDefinition block for Violin so that we can define the properties of the underlying density transform and other related properties.
    • [ ] Decide if we need a composite mark for density plots (probably yes), and make sure that the syntax for violin and density is consistent. (Also consider whether there is a better name for density.)
    • [ ] For the violin plot, we need to decide if we want to include the interquartile range and median as part of the violin composite mark (which is sort of like the "box" overlay on top of the violin plot). The syntax here should be very consistent with box-plot.
  • [ ] Implement the code. Note that there is probably a good way to share at least some part of the implementation between the violin and density plot.

kanitw avatar Mar 03 '18 18:03 kanitw

The tricky part about this is that Vega's violin plot depends on the Vega facet operator to split data into subgroups before passing them to the density transform. (Density happens inside a nested facet.)

  1. Consider the solution above that suggests implementing the density transform first.
    Given that VL's facet also always applies layout, we can't reproduce the violin example with an axis by implementing density as a transform unless we do one of the following:

    a) Make Vega's density transform support groupby (which is basically in-place faceting)
    b) Support a variant of facet without layout (pure facet in the data transformation sense)

    Note: we can reproduce the violin plot using the VL facet operator, but we would then rely on row instead of y position for each violin.

  2. Alternatively, we could consider implementing violin as its own special mark that produces the underlying density transform. However, this approach would be less composable. (For example, density plots such as https://vega.github.io/vega/examples/probability-density/ shouldn't be their own mark but should rather use an area mark to plot the output of density transforms.)
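
To make option (a) concrete, here is a small sketch of what "in-place faceting" could mean for a density transform: partition the rows by a key, estimate one density per partition, and emit a single flat table tagged with the group. The function names, the row shape, and the fixed Gaussian kernel are assumptions for illustration only, not Vega's implementation.

```typescript
type DataRow = {[field: string]: string | number};
type DensityRow = {group: string; value: number; density: number};

// Partition the values of `valueField` by the value of `groupField`.
function partitionByGroup(
  rows: DataRow[],
  groupField: string,
  valueField: string
): {[key: string]: number[]} {
  const groups: {[key: string]: number[]} = {};
  for (const row of rows) {
    const key = String(row[groupField]);
    if (!Object.prototype.hasOwnProperty.call(groups, key)) {
      groups[key] = [];
    }
    groups[key].push(Number(row[valueField]));
  }
  return groups;
}

// Run one Gaussian KDE per group over a shared extent and return one flat table.
// No layout is involved: the group key is just another output column.
function densityByGroup(
  rows: DataRow[],
  groupField: string,
  valueField: string,
  bandwidth: number,
  extent: [number, number],
  steps: number
): DensityRow[] {
  const groups = partitionByGroup(rows, groupField, valueField);
  const out: DensityRow[] = [];
  for (const key of Object.keys(groups)) {
    const values = groups[key];
    for (let i = 0; i < steps; i++) {
      const x = extent[0] + ((extent[1] - extent[0]) * i) / Math.max(steps - 1, 1);
      let sum = 0;
      for (const v of values) {
        const u = (x - v) / bandwidth;
        sum += Math.exp(-0.5 * u * u) / Math.sqrt(2 * Math.PI);
      }
      out.push({group: key, value: x, density: sum / (values.length * bandwidth)});
    }
  }
  return out;
}
```

Because the output is an ordinary table, the group column can then be mapped to a y position (or color) instead of a facet row.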

kanitw avatar May 24 '18 02:05 kanitw

We met today to talk about this and concluded that we should make the Vega density transform support groupby.

kanitw avatar May 25 '18 22:05 kanitw

> We met today to talk about this and concluded that we should make the Vega density transform support groupby.

@kanitw Any progress on implementation?

HarvsG avatar Dec 02 '18 20:12 HarvsG

No update yet

kanitw avatar Dec 02 '18 21:12 kanitw

Thanks for Vega-lite.

I often use violin plots and I am looking forward to using them in Vega-Lite/Altair.

In addition, I use a lot of ridge plots (half violin) like this one:

mcmc_areas-rstanarm

Would you consider adding an option to the violin plot to allow similar figures to be made?

I made an implementation in Python, with an area mark and a custom KDE function, but it is rather tedious.

Also, would similar figures using histograms be possible (for discrete variables)?

I'm sure anyone using Bayesian statistics would be grateful.

romainmartinez avatar Mar 27 '19 14:03 romainmartinez

Yes, once we have a kde transform in Vega, we can also support ridge plots.

domoritz avatar Mar 27 '19 14:03 domoritz

> Yes, once we have a kde transform in Vega, we can also support ridge plots.

Has it already landed in Vega 5.0 (https://vega.github.io/vega/docs/transforms/density/)?

denisshepelin avatar Mar 28 '19 09:03 denisshepelin

We've had this transform for a while but it does not support faceting and that's a deal breaker. We've come to the conclusion that we need a kde transform that has a group by key.

domoritz avatar Mar 29 '19 02:03 domoritz

Depends on https://github.com/vega/vega/pull/1783

domoritz avatar Apr 25 '19 22:04 domoritz

Once the new Vega KDE support lands, I think the first step here is probably to add a new density transform to Vega-Lite that maps to the Vega kde transform, with syntax such as:

{
  density: string; // value field to estimate density for
  groupby?: string[];
  method?: 'pdf' | 'cdf';
  extent?: [number, number];
  bandwidth?: number;
  steps?: number;
  as?: [string, string]
}

I think it should be called density rather than kde, as (1) density is a proper word, not an abbreviation, and (2) I can imagine extending the implementation in the future to fit a normal density (or log-normal, or Poisson, etc) to the input data, not just a kernel density estimate.

jheer avatar Apr 26 '19 01:04 jheer

Maybe `method?: 'pdf' | 'cdf';` -> `cumulative?: boolean`. `as` should not be optional in Vega-Lite.
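
For reference, here is a sketch of the proposed syntax with both suggestions applied (`cumulative` instead of `method`, and a required `as`). The interface name `DensityTransformSketch`, the field names `tip` and `time`, and the surrounding spec object are placeholders for illustration; none of this is meant as the final Vega-Lite API.

```typescript
// Hypothetical shape of the density transform discussed in this thread.
interface DensityTransformSketch {
  density: string; // value field to estimate density for
  groupby?: string[]; // one density per distinct combination of these fields
  cumulative?: boolean; // false: density (PDF), true: cumulative distribution (CDF)
  extent?: [number, number];
  bandwidth?: number;
  steps?: number;
  as: [string, string]; // output fields: [sample value, density estimate]
}

// Illustrative transform: one density of `tip` per `time` group.
const densityStep: DensityTransformSketch = {
  density: "tip",
  groupby: ["time"],
  cumulative: false,
  as: ["value", "density"]
};

// Illustrative spec fragment that draws the transform output with an area mark.
const densitySpecSketch = {
  transform: [densityStep],
  mark: "area",
  encoding: {
    x: {field: densityStep.as[0], type: "quantitative"},
    y: {field: densityStep.as[1], type: "quantitative"},
    row: {field: "time", type: "nominal"}
  }
};
```

Keeping the output field names in `as` explicit means the encodings can refer to them without relying on implicit defaults.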

domoritz avatar Apr 26 '19 02:04 domoritz

@domoritz I definitely prefer your suggestion of cumulative?: boolean.

Also, when adding violin plots we may want to support multiple scaling options. The default (at present) is that all violins share the same scale based on the sampled density estimates, which of course was a primary motivation for adding the kde transform with groupby support in Vega. We may still also want to support other forms of scaling or normalization.

The reason I'm thinking about this is that, if an explicit bandwidth parameter is not applied, each group will have its bandwidth independently set using an estimation heuristic. This means that each plot has a different kernel width, which in turn means that one could have potentially large disparities in how much of the probability mass gets "clipped" when drawing violins only over the domain of observed data values. The tails of the KDE distribution get cut off, such that the total amount of probability mass shown in each violin is unequal. (This issue can still arise with a shared bandwidth parameter; it's just not as extreme.) It may be that the "right" thing to do is add a normalization pass in the KDE transform whenever we have more than one group.
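
To put a number on the clipping effect, here is a rough sketch that computes how much of a Gaussian KDE's probability mass falls inside the range of the observed data. It assumes a Gaussian kernel and uses a standard polynomial approximation of the error function; the names `erfApprox`, `normalCdf` and `retainedMass` are made up for this example.

```typescript
// Abramowitz-Stegun 7.1.26 approximation of erf (absolute error around 1.5e-7).
function erfApprox(x: number): number {
  const sign = x < 0 ? -1 : 1;
  const ax = Math.abs(x);
  const t = 1 / (1 + 0.3275911 * ax);
  const poly =
    ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t - 0.284496736) * t +
      0.254829592) * t;
  return sign * (1 - poly * Math.exp(-ax * ax));
}

// Standard normal cumulative distribution function.
function normalCdf(z: number): number {
  return 0.5 * (1 + erfApprox(z / Math.SQRT2));
}

// Fraction of a Gaussian KDE's total mass that lies inside [min(values), max(values)].
// Each data point contributes a normal bump with standard deviation `bandwidth`.
function retainedMass(values: number[], bandwidth: number): number {
  const lo = Math.min(...values);
  const hi = Math.max(...values);
  let total = 0;
  for (const v of values) {
    total += normalCdf((hi - v) / bandwidth) - normalCdf((lo - v) / bandwidth);
  }
  return total / values.length;
}
```

For the toy sample `[0, 1, 2]`, a bandwidth of 1 keeps roughly 55% of the mass inside the data range, while a very small bandwidth keeps about two thirds, so two groups with different bandwidths would show visibly different total areas after trimming.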

So, I think we might need to do some additional research into the "proper" scaling and trimming of violins. I don't know how carefully other tools have looked at this!

jheer avatar May 03 '19 17:05 jheer

The ggplot violin options page shows that these questions are largely left to end users, with the default being the same as proposed above (without normalization of trimmed density areas):

From https://ggplot2.tidyverse.org/reference/geom_violin.html:

  • trim | If TRUE (default), trim the tails of the violins to the range of the data. If FALSE, don't trim the tails.
  • scale | if "area" (default), all violins have the same area (before trimming the tails). If "count", areas are scaled proportionally to the number of observations. If "width", all violins have the same maximum width.

Note that Vega currently supports options corresponding to ggplot's area and width values for the scale parameter, based on how we configure the scale domain. Our KDE implementation normalizes (divides by the number of data points) to form a proper PDF, so we could support a count option (if desired) by multiplying the estimated density by the count of points within a group. If that is of interest we could update the kde transform accordingly.
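
As a sketch of how the three ggplot-style options relate to each other, the function below rescales per-group density samples directly. This is only an illustration: the type and function names are invented here, and in Vega the area and width behaviors would come from how the scale domain is configured rather than from a function like this.

```typescript
type ViolinScaleMode = "area" | "count" | "width";

interface GroupDensity {
  group: string;
  count: number; // number of observations in the group
  density: number[]; // PDF samples for the group
}

// Rescale each group's density samples according to the chosen mode.
function scaleDensities(groups: GroupDensity[], mode: ViolinScaleMode): GroupDensity[] {
  return groups.map(g => {
    let factor = 1; // "area": keep the PDF as is, so every group has the same area
    if (mode === "count") {
      factor = g.count; // area becomes proportional to the number of observations
    } else if (mode === "width") {
      const peak = Math.max(...g.density);
      factor = peak > 0 ? 1 / peak : 1; // every group reaches the same maximum width
    }
    return {
      group: g.group,
      count: g.count,
      density: g.density.map(d => d * factor)
    };
  });
}
```

With the "count" mode, a group with three times as many points as another ends up with three times the area, which matches the multiplication by group count described above.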

jheer avatar May 03 '19 17:05 jheer

@jheer said about implementing violin plots with the new KDE transform in Vega:

> The issue is not one of performance or extra transforms, but of correctness. (FWIW, I'd want to avoid a "density-center" option, as that strikes me as confusing and an abstraction-level violation.) The previous Vega violin plot example used stack, and it worked because all densities were scaled independently and so used the full width/height of the scale band. But this independent scaling is misleading and hampers accurate comparison.
>
> The new KDE transform supports groupby, so we can use the output to define the domain of a scale at the top-level, which then scales all the densities in a proper fashion. The result is that different densities have different max width/height. Yet, the stack transform center option only centers the mark relative to the observed height (not the max height among all densities), causing inappropriate, non-uniform center-line offsets for the different densities.
>
> My solution in Vega is to instead use xc/width or yc/height for the violin densities (as well as using xc or yc for the median and IQR annotations). This is simple and correct. A top-level linear scale is used to provide the width / height values.
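
The center-plus-width idea from the quote can be sketched as follows: a single linear scale, built from the maximum density across all groups, provides widths, and every slice is positioned symmetrically around its group's center. The helper names are hypothetical and the pixel values are arbitrary.

```typescript
interface ViolinSlice {
  center: number; // center line of the violin (xc)
  width: number; // full width of the slice
  x0: number; // left edge
  x1: number; // right edge
}

// Linear scale from [0, max density over ALL groups] to [0, bandWidth].
function sharedWidthScale(allDensities: number[][], bandWidth: number): (d: number) => number {
  let max = 0;
  for (const densities of allDensities) {
    for (const d of densities) {
      max = Math.max(max, d);
    }
  }
  return (d: number) => (max > 0 ? (d / max) * bandWidth : 0);
}

// Position each density sample symmetrically around a fixed center line.
function violinSlices(
  densities: number[],
  center: number,
  scale: (d: number) => number
): ViolinSlice[] {
  return densities.map(d => {
    const width = scale(d);
    return {center: center, width: width, x0: center - width / 2, x1: center + width / 2};
  });
}
```

Since the center is given explicitly, a group whose peak density is lower simply produces a narrower violin on the same center line, instead of being offset as it would be when centering relative to its own maximum.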

domoritz avatar May 18 '19 23:05 domoritz

Btw, I ran into a "split violin plot" in seaborn. It's definitely worth considering how this fits into our grammar.

image

kanitw avatar May 30 '19 20:05 kanitw

Interesting! An alternative that might be a bit better perceptually could be to directly layer (overlay) the conditional violins (or zero-baseline distribution areas) with some opacity. That would make the value and shape comparisons even more apparent. I hope new VL extensions can also support that, which should hopefully be simpler to specify (or, at least, require less new surface area).

jheer avatar May 30 '19 21:05 jheer

Ridge plots are another alternative for this kind of thing and often work well.

There's a good package for ggplot for generating them.

cmcaine avatar Jul 25 '19 11:07 cmcaine

Looks like ridge plots are supported now (the density transform can groupby). I haven't figured out how to pull off violin plots yet, though.

SamWoolerton avatar Jan 08 '20 23:01 SamWoolerton

@domoritz said my comments were welcome so here you go. Do tell me if this is off topic :)

Basically my feeling about a lot of uncertainty vis these days is you break it into (1) a representation of a distribution (be it analytical or empirical) as a PDF (f(x)), CDF (F(x)), and inverse CDF (F^-1(x)); and (2) mappings of those functions onto visual channels.
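
As a small illustration of part (1) for the empirical case, here is a sketch of an empirical CDF and its inverse built from a sample. The function names are made up, and the inverse uses the simplest "smallest value whose CDF reaches p" convention; other quantile definitions exist.

```typescript
// Empirical CDF: fraction of sample values less than or equal to x.
function ecdf(sample: number[]): (x: number) => number {
  const sorted = sample.slice().sort((a, b) => a - b);
  return (x: number) => {
    let count = 0;
    for (const v of sorted) {
      if (v <= x) count++;
    }
    return count / sorted.length;
  };
}

// Inverse empirical CDF: smallest sample value v such that F(v) >= p.
function inverseEcdf(sample: number[]): (p: number) => number {
  const sorted = sample.slice().sort((a, b) => a - b);
  return (p: number) => {
    const rank = Math.ceil(p * sorted.length) - 1;
    const idx = Math.min(sorted.length - 1, Math.max(0, rank));
    return sorted[idx];
  };
}
```

The median and interquartile range used to annotate a violin are just the inverse CDF evaluated at 0.5, 0.25 and 0.75.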

Then the question is, is there a mark/geom (probably closest is area in vega-lite, though it might not be quite the right one---can you map a continuous variable onto color in an area?) that lets you use those mappings to create densities, violins, gradient plots, CDF barplots, etc. FWIW, I made a "slab" geom for doing this in tidybayes on top of ggplot (and a composite "slabinterval", which is a slab combined with an interval). All of the geoms below (except the dotplots) are just shortcuts for different variants of the underlying slab+interval geom:

image

It's a bit different from how area works in either ggplot or vega-lite in that, because it is not intended for stacking, it does not use the "y" aesthetic/channel for the height of the slab; rather it uses "thickness" (or I suppose you could call it "width" but that already has another meaning in ggplot). This allows you to map a different variable to the y axis to easily create ridge plots / half-eye densities / etc where you would normally use intervals, without having to screw around with creating facets (this is incredibly useful for visualizing coefficients and the like, because creating facets just for coefficients is a pain --- you have to mess with header text angle usually --- plus often you want to facet over something else). It also allows color and opacity to vary within the geom, which is useful for creating gradient plots and for creating densities with highlighted regions.

Anyway the upshot is, if you think about abstract grammar-of-graphics mappings from data onto channels (so, not about the particular syntax of a given package, but a formal description of the visualization: "z -> x position" being the equivalent of aes(x = z) in ggplot or an encoding of {"x": "z"} in vega-lite), you might have a density plot for a variable z described as something like this:

z -> x position
f(z) -> thickness

or a gradient plot described as:

z -> x position
f(z) -> opacity

or a CCDF barplot described as:

z -> x position
1 - F(z) -> thickness

If you then add in the ability to do densities / CDFs / etc of analytical distributions (which is what the stat_dist_slab geom does), you can do the equivalent of:

z -> x position
f_Normal(z|mu, sigma) -> thickness

Which is how you'd do a density plot for a normal distribution. Given an implementation of the Normal and the scaled-and-shifted t distribution you'd be able to do confidence distributions for a lot of common ways of summarizing uncertainty from frequentist models (so that gets you, basically, halfeyes / gradient plots / whatever else for visualizing uncertainty).
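
A minimal sketch of that last mapping, assuming a plain sampled representation: evaluate the Normal density on a grid and store it as a thickness value next to the x position. The names `normalPdf`, `SlabPoint` and `normalSlab` are invented here and do not correspond to any tidybayes or Vega-Lite API.

```typescript
// Density of a Normal(mu, sigma) distribution at z.
function normalPdf(z: number, mu: number, sigma: number): number {
  const u = (z - mu) / sigma;
  return Math.exp(-0.5 * u * u) / (sigma * Math.sqrt(2 * Math.PI));
}

interface SlabPoint {
  x: number; // z -> x position
  thickness: number; // f_Normal(z | mu, sigma) -> thickness
}

// Sample the analytical density on an evenly spaced grid over `extent`.
function normalSlab(
  mu: number,
  sigma: number,
  extent: [number, number],
  steps: number
): SlabPoint[] {
  const points: SlabPoint[] = [];
  for (let i = 0; i < steps; i++) {
    const x = extent[0] + ((extent[1] - extent[0]) * i) / Math.max(steps - 1, 1);
    points.push({x: x, thickness: normalPdf(x, mu, sigma)});
  }
  return points;
}
```

Swapping `thickness` for an opacity value, or the density for 1 - F(z), would give the gradient plot and CCDF variants listed above from the same sampled table.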

Last bit is being able to map color within slabs means given a data table roughly like this:

| dist | theta |
| --- | --- |
| normal | [0,1] |
| student_t | [3,0,1] |

You can do stuff like:

x -> x position
dist -> y position
f_{dist}(x|theta) -> thickness
|x| < 1.5 -> fill color

Which yields something like this:

image

Anyway, I don't have specific suggestions for how these abstract specifications turn into syntax necessarily. What I did with slabinterval doesn't look exactly like the above abstract syntax, but I have found it helpful for thinking more formally about these visualization types.

mjskay avatar Apr 30 '20 02:04 mjskay

@mjskay -- Your comment is definitely very useful.

When we work more on this, we'll have to see how this interplays with the offset channel that we plan to add (#4703).

kanitw avatar Apr 30 '20 18:04 kanitw

That's a good point --- having a different channel for thickness (rather than x/y) was partly motivated by how dodging works in ggplot (which is what offset is for in vega-lite?) because it makes it easy to do stuff like this:

image

which is pretty common when visualizing estimates from groups/subgroups

mjskay avatar Apr 30 '20 19:04 mjskay

Although there is no dedicated mark for this yet, I noticed that #5066 has been implemented, so is it possible to manually map the area width/height to the density value instead of dedicating one of the axes to this? I would like to make a plot where the y-axis is categorical with one density per y-value and then also facet this plot, so I can't use the trick in the Altair gallery where the facets essentially replace the y-axis. Like the boxplot below, but with violins/ridges/densities:

image

For now I am using a binned point mark with the size set to count to approximate a stepwise distribution, which looks pretty cool but is not very formal =) At least it captures multimodality better than a box plot.

image

joelostblom avatar Oct 18 '20 15:10 joelostblom

I am planning to use VL/Altair for a course I will be teaching several months from now where we will need to create violin plots. Since it was mentioned in #4384 that density visualization shortcuts might see some development after the interactions were revamped, I just wanted to check in on whether there has been any internal discussion around where on the roadmap adding violin plots might fit in. I am really looking forward to having this together with the new offset channel, which is already going to be super helpful on its own. Thanks for continuously working on improving VL!

joelostblom avatar Nov 12 '21 16:11 joelostblom

You're very welcome. I'm excited to hear that you are planning a course with Vega-Lite/Altair. Are you using https://github.com/uwdata/visualization-curriculum?

Density visualizations were the next big thing I wanted to work on for Vega-Lite but I didn't get to it so there is no planned release date.

domoritz avatar Nov 12 '21 19:11 domoritz

Thanks for the update! Yes, I will be mixing from that and a few other courses I have developed previously. This one is going to have more emphasis on comparing distributions for many categories, and I am hoping to include options that address the shortcomings of boxplots. Maybe I will try to create something with density plots via faceting, or compute KDEs via Python and use that together with the new offset channel to lay out points as violins, but there will likely be a fair bit of starter code that makes it less intuitive than what mark_violin would be.

Edit: Added an example in https://github.com/vega/vega-lite/issues/8067 of how this can be achieved for density clouds in Altair and Vega (but not yet Vega-Lite)

joelostblom avatar Nov 12 '21 21:11 joelostblom

Totally agree. Great to hear that you have ideas for workarounds for now, though.

domoritz avatar Nov 12 '21 22:11 domoritz

FWIW, we have a violin plot example in https://observablehq.com/@vega/vega-lite-distribution-plots

kanitw avatar Apr 07 '22 05:04 kanitw

Thanks for adding to this issue, I can add some more thoughts myself too. I think one of the great advantages with a dedicated violin mark would be the ability to use it to easily compare and dissect multiple distributions within the same chart with a relatively simple spec without any transforms, and that is compatible with categorical axes, facets, offsets, and coloring. Something like this:

{
  "config": {"view": {"continuousWidth": 300}},
  "data": {
    "url": "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv"
  },
  "mark": "violin",
  "encoding": {
    "x": {"field": "tip", "type": "quantitative"},
    "y": {"field": "time"},
    "color": {"field": "smoker"},
    "yOffset": {"field": "smoker"},
    "row": {"field": "sex"}
  }
}

Which works for the boxplot mark and creates this useful visualization:

image

Open the Chart in the Vega Editor

If I understand correctly, one of the main issues is that the area of the density currently needs a dedicated axis for its height. Is it possible that violinplots could piggyback on the same mechanism/graphical channel that boxplots are using to define the height of the box and use it for the height of the density/violin area?

joelostblom avatar Apr 07 '22 06:04 joelostblom

> Is it possible that violinplots could piggyback on the same mechanism/graphical channel that boxplots are using to define the height of the box and use it for the height of the density/violin area?

I guess they could. I hadn't thought of this idea since I thought we should have an axis to tell us what the height of the violin means, but I realize now that's not the case.

domoritz avatar Apr 13 '22 13:04 domoritz

Hi, just a small bump to check if there has been any progress on this? Thanks,

apraga avatar Feb 18 '23 15:02 apraga