vega-lite

Composite Histogram Mark

Open kanitw opened this issue 6 years ago • 21 comments

The fact that Vega-Lite can express a histogram with building blocks like bin and aggregate is very powerful.

However, for quick EDA, just saying histogram is more convenient, much as ggplot2 provides geom_histogram().

We could imagine the following syntax as a shorthand for the full histogram example

{
  "mark": "histogram",
  "encoding": {"x": {"field": "IMDB_Rating", "type": "quantitative"}} 
}
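For reference, here is a minimal sketch of what such a shorthand could expand to, written as a Python function over plain dicts. The name `expand_histogram` is hypothetical and not part of Vega-Lite or Altair; the output follows the existing bin-plus-aggregate histogram pattern, and the sketch assumes a 1D spec with exactly one positional channel.

```python
import copy
import json


def expand_histogram(spec):
    """Expand a hypothetical {"mark": "histogram"} spec into bin + aggregate.

    Assumes a 1D spec with exactly one positional channel (x or y).
    """
    out = copy.deepcopy(spec)
    encoding = out["encoding"]
    given = [ch for ch in ("x", "y") if ch in encoding]
    if out.get("mark") != "histogram" or len(given) != 1:
        raise ValueError("expected a 1D histogram spec")
    binned = given[0]
    other = "y" if binned == "x" else "x"
    out["mark"] = "bar"
    # Keep a user-supplied bin definition; otherwise use default binning.
    encoding[binned].setdefault("bin", True)
    # The free positional channel carries the number of records per bin.
    encoding[other] = {"aggregate": "count", "type": "quantitative"}
    return out


shorthand = {
    "mark": "histogram",
    "encoding": {"x": {"field": "IMDB_Rating", "type": "quantitative"}},
}
print(json.dumps(expand_histogram(shorthand), indent=2))
```

Because the binned channel is detected rather than hard-coded, the same sketch would produce a horizontal histogram when only y is given.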

This could be extremely handy in Altair as we can imagine:

alt.Chart(movies).mark_histogram().encode(
    x='IMDB_Rating'
)

We can similarly add color to make a stacked histogram.

The following could also serve as a shorthand for the circle_binned example (a binned scatterplot, i.e. a 2D histogram).

{
  "mark": "histogram",
  "encoding": {
    "x": {"field": "IMDB_Rating", "type": "quantitative"}, 
    "y": {"field": "Rotten_Tomatoes_Rating", "type": "quantitative"}
  } 
}

kanitw avatar Dec 20 '18 22:12 kanitw

The problem with the second example is that it's not clear whether we should use circles or color to encode the count.

domoritz avatar Dec 21 '18 01:12 domoritz

Yep, but we can make it configurable.

kanitw avatar Dec 21 '18 01:12 kanitw

This would also work great in VegaLite.jl:

data |> @vlplot(:histogram, x=:IMDB_Rating)

would be enough to create a histogram, and I think I would not even have to update anything on the julia side to enable this.

davidanthoff avatar Dec 21 '18 01:12 davidanthoff

For customizing 2D plots, we need a good property name.

bivariateMark: "rect" | "circle" | "square" is one option.

We also need to determine what should be the default.

Circle + size is better for comparing the count, but can be tricky to handle if the number of bins on x and y do not match.
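To make the trade-off concrete, here is a rough sketch of how a `bivariateMark` option could drive the 2D expansion. Both `expand_histogram_2d` and the `COUNT_CHANNEL` mapping are hypothetical names for illustration; the assumption is that circle and square carry the count on size while rect carries it on color.

```python
import copy

# Hypothetical mapping from the proposed bivariateMark value to the channel
# that would carry the count. This is not an existing Vega-Lite option.
COUNT_CHANNEL = {"circle": "size", "square": "size", "rect": "color"}


def expand_histogram_2d(spec, bivariate_mark="circle"):
    """Expand a 2D {"mark": "histogram"} spec into a binned aggregate plot."""
    if bivariate_mark not in COUNT_CHANNEL:
        raise ValueError(f"unsupported bivariateMark: {bivariate_mark!r}")
    out = copy.deepcopy(spec)
    encoding = out["encoding"]
    # Bin both positional channels unless the user already configured binning.
    for channel in ("x", "y"):
        encoding[channel].setdefault("bin", True)
    # Encode the per-cell record count on size or color, depending on the mark.
    encoding[COUNT_CHANNEL[bivariate_mark]] = {
        "aggregate": "count",
        "type": "quantitative",
    }
    out["mark"] = bivariate_mark
    return out


spec_2d = {
    "mark": "histogram",
    "encoding": {
        "x": {"field": "IMDB_Rating", "type": "quantitative"},
        "y": {"field": "Rotten_Tomatoes_Rating", "type": "quantitative"},
    },
}
circles = expand_histogram_2d(spec_2d)            # count on size
heatmap = expand_histogram_2d(spec_2d, "rect")    # count on color
```

Whichever value ends up as the default, only the mark and the count channel change; the binning of x and y is identical in both variants.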

kanitw avatar Dec 21 '18 02:12 kanitw

Circles are also better if you want to show the unfiltered count like in falcon.

domoritz avatar Dec 21 '18 06:12 domoritz

Isn't the 2d histogram rather a variant of heatmap or scatter plot rather than of a histogram?

g3o2 avatar Dec 21 '18 12:12 g3o2

In other words, maybe having both types under one hood may be a little too much magic?

g3o2 avatar Dec 21 '18 12:12 g3o2

> Isn't the 2d histogram rather a variant of heatmap or scatter plot rather than of a histogram?

Histograms and heatmaps are both instances of binned aggregation.
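A small sketch of that point in plain Python: the same counting routine yields a 1D histogram or a 2D grid depending only on how many fields are binned. The function name `binned_counts`, the short field names, and the sample rows are made up for illustration.

```python
from collections import Counter
import math


def binned_counts(rows, steps):
    """Count rows per bin.

    `steps` maps each field to its bin width. One field gives a histogram,
    two fields give the cells of a 2D histogram / heatmap.
    """
    counts = Counter()
    for row in rows:
        # Identify each bin by the lower edge of its interval, per field.
        key = tuple(math.floor(row[f] / s) * s for f, s in steps.items())
        counts[key] += 1
    return dict(counts)


# Made-up illustrative values, not taken from the movies dataset.
rows = [
    {"imdb": 6.1, "rt": 55},
    {"imdb": 6.8, "rt": 80},
    {"imdb": 7.4, "rt": 78},
    {"imdb": 7.9, "rt": 91},
]
hist_1d = binned_counts(rows, {"imdb": 1})
grid_2d = binned_counts(rows, {"imdb": 1, "rt": 25})
print(hist_1d)
print(grid_2d)
```

What differs between the two is only the visual encoding of the resulting counts, which is the part the rest of this thread argues about.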

domoritz avatar Dec 21 '18 16:12 domoritz

They are, yet, the way that each of them encodes this information visually is different.

Personally, I find the following two ideas confusing:

  • the histogram mark would require specifying only the x or y channel, while the other positional channel would be silently used to encode the frequency;
  • when the user specifies both positional channels, the histogram would be replaced by a binned scatterplot.

g3o2 avatar Dec 21 '18 18:12 g3o2

In both cases, the user intention is to create a histogram (either 1D or 2D), so I don't think this is confusing if you consider binned scatterplots / heatmaps as 2D histograms.

In fact, there are histogram2d() in numpy and hist2d() in pyplot.

Having both 1D / 2D histograms in the same composite mark can facilitate transition between the two.

kanitw avatar Dec 21 '18 18:12 kanitw

Some more benefits of this:

It would be very easy to add an option for a normalized histogram (which currently is really verbose to create). I guess it could just be an option on the mark?

It would be great if we could also add the equivalent density mark type, and then the combination of the two would make it super easy to create a combined histogram with a density line on top of it. It could work something like this:

{
  "encoding": {"x": {"field": "IMDB_Rating", "type": "quantitative"}},
  "layer": [{"mark": {"type": "histogram", "normalize": true}}, {"mark": "density"}]
}

For the Julia version this would (automatically) amount to data |> @vlplot(x=:IMDB_Rating) + @vlplot({:histogram, normalize=true}) + @vlplot(:density), which is starting to get competitive with ggplot syntax.

From the Julia side of things, this kind of issue is where we get the most unhappy feedback from users: very simple and common plots are too verbose (like, say, this example of a layered normalized histogram with a density on top). I hardly get any feedback that major types of plots are missing in general (well, apart from 3D...), or that some corner cases are too complicated; it is really these super common scenarios being so verbose that is the biggest barrier for users. I think the other example like this is the lack of a composite mark for a regression/loess line (another super common plot: a scatter with a line on top, super verbose right now with vega-lite because one has to deal with transforms etc.). I totally understand that your priorities here in the repo reflect other user groups as well; I just wanted to give an update on the situation on the Julia side. I wonder whether the situation is similar on the altair side of things? CC @jakevdp, the issue https://github.com/altair-viz/altair/issues/947 seems to suggest that maybe it would be appreciated there as well?

davidanthoff avatar Jun 02 '20 21:06 davidanthoff

I completely agree with everything you said. Quick charts would really help in data science scenarios. We’ll see how we can fit it into the next work sprint.

What’s a normalized histogram? Is that a chart where the area is 1?

domoritz avatar Jun 03 '20 02:06 domoritz

> What’s a normalized histogram? Is that a chart where the area is 1?

Yes, the heights of all the bars sum to 1, i.e. the y axis is a relative frequency. At that point one can nicely layer a histogram and a density because they use the same y axis.
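A quick numeric sketch of the two normalizations in play, using made-up counts. Relative frequencies make the bar heights sum to 1, while a density curve integrates to 1, so the two scales coincide only when the bin width is 1; for other widths the bars would also need dividing by the bin width. The function names below are illustrative, not from any library.

```python
def relative_frequencies(counts):
    """Bar heights that sum to 1 (share of records per bin)."""
    total = sum(counts)
    return [c / total for c in counts]


def density_heights(counts, bin_width):
    """Bar heights whose total area (height * bin_width) is 1."""
    total = sum(counts)
    return [c / (total * bin_width) for c in counts]


counts = [2, 5, 3]   # made-up bin counts
bin_width = 0.5      # made-up bin width

rel = relative_frequencies(counts)
dens = density_heights(counts, bin_width)

print(sum(rel))                          # heights add up to (approximately) 1
print(sum(h * bin_width for h in dens))  # area adds up to (approximately) 1
```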

I have to admit I don't really understand why folks are so keen to layer a density over a histogram, the two types of plots seem to show exactly the same information to me, but it is something I have seen users ask a lot for :)

davidanthoff avatar Jun 03 '20 03:06 davidanthoff

Let's think about the design. I think I'm fairly happy with the 1D case so maybe we do that one for now.

Standard histogram.

{
  "data": {"url": "movies.csv"},
  "mark": "histogram",
  "encoding": {
    "x": {"field": "IMDB_Rating", "type": "quantitative"}
  } 
}

Standard histogram with custom number of bins.

{
  "data": {"url": "movies.csv"},
  "mark": "histogram",
  "encoding": {
    "x": {
     "field": "IMDB_Rating", "type": "quantitative",
     "bin": {"maxbins": 20}
    }
  } 
}

I considered as an alternative that we could write

{
  "data": {"url": "movies.csv"},
  "mark": {
    "type": "histogram",
    "maxbins": 10
  },
  "encoding": {
    "x": {"field": "IMDB_Rating", "type": "quantitative"}
  } 
}

but the issue is that you won't be able to customize a 2D histogram.
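A small sketch of why that is limiting, assuming the mark-level `maxbins` were simply copied onto every binned positional channel. The helper name `apply_mark_maxbins` is hypothetical; with a single mark-level value, x and y necessarily end up with the same setting unless the user falls back to per-channel `bin`.

```python
import copy


def apply_mark_maxbins(spec):
    """Move a hypothetical mark-level "maxbins" onto the positional channels."""
    out = copy.deepcopy(spec)
    mark = out["mark"]
    maxbins = mark.pop("maxbins", None) if isinstance(mark, dict) else None
    if maxbins is not None:
        for channel in ("x", "y"):
            if channel in out["encoding"]:
                # A per-channel "bin" still wins; otherwise use the mark value.
                out["encoding"][channel].setdefault("bin", {"maxbins": maxbins})
    return out


spec_2d = {
    "mark": {"type": "histogram", "maxbins": 10},
    "encoding": {
        "x": {"field": "IMDB_Rating", "type": "quantitative"},
        "y": {"field": "Rotten_Tomatoes_Rating", "type": "quantitative"},
    },
}
resolved = apply_mark_maxbins(spec_2d)
print(resolved["encoding"]["x"]["bin"], resolved["encoding"]["y"]["bin"])
```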

Normalized histogram (cc @davidanthoff)

{
  "data": {"url": "movies.csv"},
  "mark": {
    "type": "histogram",
    "normalize": true
  },
  "encoding": {
    "x": {"field": "IMDB_Rating", "type": "quantitative"}
  } 
}
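For comparison, this is roughly what `"normalize": true` would have to expand to with today's building blocks, following the pattern of the relative-frequency histogram in the Vega-Lite example gallery: bin, count per bin, join the total back onto every row, then divide. The function name and the intermediate field names (`n`, `total`, `share`) are arbitrary choices for this sketch.

```python
import json


def normalized_histogram_spec(field, data_url):
    """Build a relative-frequency histogram spec from explicit transforms."""
    start, end = f"bin_{field}", f"bin_{field}_end"
    return {
        "data": {"url": data_url},
        "transform": [
            # 1. Bin the field, writing bin start/end into two new fields.
            {"bin": True, "field": field, "as": [start, end]},
            # 2. Count the records in each bin.
            {"aggregate": [{"op": "count", "as": "n"}], "groupby": [start, end]},
            # 3. Attach the grand total to every bin row.
            {"joinaggregate": [{"op": "sum", "field": "n", "as": "total"}]},
            # 4. Convert counts into shares that sum to 1.
            {"calculate": "datum.n / datum.total", "as": "share"},
        ],
        "mark": "bar",
        "encoding": {
            "x": {
                "field": start,
                "type": "quantitative",
                "bin": {"binned": True},
                "title": field,
            },
            "x2": {"field": end},
            "y": {"field": "share", "type": "quantitative"},
        },
    }


spec = normalized_histogram_spec("IMDB_Rating", "movies.csv")
print(json.dumps(spec, indent=2))
```

Four transforms plus a pre-binned x/x2 encoding for one flag's worth of intent is a fair summary of the verbosity complaint above.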

2D histogram (using size for count and circle marks)

Maybe we do this later since we are not sure about the design below yet.

{
  "data": {"url": "movies.csv"},
  "mark": "histogram",
  "encoding": {
    "x": {"field": "IMDB_Rating", "type": "quantitative"}, 
    "y": {"field": "Rotten_Tomatoes_Rating", "type": "quantitative"}
  } 
}

2D histogram with a custom mark

I'm not happy about the mark here so maybe we don't allow customization for now.

{
  "data": {"url": "movies.csv"},
  "mark": {
    "type": "histogram",
    "mark": "rect"
  },
  "encoding": {
    "x": {"field": "IMDB_Rating", "type": "quantitative"}, 
    "y": {"field": "Rotten_Tomatoes_Rating", "type": "quantitative"}
  } 
}

2D histogram with a custom count encoding

I'm not happy about the mark here so maybe we don't allow customization for now.

{
  "data": {"url": "movies.csv"},
  "mark": {
    "type": "histogram",
    "mark": "rect",
    "count_encoding": "color"  // bad name, I know...
  },
  "encoding": {
    "x": {"field": "IMDB_Rating", "type": "quantitative"}, 
    "y": {"field": "Rotten_Tomatoes_Rating", "type": "quantitative"}
  } 
}

Alternative (uses rect and color). I like this a lot better.

{
  "data": {"url": "movies.csv"},
  "mark": {
    "type": "heatmap"
  },
  "encoding": {
    "x": {"field": "IMDB_Rating", "type": "quantitative"}, 
    "y": {"field": "Rotten_Tomatoes_Rating", "type": "quantitative"}
  } 
}

domoritz avatar Jun 05 '20 16:06 domoritz

These all look very reasonable to me :)

I think an important aspect is that the API should be aligned with one for a density mark, so it might make sense to play that scenario through as well, just to make sure there is symmetry between the two?

davidanthoff avatar Jun 08 '20 00:06 davidanthoff

Thanks for working on this! I am excited to hear that more of the common statistical plots might become easier to make in vegalite/altair! Similar to what was mentioned above, I also receive many requests for shortcuts/quickmarks for these operations (when teaching altair), especially when showing altair side by side with ggplot, which has shortcut geoms for these. I think the addition of the boxplot mark was really great, and for future stat quick marks I think that densities and violins in particular would be great, since these are rather verbose to create, especially when there are multiple groupbys in the visualization (color, facet, etc).

joelostblom avatar Oct 18 '20 16:10 joelostblom

Thank you for the feedback. I agree that shortcuts for common density visualizations would be really helpful. We're currently focussing on a revamp of interactions and will then look more into this feature.

domoritz avatar Oct 18 '20 17:10 domoritz

Bump, just wondering whether there is any chance this might move forward? I think we're close to using ggplot2 for another paper because we can't figure out how to layer a histogram with a density plot in the same plot with vega-lite :( The tricky thing is that creating a normalized histogram is so difficult, which the above proposal would solve.

davidanthoff avatar Nov 23 '21 00:11 davidanthoff

Thanks for bumping the issue. I don't have any timeline but I will consider this as a potential project for one of my research assistants. Adding a label to help with that.

domoritz avatar Nov 23 '21 03:11 domoritz

> Bump, just wondering whether there is any chance this might move forward? I think we're close to using ggplot2 for another paper because we can't figure out how to layer a histogram with a density plot in the same plot with vega-lite :( The tricky thing is that creating a normalized histogram is so difficult, which the above proposal would solve.

As a very late reply to this: although normalizing a histogram is tedious, it is quite straightforward to "unnormalize" a density calculation with counts=True, which puts it on a similar scale as a histogram with the default bin count. One drawback is that you would need to scale it manually if you change the bin count of the histogram.
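A sketch of that workaround as a plain Vega-Lite spec built in Python: a count histogram layered with a density transform that has `counts` enabled. The multiplication by the bin step is one reading of "scale it manually" (smoothed counts are per unit of x, bar counts are per bin) and should be treated as an assumption, as should the function name and the `scaled` field.

```python
import json


def histogram_with_density(field, data_url, bin_step=1.0):
    """Layer a count histogram with a density line on the same count scale."""
    return {
        "data": {"url": data_url},
        "layer": [
            {
                # Ordinary count histogram with an explicit bin width.
                "mark": "bar",
                "encoding": {
                    "x": {
                        "field": field,
                        "type": "quantitative",
                        "bin": {"step": bin_step},
                    },
                    "y": {"aggregate": "count", "type": "quantitative"},
                },
            },
            {
                "transform": [
                    # counts=True yields smoothed counts instead of probabilities.
                    {"density": field, "counts": True, "as": ["value", "density"]},
                    # Assumed manual rescaling: match the per-bin counts of the bars.
                    {"calculate": f"datum.density * {bin_step}", "as": "scaled"},
                ],
                "mark": "line",
                "encoding": {
                    "x": {"field": "value", "type": "quantitative"},
                    "y": {"field": "scaled", "type": "quantitative"},
                },
            },
        ],
    }


spec = histogram_with_density("IMDB_Rating", "movies.csv", bin_step=0.5)
print(json.dumps(spec, indent=2))
```

With a bin step of 1 the rescaling is a no-op, which matches the observation that the unscaled version already lines up for the default binning of this field.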

joelostblom avatar May 29 '23 21:05 joelostblom

+1 for this feature.

giladturok avatar Jan 15 '24 17:01 giladturok