fix: add explicit option to control how densities are resolved
Will add more description soon, just seeing if tests pass first because my local installation doesn't work.
I got stuck trying to fix the issues with density area marks, so I thought I would summarize what I'm trying to do so that others can help:
I believe the recent issues with density marks stem from PR https://github.com/vega/vega-lite/pull/9018, which made stacking the new default for area marks. Unfortunately stacked area marks also perform an imputation to zero by default as described here https://github.com/vega/vega-lite/issues/5611#issuecomment-561039336 by @domoritz which leads to issues with stacked density as previously noted in https://github.com/vega/vega-lite/issues/6624 (this became more obvious after the PR mentioned above since it made stacked areas the default). Thanks @mattijn for pointing out that this imputation was happening in https://github.com/vega/vega/issues/3815.
The reason the imputation leads to jagged densities is because grouped densities value are defined only for data values existing within their own group. So when putting two grouped densities together on the same axis, but without exactly the same extent over which the density is defined will lead to a lot of imputation to zero and spiky appearance. In https://github.com/vega/vega-lite/pull/9106 @jonmmease fixed this by guaranteeing the same extent for all individual grouped densities. However this introduced a new issue, which is that densities are no longer cut off at the min/max values of the data within each group, but extends to the min/max values of the data within all groups leading to the appearance of values where there actually are none as shown in https://github.com/vega/vega/issues/3815.
For density area marks to work properly, we need these two things:
- Grouped density transforms should be limited to the extent of the data within each group to not avoid suggesting non-existing values.
- This can be fixed if we go back to setting
"resolve": "indepedent""in the density transform. If we do that, we need area marks used for densities to not be imputed to zero by default. I think the current zero imputation is not ideal for non-density areas either, sine it gives the impression of non-existing values as can be seen in . Maybe the suggestion in https://github.com/vega/vega-lite/issues/5611#issuecomment-561039488 from @jheer to use Vega'sxcchannel instead of the stack could help? - Another way of fixing this would be to keep
"resolve": "shared", but we trim the densities after. Note that we can't simply trim values where the density is 0, but need to trim all values outside the original data range for each group (because we need access to each group's min/max values, I believe these needs to happen inside the KDE transform as suggested in https://github.com/vega/vega/issues/3815).
- This can be fixed if we go back to setting
- By default area marks used for densities should not be stacked since it makes it harder to compare the distributions with each other.
- I'm unsure how to solve this in a way that is compatible with the desire to stack other area marks by default as per https://github.com/vega/vega-lite/pull/9018. We can't simply always turn off stacking for density marks since that would not work together with use cases when stacking is explicitly desired. Personally, I would be in favor of rolling back #9018 since I think comparing distributions via densities is more common than the use case described for that PR, but if there is a way to have both that's of course ideal.
My hunch would be that limiting the range of each grouped density in the KDE transform in Vega and rolling back #9018, would be the most straightforward way to fix this, but I'm very open to hearing other suggestions.
Here is another example of how the current behavior leads to unexpected outcomes. Take this spec that creates two densities which seem to cover the same data range:
Open the Chart in the Vega Editor
If we add another chart to the layer, such as a mean line (but really any chart), suddenly it becomes apparent that the densities don't at all cover the same region and the one to the right now appears cut off for no reason since it looked like the extent was the same in the previous chart:
Open the Chart in the Vega Editor
The reason seems to be that the layering is moving the transforms around, but this is not easy to understand and the inconsistent behavior is quite confusing.
Gentle ping on this @jheer, @domoritz, and @kanitw. What do you think is the best way forward to fix the density specs? The decision we take here will also have an impact on the violin plots we have been talking about.
My hunch would be that limiting the range of each grouped density in the KDE transform in Vega as suggested in https://github.com/vega/vega/issues/3815, and rolling back https://github.com/vega/vega-lite/pull/9018 would be the most straightforward way to fix this, but I'm very open to hearing other suggestions.
If densities should be handled differently from other area marks when it comes to stacking, I think it would make sense to introduce enough knobs to support handling them differently and then to introduce a higher-level mark type (which could even create the density transform as well). I know @jheer was advocating for that for a long time 😅 so I'd say let's use this opportunity to do that.
Imputation by default has issues but I still think it's a reasonable choice (as discussed in https://github.com/vega/vega-lite/issues/5611). Would introducing xc as a channel as an alternative to centering a stacked area mark help as well? It would make for a more canonical representation of densities (https://vega.github.io/vega/examples/violin-plot/), I think.
@yhoonkim and @kanitw since you introduced https://github.com/vega/vega-lite/pull/9018, what are your thoughts on the issue/reverting?
At this point, I am a bit lost between the discussion here and slack. Would it make sense to have a sync meeting where we can agree on a plan with @kanitw @yhoonkim @joelostblom @jheer and @domoritz?
Outcome from meeting: let's get this option added but not change the default for now.
@domoritz I updated this as per our discussion to change the default back to "shared", but now it's tunable which gives us more flexibility for designing the density mark
Thank. I will take a look. Can you merge the latest main and make sure that the CI runs?
@domoritz Thanks! Merge complete and CI ran successfully.