Facet sort order not applied in presence of an aggregate

Open jakevdp opened this issue 4 years ago • 2 comments

The facet sort order works as expected with non-aggregated data (editor):

{
  "data": {"url": "data/cars.json"},
  "mark": "bar",
  "encoding": {
    "row": {
      "type": "nominal",
      "field": "Origin",
      "sort": ["Japan", "Europe", "USA"]
    },
    "x": {"type": "quantitative", "field": "Horsepower"}
  }
}

But if an aggregation is added to the x-axis, the order is no longer respected (editor):

{
  "data": {"url": "data/cars.json"},
  "mark": "bar",
  "encoding": {
    "row": {
      "type": "nominal",
      "field": "Origin",
      "sort": ["Japan", "Europe", "USA"]
    },
    "x": {"type": "quantitative", "field": "Horsepower", "aggregate": "mean"}
  }
}

Reported in https://github.com/altair-viz/altair/issues/1683

jakevdp avatar Sep 03 '19 13:09 jakevdp

Are there any updates on this issue?

jess-sorensen avatar May 11 '22 20:05 jess-sorensen

Context: altair-viz/altair#3386

Vega editor links for faceted bar charts of the barley yield example dataset, with the attempted ordering: non-aggregate || aggregate

As @joelostblom suggested, I've had a look at how the Vega spec is produced for both.

Intended sort order:

"sort": [ "Waseca", "Morris", "University Farm",
        "Grand Rapids", "Crookston", "Duluth" ],

Observations

tl;dr: the sort order is defined in both the aggregate and non-aggregate charts (here, as column_site_sort_index), but anything that tries to use it is disregarded in the aggregate version.

The sort order appears to come from a data reference block (I don't know the actual term!) named data_0, starting at line 616. It contains a formula transform that sets the intended ordering by site:

{
      "name": "data_0",
      "source": "data-093ece8c35bb2d41094cfb6138ec810b",
      "transform": [
        {
          "type": "formula",
          "expr": "datum[\"site\"]===\"Waseca\" ? 0 : datum[\"site\"]===\"Morris\" ? 1 : datum[\"site\"]===\"University Farm\" ? 2 : datum[\"site\"]===\"Grand Rapids\" ? 3 : datum[\"site\"]===\"Crookston\" ? 4 : datum[\"site\"]===\"Duluth\" ? 5 : 6",
          "as": "column_site_sort_index"
        },
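
To make the formula above easier to read, here is a minimal Python sketch that rebuilds the same nested-ternary expression from the sort array. The helper names `sort_index_expr` and `sort_index` are hypothetical and only for illustration; this is not how Vega-Lite generates the expression internally.

```python
def sort_index_expr(field, order):
    """Build a nested-ternary expression mapping each value in `order` to its position."""
    clauses = [
        f'datum["{field}"]==="{value}" ? {position}'
        for position, value in enumerate(order)
    ]
    # Values that are not listed fall through to len(order), so they sort last.
    return " : ".join(clauses + [str(len(order))])


def sort_index(value, order):
    """Plain-Python equivalent of evaluating the expression for one value."""
    return order.index(value) if value in order else len(order)


SITES = ["Waseca", "Morris", "University Farm",
         "Grand Rapids", "Crookston", "Duluth"]

expr = sort_index_expr("site", SITES)
print(expr)
```

With the six sites from the intended sort order, `expr` matches the unescaped "expr" string in the excerpt, and any site outside the list gets index 6.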

The rest of the data_0 block differs: the non-aggregate version defines a stack, the aggregate version defines an aggregate. The non-aggregate version also includes a sort object with an empty definition, "sort": {"field": [], "order": []}, which seems to have no effect on the rendered chart.

The next block is named column_domain, which seems to have an effect on the ordering of the labels in the non-aggregate version: changing ops: ["max"] to ops: ["exponential"] at line 647, for example, changes the labels to the "unordered" version, while the columns themselves do not change:

https://github.com/vega/vega-lite/assets/7524620/2151d6f7-0b9e-4db3-80df-851ef8823310

It seems to have no effect on the aggregate version, as that is already 'unordered'.

Similarly, changing the sort in the column_header block (judging by the name, the column headers?) from e.g. ascending to descending at line 699 changes the order of the labels, but only for the non-aggregate version:

https://github.com/vega/vega-lite/assets/7524620/2b876eaa-3993-4c99-9820-a3eaf86dca51

The sort order property at line 752 affects the ordering of the columns; this time only the bars change, as opposed to the headings:

https://github.com/vega/vega-lite/assets/7524620/5746c437-2cfb-43ff-be2f-3fc99a3a542f

Theory

Not being familiar with Vega, the following is a guess as to why the behaviour is as it seems:

Is Vega-Lite producing a Vega spec that defines a sort order for the underlying dataset, rather than the produced aggregate?

I am speculating based on: 1) the order in which the blocks appear in the Vega spec and 2) that the sources to which the ordering transform is applied have the same identifier that looks like a hash: data-093ece8c35bb2d41094cfb6138ec810b.

So that source is sorted; and the non-aggregate version uses that sorted source (see marks at 760, using yield_end and yield_start), whereas the aggregate version obviously uses the aggregate (["sum_yield"]).

This might be a red herring as both seem to use the same facet definition:

"from": {
        "facet": {
          "name": "facet",
          "data": "data_0",
          "groupby": ["site"],
          "aggregate": {
            "fields": ["column_site_sort_index"],
            "ops": ["max"],
            "as": ["column_site_sort_index"]
          }
        }

However, if I remove the aggregate block and replace it with stack in the data block, and then change both the y and y2 definitions in marks, as well as the y scale definition, I get the correct ordering... albeit by fundamentally changing the chart!

image

updated spec
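
For anyone else reading the facet definition above, here is a rough plain-Python imitation of what it asks for: group rows by site, take the max of column_site_sort_index per group, and order the facets by that value. This is a sketch under assumptions, not Vega's actual dataflow. The function `facet_order` and the sample rows are made up for illustration, and the fallback to first-seen order when the index is missing is my own choice, not verified Vega behaviour.

```python
def facet_order(rows, key="site", index_field="column_site_sort_index"):
    """Order groups by the max of `index_field`, imitating groupby + ops: ["max"]."""
    maxima = {}
    for row in rows:
        group = row[key]
        value = row.get(index_field)
        current = maxima.setdefault(group, None)
        if value is not None and (current is None or value > current):
            maxima[group] = value
    # Assumption: groups without an index keep their first-seen order, after indexed ones.
    return sorted(maxima, key=lambda g: (maxima[g] is None, maxima[g] or 0))


# Illustrative rows only; the real dataset has more sites and fields.
rows = [
    {"site": "Duluth", "column_site_sort_index": 5},
    {"site": "Waseca", "column_site_sort_index": 0},
    {"site": "Morris", "column_site_sort_index": 1},
    {"site": "Waseca", "column_site_sort_index": 0},
]

# Hypothetical case: the rows reaching the facet no longer carry the sort index.
stripped = [{"site": row["site"]} for row in rows]

print(facet_order(rows))      # rows that still carry the index
print(facet_order(stripped))  # rows without the index
```

If the rows reaching the facet still carry the index, the facets come out in the intended order; if the field were missing at that point (a guess, in line with the theory above), nothing would be left to sort on.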

bertiebaggio avatar Apr 01 '24 14:04 bertiebaggio