Replace nominal/ordinal with categorical to avoid confusion.
Our ordinal type has a long history, dating back to the original version.
I've been feeling more and more that it's a mistake.
The only thing "ordinal" really does in the whole compiler is select the "ordinal" color scheme instead of "category" when the field is used with a color scale.
More importantly, the keyword "ordinal" is incomplete on its own because it doesn't specify the order. Users still have to provide a custom sort order (e.g., "sort": ["small", "medium", "large"]) anyway.
Basically there is also no difference between:
x: {field: 'size', type: 'nominal', sort: ["small", "medium", "large"]}
and
x: {field: 'size', type: 'ordinal', sort: ["small", "medium", "large"]}
So having "ordinal" in the language is simply confusing because the keyword doesn't do anything related to order.
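To make the comparison above concrete, here is a minimal TypeScript sketch of the behavior as described in this issue. It is not Vega-Lite's actual source: `domainOrder` and `defaultColorRange` are hypothetical helpers, and the sketch assumes that order comes only from `sort` (or a default sort) while the type only selects the default color range.

```typescript
// Hypothetical, simplified model of the behavior described in this issue;
// these helpers are illustrative and not part of Vega-Lite.
type DiscreteType = 'nominal' | 'ordinal';

interface FieldDef {
  field: string;
  type: DiscreteType;
  sort?: string[];
}

// Assumption: the domain order is derived from `sort` (or a default
// lexicographic sort) and never from the type keyword.
function domainOrder(def: FieldDef, values: string[]): string[] {
  const unique = values.filter((v, i) => values.indexOf(v) === i);
  const order = def.sort;
  if (order) {
    return unique.sort((a, b) => order.indexOf(a) - order.indexOf(b));
  }
  return unique.sort();
}

// Assumption: the type only changes which named color range is the default.
function defaultColorRange(type: DiscreteType): 'category' | 'ordinal' {
  return type === 'ordinal' ? 'ordinal' : 'category';
}

const sizes = ['large', 'small', 'medium', 'small'];
const sizeOrder = ['small', 'medium', 'large'];

// Both definitions yield the same domain order on a position channel.
const asNominal = domainOrder({field: 'size', type: 'nominal', sort: sizeOrder}, sizes);
const asOrdinal = domainOrder({field: 'size', type: 'ordinal', sort: sizeOrder}, sizes);
```

Under these assumptions the two definitions differ only once the field is mapped to color.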
In a way, some books even just call nominal "unordered categorical" and ordinal "ordered categorical".
I think we should consider deprecating it in VL 5.0 (with backward compatibility so we don't break people's code) so we can simplify both the documentation and the internal code.
For internal code, there are several places where we have to check whether the type is either nominal or ordinal, even though the ordinal bit is never useful beyond the color range.
This is also related to the discussion about adding cyclical type https://github.com/vega/vega-lite/issues/6590
cc: @jheer @arvind @domoritz
How would one quickly switch between ordered and not ordered colors?
Should we also name nominal categorical instead since it’s a more common word?
From my Vega developer perspective: I think ordinal still has value -- primarily for ordinal color scales. I would also assume that a default order (numerical or lexicographic) is applied if an explicit sort is not provided. I think this kind of breaking change (see also: renaming nominal to categorical) causes more trouble than it helps, while also diverting energy away from arguably more important updates.
From my Data Vis teacher perspective: The language VL currently uses aligns with the common terminology taught in many Vis courses and used in most Vis research papers. As a result it also aligns with the VL-based notebook curriculum we put a lot of time and effort into.
I agree with Jeff's position here. Conceptually 'ordinal' is an important category, with practical implications for colour encoding. Its current name provides a common reference with other well-established descriptions of data measurement types in literature and other systems.
(As an aside, I think it a pity that we've lost the distinction between 'interval' and 'ratio' quantitative scales as this too has implications for colour encoding and default zero baselines. But I think the horse has bolted on that one.)
The practical implications for color encoding are a fine reason to keep the ordinal type, but is there a reason to keep ordinal and not add a ratio/interval distinction or the cyclical flag?
Internally, we switched to calling nominal "unordered categorical" and ordinal "ordered categorical," partially for this reason, and partially because I think there is also a distinction between order as metadata on the field and sort as a parameter of the query/encoding.
In addition to the academic arguments for classifying measurement scales, I guess the reasons are governed by
- (i) priorities of the core Vega/Vega-Lite team;
- (ii) the cost of dispensing with legacy labels;
- (iii) learning affordances/barriers of using a particular set of labels;
- (iv) in what ways labels might change default encoding.
Under (iv), a couple of obvious defaults that would distinguish interval from ratio would be to have ratio scales use sequential colour schemes that start with a white point (or black point if dark->light) whereas interval colour schemes start part way along. I would also set scale zero to be true by default for ratio and false for interval. Both can already be achieved with appropriate specs, but it might be nice to have defaults in there.
However weighed up against this are (i) and (ii) above, which are probably questions for the core team.
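As a rough illustration of the defaults suggested under (iv), here is a hedged TypeScript sketch. The `interval` and `ratio` labels are not Vega-Lite types, `quantDefaults` and `startAtEndpoint` are hypothetical names, and the `temperature` field is made up for the example. Only `scale.zero` is an existing Vega-Lite scale property.

```typescript
// Hypothetical sketch: 'interval' and 'ratio' are NOT Vega-Lite types.
type QuantLevel = 'interval' | 'ratio';

interface ScaleDefaults {
  // Include a zero baseline by default?
  zero: boolean;
  // Start the sequential color ramp at its white (or black) end point?
  startAtEndpoint: boolean;
}

// Assumed defaults, following the suggestion in the comment above.
function quantDefaults(level: QuantLevel): ScaleDefaults {
  return level === 'ratio'
    ? {zero: true, startAtEndpoint: true}
    : {zero: false, startAtEndpoint: false};
}

// The same intent expressed explicitly with today's type system:
// interval-like data (e.g. temperature) opts out of the zero baseline.
const temperatureY = {
  field: 'temperature', // made-up field name
  type: 'quantitative',
  scale: {zero: false}
};
```

The explicit fragment shows why this is a question of defaults rather than capability: the spec can already say the same thing, just more verbosely.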
Great discussion everyone. Here are some replies for subtopics:
(@jheer) I would also assume that a default order (numerical or lexicographic) is applied if an explicit sort is not provided.
Yes, but we do the same for nominal too. So the ordinal bit still doesn't mean anything here.
(@domoritz) Should we also name nominal categorical instead since it’s a more common word?
If we were starting today, I'd prefer that too.
But I could see that a breaking change is an argument against this.
(@willium) Is there a reason to keep ordinal and not add [...] the cyclical flag?
Syntax-wise, I think the main benefit of ordinal type is that it has one level less nesting (type: "ordinal") vs (scale: {range: 'ordinal'}).
Since "cyclical" could be discrete, quantitative, or temporal, I don't think we want to add 3 new types: cyclical-quantitative, cyclical-temporal, and whatever we would call the cyclical discrete one.
A scale flag doesn't save a level of nesting either (scale: {cyclical: true}) vs (scale: {range: 'cyclical'}), so I'd rather not add another way of doing the same thing when it's only one character shorter ("true" vs "range").
@jheer @jwoLondon -- I agree that ordinal itself aligns well with our teaching material and I agree that breaking changes ~~can be annoying~~ should be avoided.
Here are my further reflections.
(1) As @jwoLondon mentioned, our type system also doesn't perfectly align with levels/scales of measurement taught in our vis courses because we don't have interval/ratio but instead have temporal/quantitative. It's worth noting that not all interval data are temporal. But "temporal" is the most common interval data. We also need to keep temporal since it signals the use of time scale.
(2) These levels of measurement (nominal/ordinal/interval/ratio) are stats jargon, and thus are less accessible to our broader audience who may not have taken a vis/stats class. Some intro stats courses also don't teach scales of measurement.
- That said, I think nominal and ordinal are at least pretty descriptive, so one shouldn't have to take a class to understand them. (Though categorical is still a more common term. Other vis tools like Tableau/Excel also use the term.)
- However, "interval" as a scale/type name is ambiguous because interval normally means a time period or a range between two points. A time interval actually has a "ratio" type (e.g., 2 minutes is 2x of a minute). I also normally think of a value with interval type (like a date/time) as a point, not an interval between two points. So the term "interval type" is pretty confusing jargon, even if one has learnt about levels of measurement.
- Ratio isn't as confusing as interval, but quantitative is still more accessible than "ratio" because it's more common.
(3) Even if we don't change the public API, we could (and perhaps should) still simplify the internal representation in the compiler.
We can normalize type: 'ordinal' to just nominal or, for color, to type: 'nominal', scale: {range: 'ordinal'}. Then we can eliminate the checks for whether the type is either nominal or ordinal (by checking just for the common categorical type). -- The question is whether we keep the name nominal internally or use the broader term "categorical", which generally could mean either "nominal" or "ordinal".
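A minimal sketch of what such a normalization pass could look like, in TypeScript. The types and the `normalizeOrdinal` function are hypothetical simplifications, not Vega-Lite's internal definitions. The sketch also assumes that an explicit user-provided range should win over the injected 'ordinal' range.

```typescript
// Hypothetical simplified types; NOT Vega-Lite's internal definitions.
type PublicType = 'nominal' | 'ordinal' | 'quantitative' | 'temporal';
type InternalType = Exclude<PublicType, 'ordinal'>;

interface ScaleDef {
  range?: string;
}

interface ChannelDef<T> {
  field: string;
  type: T;
  sort?: string[];
  scale?: ScaleDef;
}

// Rewrite 'ordinal' to 'nominal' so later compiler stages only need to
// check for a single categorical type.
function normalizeOrdinal(
  channel: string,
  def: ChannelDef<PublicType>
): ChannelDef<InternalType> {
  if (def.type !== 'ordinal') {
    return {...def, type: def.type};
  }
  const normalized: ChannelDef<InternalType> = {...def, type: 'nominal'};
  // Assumption: only color needs the 'ordinal' range to preserve today's
  // default output, and an explicit user range is left untouched.
  if (channel === 'color' && def.scale?.range === undefined) {
    normalized.scale = {...def.scale, range: 'ordinal'};
  }
  return normalized;
}

const xDef = normalizeOrdinal('x', {field: 'size', type: 'ordinal'});
const colorDef = normalizeOrdinal('color', {field: 'size', type: 'ordinal'});
```

With a pass like this, `xDef` carries no trace of the ordinal keyword, while `colorDef` keeps the one behavior that currently depends on it.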
With these reflections, we could consider these options for the public API:
a) Keep all the types as is.
b) Add "categorical", "ratio", and "interval", so users can choose to use a common term (categorical vs quantitative) or a scale of measurement (nominal/ordinal/interval/ratio). -- This will help us achieve all of (1) consistency with levels of measurement, (2) more accessible vocabulary, and (3) cleaner internal representation. However, one con is that it introduces multiple ways to do the same thing.
c) Add just "categorical" if we think "interval" is too confusing, so we can still achieve (2) and (3).
None of these options would introduce a breaking change.
Besides echoing the points made by @jheer and @jwoLondon about conceptual/teaching alignment, which I would be very disappointed to lose, my other concern about this proposal is that the work it would take to implement it, and the subsequent churn it would generate across the broader ecosystem (Altair docs, 3rd party tutorials, our teaching materials, etc.), does not feel commensurate with the benefit gained.
Is there sufficiently significant technical debt already accumulated behind the "ordinal" data type (or are we anticipating that keeping it around will accrue non-trivial debt in the future) to warrant this change? Or are we primarily considering it because the upcoming 5.0 release affords the opportunity to make large-scale/breaking changes?
I think internal distinctions are a different discussion (or at least should be) from discussions of the external API. I also don't follow your response to my comment above. Whether or not we sort nominal by default doesn't seem all that relevant to me. The relevant parts are (a) that we do sort by default for ordinal, and (b) that we end up with different default outputs in the relevant cases (e.g., categorical vs. ordinal color) when using nominal versus ordinal types.
In addition, I would wager that "breaking changes can be annoying" is probably a massive understatement for large swaths of the user community -- particularly when it concerns a fundamental piece of the VL grammar.
Also, +100 to @arvind's comment above.
Personally I am generally in favour of using breaking changes only as an exceptional last resort, and the costs of losing Ordinal greatly outweigh the (questionable) benefits of doing so. Adding overlapping alternatives to the API like 'categorical' would probably add more confusion than clarity. FWIW, I'd vote on keeping the existing labels in Vega-Lite.
I would question the reasoning that synonymous use of terms like 'interval' is the block in people's understanding though. After all, the more general use of the term 'quantitative' (as in 'countable') covers many 'ordinal' data too. I think the confusion arises (along with @kanitw's point about 'interval' also meaning a time interval) because students often fail to realise that each measurement scale property applies not only to the named scale, but also to the others 'later' in the sequence (so ratio data also allow intervals to be derived and are also orderable; interval data are also orderable; ordinal data are also identifiable by name). No amount of renaming will solve that problem.
BTW, the other common interval data example that students often stumble on is temperature, especially with respect to zero axis baseline.
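The cumulative nature of the measurement scales described above can be sketched in a few lines of TypeScript. This is purely illustrative: `LEVELS`, `ADDED_PROPERTY`, and `propertiesOf` are hypothetical names, and the property labels are paraphrases rather than terms from any Vega-Lite API.

```typescript
// Levels of measurement in order; each level inherits the properties of
// all levels before it.
const LEVELS = ['nominal', 'ordinal', 'interval', 'ratio'] as const;
type Level = typeof LEVELS[number];

// The property each level adds on top of the previous ones
// (labels are paraphrased for illustration).
const ADDED_PROPERTY: Record<Level, string> = {
  nominal: 'identifiable by name',
  ordinal: 'orderable',
  interval: 'intervals can be derived',
  ratio: 'ratios are meaningful (true zero)'
};

// Collect the properties of a level, including everything it inherits.
function propertiesOf(level: Level): string[] {
  const index = LEVELS.indexOf(level);
  return LEVELS.slice(0, index + 1).map((l) => ADDED_PROPERTY[l]);
}

// Interval data (e.g. temperature) are orderable and identifiable by name,
// but do not carry the true-zero property that ratio data add.
const intervalProps = propertiesOf('interval');
```

Renaming a level changes a key in this table, not the inheritance relationship that students tend to miss.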
@arvind @jheer @jwoLondon -- Let me reiterate my last comment.
With these reflections, we could consider these options for the public API: a) Keep all the types as is. b) Add "categorical", "ratio", and "interval", so users can choose to use a common term (categorical vs quantitative) or a scale of measurement (nominal/ordinal/interval/ratio). -- This will help us achieve all of (1) consistency with levels of measurement, (2) more accessible vocabulary, and (3) cleaner internal representation. However, one con is that it introduces multiple ways to do the same thing. c) Add just "categorical" if we think "interval" is too confusing, so we can still achieve (2) and (3). None of these options would introduce a breaking change.
I'm not discussing any breaking changes at this point because I agree they're bad. I just asked whether adding some new types and promoting them over the old ones might resolve some problems in the design we have.
To clarify, I'm totally ok with doing nothing. After all, (1), (2), and (3) are issues that I might be ok to live with. It just bugs me that we never discuss them.
Given we have the upcoming 5.0 as an opportunity to improve the main part of the language, we should at least acknowledge some problems (like the inconsistency with levels of measurement) and agree on an action plan (even if the plan is to do nothing).
Right now, I'm leaning toward keeping the status quo given the interval term is probably too confusing anyway.
That said, we should look into requiring type less often https://github.com/vega/vega-lite/issues/6636.
FWIW, as I make type optional and remove unnecessary ones, I notice that "ordinal" is a source of confusion even among our team. For example, there are several examples where we incorrectly use "ordinal" for the cars origin field even though it should just be "nominal".
Fortunately, these incorrect specs are being fixed because we no longer need to specify types in many of these instances. (So it's ok to keep existing type system.)
Seeing #7654, I still feel like option c) is probably the right long-term solution.
After all, I think the teaching argument that we need to align with Stevens's levels of measurement is less convincing because our current type system doesn't align with ratio/interval anyway. It's simpler to teach that we have categorical, temporal, and quantitative as the 3 primary data types (+ geojson for maps). No one would get confused.