Smarter formatting for year channels?

Open mbostock opened this issue 4 years ago • 5 comments

In a case like this (data), it’d be nice to avoid the commas for the year axis:

[Screenshot: Screen Shot 2022-02-19 at 12 54 12 PM]

Plot.plot({
  width,
  color: { legend: true },
  marks: [
    Plot.rectY(overview, { x: "Year", y: "Value", fill: "Type", interval: 1 })
  ]
})

Of course, you can do it with x: {tickFormat: ""}, but could Plot figure this out automatically?
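To make the symptom concrete, here is a minimal sketch in plain JavaScript. It does not call Plot; it only mimics the two behaviors, assuming the comma comes from locale-aware number formatting (the `en-US` locale is an assumption).

```javascript
// Locale-aware formatting groups thousands, which reads badly for years.
const grouped = new Intl.NumberFormat("en-US").format(2022); // "2,022"

// Plain string conversion keeps the digits as they are.
const plain = String(2022); // "2022"

// Grouping can also be switched off explicitly on the formatter.
const ungrouped = new Intl.NumberFormat("en-US", { useGrouping: false }).format(2022); // "2022"
```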

Similarly when you do something like title: "Year", it’s a bummer that the automatic formatting for numbers shows a comma. I think we could maybe track a hint that looks for (case-insensitive) “year” and avoids the comma.

mbostock, Feb 19 '22 20:02

related #355

Fil, Feb 19 '22 22:02

In fact this is one of the first questions that came up in a workshop last Friday. But it was happening in Inputs.table, we hadn't started plotting the data yet. The offending field was called "annee" (French for year).

Fil, Feb 20 '22 08:02

I think this is the most common problem/gotcha for those starting with Plot. Handling it automatically could reduce friction for new users who are not familiar with JavaScript data handling features.

I had posted exactly this question in the forums: https://talk.observablehq.com/t/handling-date-column-during-file-import/6333

https://observablehq.com/@arky/disasters-in-south-eastern-asia-1900-2021

arky, Mar 09 '22 12:03

Here's a related paper. I wonder if there are implementations of something like this out there in open-source land, it's such a common issue.

eagereyes, Mar 09 '22 16:03

Question: a possible strategy would be for the default format to depend on the domain. If the domain is contained in, say, [1500, 2200], set the default formatter to be d => `${d}` rather than Intl.NumberFormat. This heuristic might be slightly surprising in the odd case, but it would fix the very common issue of years being poorly formatted, while still giving nice numbers by default in general.
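As a rough sketch of that strategy (not Plot's implementation): `defaultNumberFormat` is a hypothetical helper, the domain is assumed to be a `[min, max]` pair, and both the year bounds and the `en-US` locale are placeholders.

```javascript
// Hypothetical helper, not part of Plot: choose a default number formatter
// from the scale's domain, given as [min, max].
function defaultNumberFormat([min, max], locale = "en-US") {
  // Assumed year range taken from the suggestion; the exact bounds are a guess.
  const yearLike = min >= 1500 && max <= 2200;
  if (yearLike) return (d) => `${d}`; // no thousands separator
  const format = new Intl.NumberFormat(locale);
  return (d) => format.format(d);
}

const formatYear = defaultNumberFormat([1990, 2022]);
const formatValue = defaultNumberFormat([0, 250000]);
formatYear(2022); // "2022"
formatValue(250000); // "250,000"
```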

Fil, Jun 08 '22 16:06

> If the domain is contained in, say, [1500, 2200], set the default formatter to be d => `${d}` rather than Intl.NumberFormat.

We should also ensure that all channel values are integers (or at least sample the first ~40? channel values). If we see a fractional value such as 1500.1231041240 we should use the default number format rather than assuming it is a year.
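The integer check with sampling could look something like the sketch below. `looksLikeYears` and its options are hypothetical names, the sample size of 40 comes from the comment above, and the year bounds are carried over from the earlier suggestion.

```javascript
// Hypothetical check, not part of Plot: inspect at most `sample` values and
// report whether every one is an integer inside the assumed year range.
function looksLikeYears(values, { sample = 40, min = 1500, max = 2200 } = {}) {
  let seen = 0;
  for (const v of values) {
    if (v == null) continue; // skip missing values
    if (!Number.isInteger(v) || v < min || v > max) return false;
    if (++seen >= sample) break; // stop after the sample is full
  }
  return seen > 0; // an empty channel gives no evidence either way
}

looksLikeYears([1990, 2000, 2010]); // true
looksLikeYears([1500.123104124, 1600]); // false: fractional value
```

Sampling keeps the check cheap on large channels, at the cost of missing a fractional value that appears later in the data.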

mbostock, Dec 01 '22 03:12

If the scale has an interval option that is a year (or a multiple of a year), it seems like we could at least special-case that to use the %Y format and drop the -01-01.

mbostock, Apr 30 '23 02:04

I tried just dropping -01-01 (and -01) from isoformat in #1556, but I think we need to be a little smarter and detect intervals, since otherwise with ordinal scales you are more likely to end up with inconsistent formatting of dates. (Admittedly this is already a problem with sub-daily intervals, such as hours that fall on midnight, but it does exacerbate the problem.)

For example below, we could check the domain of the x scale and choose the shortest format that applies to all of the dates in the domain (YYYY-MM-DD) rather than choosing the shortest format for each value independently.
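A minimal sketch of the domain-wide choice, assuming the domain is an array of valid UTC `Date` objects with years 0 to 9999 (so that `toISOString()` has its usual 24-character form). `formatDomain` is a hypothetical helper, not the isoformat or Plot API.

```javascript
// Hypothetical helper: pick one ISO 8601 precision for a whole domain of
// Dates, instead of shortening each value on its own.
function formatDomain(dates) {
  const isMidnight = (d) =>
    d.getUTCHours() === 0 && d.getUTCMinutes() === 0 &&
    d.getUTCSeconds() === 0 && d.getUTCMilliseconds() === 0;
  let length; // number of leading characters of toISOString() to keep
  if (!dates.every(isMidnight)) length = 24; // full timestamp
  else if (!dates.every((d) => d.getUTCDate() === 1)) length = 10; // YYYY-MM-DD
  else if (!dates.every((d) => d.getUTCMonth() === 0)) length = 7; // YYYY-MM
  else length = 4; // YYYY
  return dates.map((d) => d.toISOString().slice(0, length));
}

formatDomain([new Date("2020-01-01"), new Date("2021-01-01")]); // ["2020", "2021"]
formatDomain([new Date("2020-01-01"), new Date("2020-03-15")]); // ["2020-01-01", "2020-03-15"]
```

In the second call, 2020-01-01 stays at day precision because another date in the domain needs it, which avoids the inconsistent tick labels described above.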

[Screenshot: Screenshot 2023-05-15 at 10 06 17 AM]

mbostock, May 15 '23 17:05

#1790 handles this for temporal data. The only challenge left here is that we have very little signal that these represent years rather than arbitrary numbers (which should have commas). The possible heuristics are:

  1. Look at the channel name, and see it’s “year”.
  2. Look at the values, and check if they are integers in the range 1900–2100 (exact range TBD).

(1) is English only, which isn’t great (and we’d need to do word matching for field names like “sales year”). (2) is brittle; it wouldn’t work well for historical data, and it’ll have some false positives for other data (e.g., melting points of metals in Fahrenheit). But it’s also not the end of the world if there’s a false positive, since the only difference would be a missing comma. And we don’t really need commas for four or fewer digits anyway. So we could extend the heuristic to integers in the range 0–9999.
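The two heuristics could be combined roughly as follows. This is only a sketch under assumptions: `hasYearWord`, `isShortInteger`, and `chooseFormat` are hypothetical names, either heuristic alone is taken as enough to drop the comma, and the `en-US` locale is a placeholder.

```javascript
// Heuristic (1): does the channel label contain the word "year"?
// Splitting on non-letters handles "Year", "sales year" and "sales_year",
// but not camelCase ("salesYear") or other languages ("annee").
function hasYearWord(label) {
  return typeof label === "string" &&
    label.toLowerCase().split(/[^a-z]+/).includes("year");
}

// Heuristic (2), extended: every value is an integer from 0 to 9999.
function isShortInteger(values) {
  return values.length > 0 &&
    values.every((v) => Number.isInteger(v) && v >= 0 && v <= 9999);
}

// Drop the thousands separator when either heuristic applies.
function chooseFormat(label, values, locale = "en-US") {
  const useGrouping = !(hasYearWord(label) || isShortInteger(values));
  const format = new Intl.NumberFormat(locale, { useGrouping });
  return (d) => format.format(d);
}

chooseFormat("Year", [2020, 2021])(2020); // "2020"
chooseFormat("melting point", [1221, 1984])(1984); // "1984" (harmless false positive)
chooseFormat("Population", [12000, 250000])(250000); // "250,000"
```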

mbostock, Aug 23 '23 20:08