Communicate information about filtered data points

Open yurivish opened this issue 4 years ago • 4 comments

It would be useful if exploratory plots came with a visual indicator of “discarded data”.

This would improve Plot's capacity for exploratory data analysis by enabling users to become aware of anomalous values that violate their assumptions about the data.

For example, I changed a scale from log to symlog and discovered a bunch of negative values where I wasn’t expecting any.

The data was supposed to be strictly positive and the negative values indicated a processing error, but since the default log scale filtered those data points out I only noticed because I went out of my way to do additional spot checks.

Plot could have made it evident immediately, e.g. with a legend saying something like “100 datapoints not shown”. Even more useful (maybe) would be being able to see a "data pipeline" and how many points are filtered out at each stage.

@fil observes that some filters use the discarding as a basic mechanism to do their work as intended, so there are subtle questions about what to communicate for this to be a useful signal.

For the exploratory use case I think it makes sense for this to be on by default, since spot-checking every individual assumption manually can get onerous (e.g. checking for null/undefined, zeros where there shouldn’t be any, negative numbers where there shouldn’t be any, values outside of the x/y/color domain, NaN, etc.).
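To make the kind of spot check described above concrete, here is a minimal sketch in plain JavaScript. The function name auditForLogScale and its category names are hypothetical and not part of Plot; it only tallies the values a log scale could not place, grouped by reason, so that a "N datapoints not shown" message could be built from the result.

```javascript
// Hypothetical helper (not part of Plot): tally the values that a log scale
// could not place, grouped by the reason they would be dropped.
function auditForLogScale(data, accessor) {
  const counts = {nullish: 0, nonFinite: 0, zero: 0, negative: 0, shown: 0};
  for (const d of data) {
    const v = accessor(d);
    if (v == null) {
      counts.nullish++; // null or undefined
      continue;
    }
    const n = Number(v);
    if (!Number.isFinite(n)) counts.nonFinite++; // NaN or ±Infinity
    else if (n === 0) counts.zero++; // log(0) is not finite
    else if (n < 0) counts.negative++; // log of a negative number is NaN
    else counts.shown++;
  }
  return counts;
}

// Made-up sample values, for illustration only.
const sample = [{v: 10}, {v: -3}, {v: 0}, {v: null}, {v: NaN}, {v: 250}];
const audit = auditForLogScale(sample, (d) => d.v);
const hidden = sample.length - audit.shown;
console.log(`${hidden} datapoints not shown`, audit);
```

Keeping a separate count per reason, rather than one total, is what would let a message distinguish an expected omission from a surprising one such as negative values in data assumed to be strictly positive.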

A separate tool such as a summary table could be used to learn about missing/pathological data in a dataset, but it would still be useful for Plot to flag these issues since they can creep in during downstream processing and plot transformations.

yurivish avatar Aug 10 '21 17:08 yurivish

The scale.unknown option can be used to this effect (see examples).

Fil avatar Sep 27 '21 22:09 Fil

This would now happen, I guess, in the default filter: https://github.com/observablehq/plot/blob/9d9ba917b5eb3f58e35b81e6555f3565ecffe8eb/src/plot.js#L291. However, with each warning we need to indicate a way to fix the situation, and in this case I wouldn't know what to say, in particular because in many charts some data is ignored on purpose.

Fil avatar Jul 18 '22 16:07 Fil

As an additional twist on this, it would be great if we could provide informative error messages for two seemingly common cases:

  • somebody gets the capitalization of a key wrong, e.g., city instead of City
  • somebody misspells the name of a key, e.g., delivert instead of delivery

Likely candidates for capitalization errors could be found by comparing the key provided to all the keys in the input object in a way that ignores case (i.e., converting both to lowercase before comparing).

Detecting misspellings is more complex than that, possibly requiring Levenshtein distance with a threshold (or finding the closest match and suggesting that).

The latter is an expensive operation, but it would only have to be run when there's an error (or a presumed error), and it would mostly delay the error message, not interfere with normal Plot operation.
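The two checks described above can be sketched in plain JavaScript as follows. This is only an illustration under stated assumptions: suggestKey, its maxDistance threshold of 2, and the sample row are all hypothetical and not part of Plot. The case-insensitive comparison runs first because it is cheap, and the Levenshtein search runs only when that fails.

```javascript
// Levenshtein edit distance via dynamic programming, keeping one row at a time.
function levenshtein(a, b) {
  let prev = Array.from({length: b.length + 1}, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1, // deletion
        curr[j - 1] + 1, // insertion
        prev[j - 1] + cost // substitution (or match)
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// Hypothetical helper (not part of Plot): given a key that may be wrong,
// return the most likely intended key of the datum, or null if none is close.
function suggestKey(key, datum, maxDistance = 2) {
  const keys = Object.keys(datum);
  if (keys.includes(key)) return null; // the key exists; nothing to suggest

  // Cheap check first: same key, different capitalization.
  const lower = key.toLowerCase();
  const caseMatch = keys.find((k) => k.toLowerCase() === lower);
  if (caseMatch !== undefined) return caseMatch;

  // More expensive check: closest key by edit distance, within the threshold.
  let best = null;
  let bestDistance = Infinity;
  for (const k of keys) {
    const d = levenshtein(lower, k.toLowerCase());
    if (d < bestDistance) {
      best = k;
      bestDistance = d;
    }
  }
  return bestDistance <= maxDistance ? best : null;
}

// Made-up datum, for illustration only.
const row = {City: "Oslo", delivery: "2022-07-27"};
console.log(suggestKey("city", row)); // capitalization error
console.log(suggestKey("delivert", row)); // misspelling
console.log(suggestKey("population", row)); // no plausible match
```

The threshold matters: without it, the closest key would always be suggested, even for a key that has nothing to do with any field in the data.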

eagereyes avatar Jul 27 '22 00:07 eagereyes

It would be useful to also generate a warning when the given data as a whole is nullish, e.g. Plot.lineY(undefined, { x: 'date', y: 'population' }). I've been going a little bit nuts trying to figure out which one of the 7-8 plots on my dashboard was throwing a seemingly random Error: missing scale: y.

Not a fault of Plot that my data is broken of course, but a message like Error: lineY data series is undefined or something more to the point would at least have helped narrow it down.

The documentation does state that "Missing and invalid data are handled specifically for each mark type and channel", but in my (admittedly limited) testing this seems to hold true only for individual datums, not for the series as a whole. Simply replacing undefined with [] did the trick in my case for the mark throwing errors.
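Until something like this exists in Plot itself, a small guard on the calling side can produce the more pointed message. The sketch below is plain JavaScript; requireData and orEmpty are hypothetical names, not Plot APIs, and the error text simply follows the wording suggested above.

```javascript
// Hypothetical guard (not part of Plot): fail early, naming the mark whose
// data series is null or undefined.
function requireData(markName, data) {
  if (data == null) {
    throw new Error(`${markName} data series is ${data}`);
  }
  return data;
}

// Hypothetical alternative: substitute an empty series, as in the
// workaround of replacing undefined with [].
function orEmpty(data) {
  return data ?? [];
}

// Intended use, assuming Plot is loaded and `series` may be nullish:
//   Plot.lineY(requireData("lineY", series), {x: "date", y: "population"})
//   Plot.lineY(orEmpty(series), {x: "date", y: "population"})

try {
  requireData("lineY", undefined);
} catch (error) {
  console.log(error.message);
}
console.log(orEmpty(undefined).length);
```

The first variant suits debugging a dashboard, where knowing which plot has the broken series is the goal; the second suits production, where an empty chart may be preferable to a thrown error.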

mstade avatar Nov 20 '23 18:11 mstade