ggplot2 stat_count gives cryptic error when used on a column of doubles

Run this trivial code (csv is attached):

library(tidyverse)

data <- read_csv("data.csv")

data %>%
  ggplot(aes(x=Tenure)) +
  geom_bar()

You get this warning:

Warning message:
Computation failed in `stat_count()`:
Elements must equal the number of rows or 1

This is the resulting plot: (image not included)

The data: data.csv

Expected behavior: it should "just work". Every column in this tibble is a double. I reviewed the geom_bar documentation, and I see nothing saying this should not work.

If I have done something wrong here, then this becomes a feature request for a useful error message or improved documentation.

Sep 08 '21 20:09 arencambre

Bar plots are for categorical data, histograms for numerical data. You're trying to make a bar plot from numerical data. That makes no sense. Try geom_histogram() or turn your data into a factor.
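For reference, the two routes suggested above might look like the following sketch. The data frame is a synthetic stand-in for the attached CSV, and `tenure_df` and its values are hypothetical.

```r
library(ggplot2)

# Synthetic stand-in for the attached CSV (hypothetical values)
tenure_df <- data.frame(Tenure = c(1.8, 1.8, 2.5, 2.5, 2.5, 3.0))

# Route 1: treat each distinct value as a category, so geom_bar() counts per level
p_bar <- ggplot(tenure_df, aes(x = factor(Tenure))) +
  geom_bar()

# Route 2: keep the column numeric and bin it instead
# (the bin width of 0.5 is an arbitrary choice for this toy data)
p_hist <- ggplot(tenure_df, aes(x = Tenure)) +
  geom_histogram(binwidth = 0.5)
```

Printing `p_bar` or `p_hist` draws the respective plot.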

Sep 08 '21 20:09 clauswilke

I was wondering if that was the case, and I sympathize with your point.

While you are correct that the Tenure column is numerical data, it still has only 36 unique values (categories) across 835 observations (confirmed via unique(data$Tenure)).

data <- read_csv("data.csv", col_types = "c") does make it work, but this seems unnecessary since, again, there are only 36 unique values.

All my bloviating aside, it would be great if the error message or documentation could be adjusted to help with cases like this.

Sep 08 '21 20:09 arencambre

Aha, rounding the values causes it to work: data <- read_csv("data.csv") %>% mutate(Tenure = round(Tenure, 2))

Something odd is going on here. I wonder if it's getting tripped up by values that are almost, but not exactly, equal as doubles? E.g., row 8 of the CSV is 1.7999999999999998.
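A quick base R check of that suspicion, using the value quoted from row 8 next to a plain 1.8 (the 1.8 is an assumption about what the neighboring rows contain):

```r
a <- 1.8
b <- 1.7999999999999998    # value quoted from row 8 of the CSV

a == b                     # FALSE: the two doubles are not identical
as.character(c(a, b))      # "1.8" "1.8": both get the same label at 15 significant digits
length(unique(c(a, b)))    # 2: unique() compares the doubles exactly
nlevels(factor(c(a, b)))   # 1: factor() builds its levels from the character labels
```

So code that counts distinct values with unique() and code that groups through a factor can disagree on how many groups there are.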

Sep 08 '21 21:09 arencambre

Minimal reprex:

library(ggplot2)

df <- data.frame(x = rep(c(1, 2), 5) + rep(c(0, -2.220446e-16), c(4, 1)))
df
#>    x
#> 1  1
#> 2  2
#> 3  1
#> 4  2
#> 5  1
#> 6  2
#> 7  1
#> 8  2
#> 9  1
#> 10 2
ggplot(df, aes(x)) + geom_bar()
#> Warning: Computation failed in `stat_count()`:
#> Elements must equal the number of rows or 1

Created on 2022-03-15 by the reprex package (v2.0.1)

Since this seems like a floating-point buglet, I think it's worth taking a look to see what's going wrong.

Mar 15 '22 14:03 hadley

It seems the problem is that the criterion for treating two values as the "same" differs between vctrs::vec_unique() (which is used in unique0()) and as.factor() (which is used in tapply()).

https://github.com/tidyverse/ggplot2/blob/a979ffd26cdb456d54e2671c2eed16c65bc878b7/R/stat-count.r#L79-L89

df <- data.frame(x = rep(c(1, 2), 5) + rep(c(0, -2.220446e-16), c(4, 1)))
ggplot2:::unique0(df$x)
#> [1] 1 2 1 2

tapply(rep(1, times = nrow(df)), df$x, sum, na.rm = TRUE)
#> 1 2 
#> 5 5
as.factor(df$x)
#>  [1] 1 2 1 2 1 2 1 2 1 2
#> Levels: 1 2

Created on 2022-07-23 by the reprex package (v2.0.1)

Jul 23 '22 06:07 yutannihilation

In such a case, should we add a tolerance or treat them as unequal? If they are treated as unequal, we could replace tapply() with rowsum().
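A rough sketch of the two options on the reprex data, in base R only. The tolerance of 8 decimal places is an arbitrary assumption for illustration, not a proposed default.

```r
x <- rep(c(1, 2), 5) + rep(c(0, -2.220446e-16), c(4, 1))
w <- rep(1, length(x))   # unit weights, as in the tapply() call above

# Option A: treat nearly equal values as unequal.
# rowsum() groups on the exact double values, so four groups come back.
exact <- rowsum(w, x)
nrow(exact)              # 4

# Option B: apply a tolerance first by rounding, then group.
# The two near-duplicates fold into 1 and 2.
tol <- rowsum(w, round(x, 8))
nrow(tol)                # 2
```

Option A keeps the number of counts in step with an exact unique(), while Option B changes which values are considered distinct in the first place.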

Nov 13 '23 10:11 teunbrand

I'd say we follow whatever vec_unique does.
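For illustration, this is what counting looks like when it follows vctrs semantics on the reprex data. It uses vctrs::vec_count() as a stand-in to show the behavior; it is not the stat_count() implementation.

```r
library(vctrs)

x <- rep(c(1, 2), 5) + rep(c(0, -2.220446e-16), c(4, 1))

# Keys in order of first appearance; near-duplicates stay separate
keys <- vec_unique(x)

# One row per key, with counts that line up with vec_unique()
counts <- vec_count(x, sort = "location")
counts$count             # 4 4 1 1
```

With this approach the number of counts always matches the number of unique values, which is the condition the original warning complains about.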

Nov 13 '23 13:11 hadley