Inconsistent `include_na` behaviour when factor levels are given
Given our recent work on factors, I was working on making the docs clearer and adding some examples, and I found some puzzling behaviour.
It's not a readr 1e vs. readr 2e / vroom issue; I suspect it has always been this way.
library(readr)
x <- c("a", "b", "NA")
# element 3 is NA because it matches an `na` string
# no warning, no problems, NA is a factor level
with_edition(1,
  parse_factor(x, levels = c("a", "b"), na = c("", "NA"), include_na = TRUE)
)
#> [1] a b <NA>
#> Levels: a b <NA>
with_edition(2,
  parse_factor(x, levels = c("a", "b"), na = c("", "NA"), include_na = TRUE)
)
#> [1] a b <NA>
#> Levels: a b <NA>
# element 3 is NA because its value is not found in the levels
# YES warning, YES problems, NA is NOT a factor level, despite include_na = TRUE
with_edition(1,
  parse_factor(x, levels = c("a", "b"), na = "", include_na = TRUE)
)
#> Warning: 1 parsing failure.
#> row col expected actual
#> 3 -- value in level set --
#> [1] a b <NA>
#> attr(,"problems")
#> # A tibble: 1 × 4
#> row col expected actual
#> <int> <int> <chr> <chr>
#> 1 3 NA value in level set NA
#> Levels: a b
with_edition(2,
  parse_factor(x, levels = c("a", "b"), na = "", include_na = TRUE)
)
#> Warning: 1 parsing failure.
#> row col expected actual
#> 3 -- value in level set --
#> [1] a b <NA>
#> attr(,"problems")
#> # A tibble: 1 × 4
#> row col expected actual
#> <int> <int> <chr> <chr>
#> 1 3 NA value in level set NA
#> Levels: a b
Created on 2022-03-17 by the reprex package (v2.0.1.9000)
My motivation was to find out whether include_na has any effect when levels are given.
It would seem reasonable to ignore include_na in this case. The user has provided their desired levels. If they want NA in there, they should include NA in the levels.
However, that's not what parse_factor() does.
It seems to honour include_na for an "early NA", i.e. a string that is replaced with NA due to matching one of the na strings.
And it does NOT honour include_na for a "late NA", i.e. a string that is replaced with NA due to not matching any of the levels.
This behaviour feels wrong and is definitely confusing. You basically need to know a lot about the internals to understand what's different about these two scenarios.
It also feels a bit type-unstable. When levels are given explicitly, I would not expect the levels of the resulting factor to depend on what's seen in the data.
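To make the difference concrete, the two cases can be put side by side and their levels compared. This is only a sketch that reuses the calls from the reprex above; `early` and `late` are names made up for illustration, and what it prints depends on the installed readr version. With the version used for the reprex, the first result has NA as a level and the second does not.

```r
library(readr)

x <- c("a", "b", "NA")

# "Early NA": "NA" matches an `na` string before levels are consulted.
early <- parse_factor(x, levels = c("a", "b"), na = c("", "NA"), include_na = TRUE)

# "Late NA": "NA" is not an `na` string, so it only becomes NA because it
# is missing from `levels`. This path warns, so the warning is muffled here.
late <- suppressWarnings(
  parse_factor(x, levels = c("a", "b"), na = "", include_na = TRUE)
)

# Same data, same explicit levels, same include_na: compare the level sets.
levels(early)
levels(late)
identical(levels(early), levels(late))
```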
@sbearrows Can you double-check that all of the above also holds for col_factor(), i.e. when creating a factor via read_csv() (1e, 2e) or vroom()?
Also, is there any behaviour in base::factor() or in forcats that is useful to compare to?
So the behavior for read_csv() is slightly different, but I think we both agree that include_na should be ignored when levels are explicit and that we should not silently add factor levels. For read_csv(), this seems to be an issue only in edition 1. I also tested vroom and didn't see any differences from readr edition 2.
library(readr)
# for edition 1 readr, NA is included in the factor levels
# which we both agree should NOT be happening
ed1 <- with_edition(
  1,
  read_csv("x\na\nb\nNA\n",
    col_types = cols(
      x = col_factor(levels = c("a", "b"), include_na = TRUE)
    ),
    na = c("", "NA")
  )
)
ed1
#> # A tibble: 3 × 1
#> x
#> <fct>
#> 1 a
#> 2 b
#> 3 <NA>
levels(ed1$x)
#> [1] "a" "b" NA
# for edition 2, it's not in factor levels
# include_na = TRUE has no effect here
ed2 <- with_edition(
  2,
  read_csv("x\na\nb\nNA\n",
    col_types = cols(
      x = col_factor(levels = c("a", "b"), include_na = TRUE)
    ),
    na = c("", "NA")
  )
)
ed2
#> # A tibble: 3 × 1
#> x
#> <fct>
#> 1 a
#> 2 b
#> 3 <NA>
levels(ed2$x)
#> [1] "a" "b"
Created on 2022-03-17 by the reprex package (v2.0.1.9000)
And I get the same results in both editions when the value becomes NA because it's not in the explicit levels.
library(readr)
# YES warnings
# NA NOT in levels
ed1_empty <- with_edition(
  1,
  read_csv("x\na\nb\nNA\n",
    col_types = cols(
      x = col_factor(levels = c("a", "b"), include_na = TRUE)
    ),
    na = ""
  )
)
#> Warning: 1 parsing failure.
#> row col expected actual file
#> 3 x value in level set -- literal data
ed1_empty
#> # A tibble: 3 × 1
#> x
#> <fct>
#> 1 a
#> 2 b
#> 3 <NA>
levels(ed1_empty$x)
#> [1] "a" "b"
# same for edition 2
ed2_empty <- with_edition(
  2,
  read_csv("x\na\nb\nNA\n",
    col_types = cols(
      x = col_factor(levels = c("a", "b"), include_na = TRUE)
    ),
    na = ""
  )
)
#> Warning: One or more parsing issues, see `problems()` for details
ed2_empty
#> # A tibble: 3 × 1
#> x
#> <fct>
#> 1 a
#> 2 b
#> 3 <NA>
levels(ed2_empty$x)
#> [1] "a" "b"
Created on 2022-03-17 by the reprex package (v2.0.1.9000)
I haven't looked into base::factor() or forcats yet.
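As a starting point for the base R comparison, here is a minimal sketch of how base::factor() treats NA when levels are given explicitly. It uses only base R, the object names are made up for illustration, and it is not a claim about what readr should do. Note that `exclude` is only a rough analogue of include_na, not the same thing.

```r
x <- c("a", "b", NA)

# Explicit levels, default exclude = NA: NA is not a level.
f_default <- factor(x, levels = c("a", "b"))
levels(f_default)
#> [1] "a" "b"

# exclude = NULL alone does not add NA to explicitly given levels.
f_keep <- factor(x, levels = c("a", "b"), exclude = NULL)
levels(f_keep)
#> [1] "a" "b"

# NA becomes a level only when it is listed in `levels` and not excluded.
f_with_na <- factor(x, levels = c("a", "b", NA), exclude = NULL)
levels(f_with_na)
#> [1] "a" "b" NA

# addNA() adds NA as a level after the fact.
levels(addNA(f_default))
#> [1] "a" "b" NA

# A value outside the given levels becomes NA without any warning.
f_unknown <- factor(c("a", "b", "c"), levels = c("a", "b"))
is.na(f_unknown)
#> [1] FALSE FALSE  TRUE
```

In this sketch the resulting levels never depend on the data: NA ends up as a level only when it is asked for explicitly, which matches the "put NA in the levels yourself" expectation described earlier in the thread.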