readr Inconsistent `include_na` behaviour when factor levels are given

Given our recent work on factors, I was working on making the docs clearer and adding some examples.

And I found some puzzling behaviour. It's not a readr 1e vs readr 2e / vroom issue. I guess this has always been weird.

library(readr)

x <- c("a", "b", "NA")

# element 3 is NA because it matches an `na` string
# no warning, no problems, NA is a factor level
with_edition(1,
  parse_factor(x, levels = c("a", "b"), na = c("", "NA"), include_na = TRUE)
)
#> [1] a    b    <NA>
#> Levels: a b <NA>
with_edition(2,
  parse_factor(x, levels = c("a", "b"), na = c("", "NA"), include_na = TRUE)
)
#> [1] a    b    <NA>
#> Levels: a b <NA>

# element 3 is NA because its value is not found in the levels
# YES warning, YES problems, NA is NOT a factor level, despite include_na = TRUE
with_edition(1,
  parse_factor(x, levels = c("a", "b"), na = "", include_na = TRUE)
)
#> Warning: 1 parsing failure.
#> row col           expected actual
#>   3  -- value in level set     --
#> [1] a    b    <NA>
#> attr(,"problems")
#> # A tibble: 1 × 4
#>     row   col expected           actual
#>   <int> <int> <chr>              <chr> 
#> 1     3    NA value in level set NA    
#> Levels: a b
with_edition(2,
  parse_factor(x, levels = c("a", "b"), na = "", include_na = TRUE)
)
#> Warning: 1 parsing failure.
#> row col           expected actual
#>   3  -- value in level set     --
#> [1] a    b    <NA>
#> attr(,"problems")
#> # A tibble: 1 × 4
#>     row   col expected           actual
#>   <int> <int> <chr>              <chr> 
#> 1     3    NA value in level set NA    
#> Levels: a b

Created on 2022-03-17 by the reprex package (v2.0.1.9000)

My motivation was to find out whether include_na has any effect when levels are given.

It would seem reasonable to ignore include_na in this case. The user has provided their desired levels. If they want NA in there, they should include NA in the levels.

However, that's not what parse_factor() does.

It seems to honour include_na for an "early NA", i.e. a string that is replaced with NA due to matching one of the na strings.

And it does NOT honour include_na for a "late NA", i.e. a string that is replaced with NA due to not matching any of the levels.
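To make the "early NA" vs "late NA" distinction concrete, here is a rough conceptual sketch in base R. The helper `sketch_parse()` is hypothetical and is NOT how readr is implemented; it only assumes a two-stage model in which `na` strings are matched first and the level set is checked second.

```r
# Hypothetical two-stage model of the behaviour described above.
# This is an illustration only, not readr's actual implementation.
sketch_parse <- function(x, levels, na) {
  early_na <- x %in% na                      # stage 1: matches an `na` string
  late_na  <- !early_na & !(x %in% levels)   # stage 2: not in the level set
  data.frame(value = x, early_na = early_na, late_na = late_na)
}

x <- c("a", "b", "NA")

# "NA" matches an `na` string, so it is flagged at stage 1
res_early <- sketch_parse(x, levels = c("a", "b"), na = c("", "NA"))
res_early

# "NA" is not an `na` string, so it only fails at stage 2
res_late <- sketch_parse(x, levels = c("a", "b"), na = "")
res_late
```

Under this model, the same input string ends up as NA in both calls, but for different reasons, which is the distinction a user currently has to know about.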

This behaviour feels wrong and is definitely confusing. You basically need to know a lot about the internals to understand what's different about these two scenarios.

It also feels a bit type unstable. In the case of explicit levels, I would not expect the levels of the resulting factor to depend on what's seen in the data.
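As a point of comparison for the type-stability concern, here is a small base R sketch (base `factor()` only, no readr). It shows the property I would expect: with explicit levels, the level set does not depend on the values seen in the data. The variable names are just for illustration.

```r
# Base R only: with explicit levels, the caller fixes the level set.
lvls <- c("a", "b")

f_clean <- factor(c("a", "b"), levels = lvls)
f_extra <- factor(c("a", "b", "NA"), levels = lvls)  # "NA" is not a level

levels(f_clean)
#> [1] "a" "b"
levels(f_extra)
#> [1] "a" "b"

# The unmatched value becomes NA, but the levels are unchanged
is.na(f_extra)
#> [1] FALSE FALSE  TRUE

stable <- identical(levels(f_clean), levels(f_extra))
stable
#> [1] TRUE
```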

Mar 17 '22 17:03 jennybc

@sbearrows Can you double check that all of the above is also seen for col_factor()? I.e. when creating a factor via read_csv() (1e, 2e) or vroom()?

Also, is there any behaviour for base::factor() or in forcats that is useful to compare to?
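For reference, a minimal base R sketch of how base::factor() treats NA when levels are explicit. It uses only base functions (`factor()`, `addNA()`) and says nothing about what readr or forcats do.

```r
x <- c("a", "b", NA)

# Default exclude = NA: NA is dropped from the levels even if listed
f_default <- factor(x, levels = c("a", "b", NA))
levels(f_default)
#> [1] "a" "b"

# NA becomes a level only when it is listed AND not excluded
f_explicit <- factor(x, levels = c("a", "b", NA), exclude = NULL)
levels(f_explicit)
#> [1] "a" "b" NA

# exclude = NULL on its own does not add NA to explicit levels
f_no_na <- factor(x, levels = c("a", "b"), exclude = NULL)
levels(f_no_na)
#> [1] "a" "b"

# addNA() adds NA as a level after the fact
f_added <- addNA(factor(x, levels = c("a", "b")))
levels(f_added)
#> [1] "a" "b" NA
```

So in base R, explicit levels win: NA is a level only if the caller asks for it explicitly, either in `levels` (with `exclude = NULL`) or via `addNA()`.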

Mar 17 '22 17:03 jennybc

The behavior for read_csv() is slightly different, but I think we both agree that include_na should be ignored when levels are given explicitly, and that we should not be silently adding factor levels. For read_csv(), this seems to be an issue only in edition 1. I also tested vroom and didn't see any differences from readr edition 2.

library(readr)

# for edition 1 readr, NA is included in the factor levels
# which we both agree should NOT be happening

ed1 <- with_edition(
  1,
  read_csv("x\na\nb\nNA\n",
    col_types = cols(
      x = col_factor(levels = c("a", "b"), include_na = TRUE)
    ),
    na = c("", "NA")
  )
)
ed1
#> # A tibble: 3 × 1
#>   x    
#>   <fct>
#> 1 a    
#> 2 b    
#> 3 <NA>
levels(ed1$x)
#> [1] "a" "b" NA

# for edition 2, it's not in factor levels
# include_na = TRUE has no effect here

ed2 <- with_edition(
  2,
  read_csv("x\na\nb\nNA\n",
    col_types = cols(
      x = col_factor(levels = c("a", "b"), include_na = TRUE)
    ),
    na = c("", "NA")
  )
)
ed2
#> # A tibble: 3 × 1
#>   x    
#>   <fct>
#> 1 a    
#> 2 b    
#> 3 <NA>
levels(ed2$x)
#> [1] "a" "b"

Created on 2022-03-17 by the reprex package (v2.0.1.9000)

And I get the same result in both editions when the value is NA because it's not among the explicit levels.

library(readr)
# YES warnings
# NA NOT in levels
ed1_empty <- with_edition(
  1,
  read_csv("x\na\nb\nNA\n",
    col_types = cols(
      x = col_factor(levels = c("a", "b"), include_na = TRUE)
    ),
    na = ""
  )
)
#> Warning: 1 parsing failure.
#> row col           expected actual         file
#>   3   x value in level set     -- literal data
ed1_empty
#> # A tibble: 3 × 1
#>   x    
#>   <fct>
#> 1 a    
#> 2 b    
#> 3 <NA>
levels(ed1_empty$x)
#> [1] "a" "b"

# same for edition 2
ed2_empty <- with_edition(
  2,
  read_csv("x\na\nb\nNA\n",
    col_types = cols(
      x = col_factor(levels = c("a", "b"), include_na = TRUE)
    ),
    na = ""
  )
)
#> Warning: One or more parsing issues, see `problems()` for details
ed2_empty
#> # A tibble: 3 × 1
#>   x    
#>   <fct>
#> 1 a    
#> 2 b    
#> 3 <NA>
levels(ed2_empty$x)
#> [1] "a" "b"

Created on 2022-03-17 by the reprex package (v2.0.1.9000)

I haven't looked into base::factor() or forcats yet.

Mar 18 '22 00:03 sbearrows