openxlsx
openxlsx copied to clipboard
na.strings argument not working correctly
Expected Behavior
In the previous version of openxlsx, when a cell with NA content was read using the read.xlsx function and the na.strings = "NA" argument, the result was a data.frame with NA.
Actual Behavior
Now, in version 4.1 of openxlsx, it is a NULL object.
Steps to Reproduce the Problem
(please attach an example xlsx file if possible)
- Run following R script to read value in attached Excel workbook.
library(openxlsx)
wb_name <- "openxlsx_bug.xlsx" cell_value <- read.xlsx(wb_name, sheet = 1, na.strings = "NA", colNames = F, rowNames = F, detectDates = F, skipEmptyRows = T, skipEmptyCols = T, rows = 3, cols = 3, check.names = F)
is.na(cell_value) is.null(cell_value) cell_value
sessionInfo()
- Version of openxlsx: 4.1
- Version of R: 3.5.0
More bad behavior with na.strings
. I created a small sheet (attached) that looks like this:
# num char
# 1 1 hello
# 2 -99 -99
# 3 3 3
# 4 N/A N/A
I would expect read.xlsx("na-strings-test.xlsx", na.strings = c("N/A", "-99"))
to result in this data frame:
### expected result
# num char
# 1 1 hello
# 2 NA NA
# 3 3 3
# 4 NA NA
Instead, we get this, where -99
is not identified as NA
(in either the character or the numeric column) and the last row of "N/A"
is silently dropped:
read.xlsx("test.xlsx", na.strings = c("N/A", "-99"))
# num char
# 1 1 hello
# 2 -99 -99
# 3 3 3
The last row is dropped even if we set skipEmptyRow = FALSE
Came out of discussion on this Stack Overflow question.
Using openxlsx
version 4.1.0 with R 3.5.2 on Windows 10.
Sample xlsx doc: na-strings-test.xlsx
Had similar problem and a look at the source... I've added an answer to the StackOverflow question but here it is as well:
Since the replacement of NA values is done by accessing sharedStrings.xml within the .xlsx file, only strings can be replaced with na.strings
. Moreover, not all programs make use of sharedStrings
, therefore it can help to open your file with Microsoft Excel and save a copy if your NA strings are not replaced using read.xlsx
(not an issue in your example).
Regarding the numeric NA values (e.g., -99), I find that they are less of a problem since they do not mess up the data.frame
column classes, as do characters within numeric columns. I would suggest using something like this to address the issue:
library(openxlsx)
library(tidyverse)
na_strings <- c("N/A", "-99")
df <- read.xlsx("na-strings-test.xlsx", na.strings = na_strings) %>%
mutate(across(everything(), \(x) {
x[as.character(x) %in% na_strings] <- NA
x
})) # using mutate here instead of sapply to preserve column types