openxlsx icon indicating copy to clipboard operation
openxlsx copied to clipboard

na.strings argument not working correctly

Open nicolairj opened this issue 6 years ago • 2 comments

Expected Behavior

In the previous version of openxlsx, when a cell with NA content was read using the read.xlsx function and the na.strings = "NA" argument, the result was a data.frame with NA.

Actual Behavior

Now, in version 4.1 of openxlsx, it is a NULL object.

Steps to Reproduce the Problem

(please attach an example xlsx file if possible)

  1. Run following R script to read value in attached Excel workbook.

library(openxlsx)

wb_name <- "openxlsx_bug.xlsx" cell_value <- read.xlsx(wb_name, sheet = 1, na.strings = "NA", colNames = F, rowNames = F, detectDates = F, skipEmptyRows = T, skipEmptyCols = T, rows = 3, cols = 3, check.names = F)

is.na(cell_value) is.null(cell_value) cell_value

sessionInfo()

  • Version of openxlsx: 4.1
  • Version of R: 3.5.0

openxlsx_bug.xlsx

nicolairj avatar Jun 08 '18 20:06 nicolairj

More bad behavior with na.strings. I created a small sheet (attached) that looks like this:

#   num  char
# 1   1 hello
# 2 -99   -99
# 3   3     3
# 4 N/A   N/A

I would expect read.xlsx("na-strings-test.xlsx", na.strings = c("N/A", "-99")) to result in this data frame:

### expected result
#     num char 
# 1     1 hello
# 2    NA NA   
# 3     3 3    
# 4    NA NA 

Instead, we get this, where -99 is not identified as NA (in either the character or the numeric column) and the last row of "N/A" is silently dropped:

read.xlsx("test.xlsx", na.strings = c("N/A", "-99"))
#   num  char
# 1   1 hello
# 2 -99   -99
# 3   3     3

The last row is dropped even if we set skipEmptyRow = FALSE

Came out of discussion on this Stack Overflow question.

Using openxlsx version 4.1.0 with R 3.5.2 on Windows 10.

Sample xlsx doc: na-strings-test.xlsx

gregorp avatar Feb 28 '19 16:02 gregorp

Had similar problem and a look at the source... I've added an answer to the StackOverflow question but here it is as well:

Since the replacement of NA values is done by accessing sharedStrings.xml within the .xlsx file, only strings can be replaced with na.strings. Moreover, not all programs make use of sharedStrings, therefore it can help to open your file with Microsoft Excel and save a copy if your NA strings are not replaced using read.xlsx (not an issue in your example).

Regarding the numeric NA values (e.g., -99), I find that they are less of a problem since they do not mess up the data.frame column classes, as do characters within numeric columns. I would suggest using something like this to address the issue:

library(openxlsx)
library(tidyverse)

na_strings <- c("N/A", "-99")
df <- read.xlsx("na-strings-test.xlsx", na.strings = na_strings) %>%
  mutate(across(everything(), \(x) {
    x[as.character(x) %in% na_strings] <- NA
    x
  })) # using mutate here instead of sapply to preserve column types

jkdel avatar Aug 03 '22 10:08 jkdel