read_sav fails if two string variable names are exactly 5 characters + one digit
Several days of debugging led to this. If there is
- a data frame
- with at least two string variables
- and both string variables are longer than 255 bytes
- and the variable names are almost the same
- and they end with a consecutive number (e.g. 1 and 2)
- and the variable names are exactly 5 characters long (not counting the number at the end)
then read_sav will fail with data written by write_sav.
library(haven)
# Create a dataset with at least two string variables which are greater than 255 bytes.
x = paste(rep('a', 256), collapse = '')
dat = data.frame(x, x, stringsAsFactors = FALSE)
# This produces an error during read_sav
colnames(dat) = c('aaaaa1', 'aaaaa2')
write_sav(dat, 'test.sav')
read_sav('test.sav')
# This works fine
colnames(dat) = c('aaaa1', 'aaaa2')
write_sav(dat, 'test.sav')
read_sav('test.sav')
I think I can confirm this issue with my own data set when reading it in with read_sav. My original variable names are QB5B_1 and QB5B_2, and reading the file in fails with:
New names:
* QB5B_1 -> QB5B_1...73
* QB5B_1 -> QB5B_1...74
The data is read in; however, the problematic character column is split after 255 characters and new columns are created (each with 255 characters).
Reprex:
library(haven)
path <- tempfile()
# Dataset with >=2 string variables that are >255 bytes
x <- paste(rep('a', 256), collapse = '')
df <- data.frame(aaaaa1 = x, aaaaa2 = x)
write_sav(df, path)
names(read_sav(path))
#> New names:
#> * AAAAA1 -> AAAAA1...1
#> * AAAAA1 -> AAAAA1...2
#> [1] "AAAAA1...1" "AAAAA1...2" "aaaaa2"
Created on 2021-04-08 by the reprex package (v2.0.0)
@evanmiller looks like a ReadStat issue?
Very likely a ReadStat issue - SPSS splits up long variables and names the components internally with a 5-character stem and a sequential number. This of course leads to clashes with real variable names.
The current logic for writing internal variable names can be found here:
https://github.com/WizardMac/ReadStat/blob/69f55186ae615a14a3367ad5cd08b7829aa8f308/src/spss/readstat_sav_write.c#L93-L104
This will likely need to be updated to match SPSS's behavior.
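To make the clash easier to see, here is a rough sketch in R. It is not ReadStat's or SPSS's actual algorithm: the function segment_names() is hypothetical, and the assumptions that the stem is the first five characters in upper case and that the counter runs across all variables starting at 1 are mine, chosen only to illustrate the idea described above.

```r
# Hypothetical illustration only; NOT the real ReadStat or SPSS naming logic.
# Assumption: every extra 255-byte segment of a long string variable gets an
# internal name made of a 5-character upper-case stem plus a running counter.
segment_names <- function(var_names, n_bytes, seg_size = 255L) {
  counter <- 0L
  out <- list()
  for (i in seq_along(var_names)) {
    # Number of segments needed beyond the first one
    n_extra <- ceiling(n_bytes[i] / seg_size) - 1L
    stem <- toupper(substr(var_names[i], 1, 5))
    segs <- character(0)
    for (j in seq_len(n_extra)) {
      counter <- counter + 1L
      segs <- c(segs, paste0(stem, counter))
    }
    out[[var_names[i]]] <- segs
  }
  out
}

# Internal segment names that collide with real (upper-cased) variable names
find_clashes <- function(var_names, n_bytes) {
  segs <- unname(unlist(segment_names(var_names, n_bytes)))
  intersect(segs, toupper(var_names))
}

clash_5 <- find_clashes(c("aaaaa1", "aaaaa2"), c(256, 256))
clash_4 <- find_clashes(c("aaaa1", "aaaa2"), c(256, 256))
```

Under these assumptions the two 6-character names from the original report collide with the generated segment names, while the 5-character names do not, which matches the reported behaviour. The real rules may well differ in the details.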
Just wondering: has this bug been fixed in ReadStat, and is the fix included in haven yet? I'm asking because I came across this very unfortunate bug again, and it occurs not only with the naming pattern mentioned in the initial post. In my case the columns have names:
- Q16r996oe
- Q17r996_17oe
- Q30r996oe
So they a) don't end with a number and b) are not exactly 5 characters long.
I was able to work around the bug by renaming these vars to e.g. "Q16other_oe", "Q17other_oe" and so on.
However, some other naming options didn't solve the problem, e.g. "Q16r996_oe_rec" didn't work.
It also made a difference where these vars occur in the data set, so position also played a role.
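Since it seems hard to predict which names are affected, one option is to test each candidate renaming directly. Below is a minimal sketch, assuming haven is installed; long_string_cols() and sav_names_roundtrip() are helper names I made up for this post, not part of haven.

```r
# Names of character columns holding at least one value longer than 255 bytes.
long_string_cols <- function(df) {
  is_long <- vapply(
    df,
    function(col) {
      is.character(col) && any(nchar(col, type = "bytes") > 255, na.rm = TRUE)
    },
    logical(1)
  )
  names(df)[is_long]
}

# TRUE if the column names survive a write_sav()/read_sav() round trip.
# Only checks names, not the cell contents.
sav_names_roundtrip <- function(df) {
  path <- tempfile(fileext = ".sav")
  on.exit(unlink(path))
  haven::write_sav(df, path)
  identical(names(haven::read_sav(path)), names(df))
}
```

Calling long_string_cols(dat) lists the columns that could be split on writing, and sav_names_roundtrip(dat) can be re-run after each renaming attempt to see whether the names come back unchanged. Because position in the data set also seems to matter, the check should be run on the full data frame rather than on a subset of columns.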