read_sav fails if two string variable names are exactly 5 characters + one digit

Open ookiiwani opened this issue 5 years ago • 4 comments

Several days of debugging led to this. If there is

  • a data frame
  • with at least two string variables
  • and both string variables are longer than 255 bytes
  • and the variable names are almost the same
  • and they end with a consecutive number (e.g. 1 and 2)
  • and the variable name is exactly 5 characters long (not counting the number at the end)

then read_sav will fail with data written by write_sav.

library(haven)

# Create a dataset with at least two string variables that are longer than 255 bytes.
x = paste(rep('a', 256), collapse = '')
dat = data.frame(x, x, stringsAsFactors = FALSE)

# This produces an error during read_sav
colnames(dat) = c('aaaaa1', 'aaaaa2')
write_sav(dat, 'test.sav')
read_sav('test.sav')

# This works fine
colnames(dat) = c('aaaa1', 'aaaa2')
write_sav(dat, 'test.sav')
read_sav('test.sav')

ookiiwani, Oct 14 '20 19:10

I think I can confirm this issue with my data set when trying to read in data with read_sav. My original variable names are QB5B_1 and QB5B_2, and reading the file in fails with:

New names:
* QB5B_1 -> QB5B_1...73
* QB5B_1 -> QB5B_1...74

The data is read in; however, the problematic character column is split after 255 characters and new columns are created (each with 255 characters).
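Since only string columns longer than 255 bytes seem to be affected, it can help to list them before writing. A minimal base R sketch; `long_string_columns` is a hypothetical helper name, not part of haven, and the 255-byte limit is taken from the reports in this thread:

```r
# Hypothetical helper (not part of haven): return the names of character
# columns that contain at least one value longer than `limit` bytes.
long_string_columns <- function(df, limit = 255L) {
  is_long <- vapply(df, function(col) {
    # nchar(type = "bytes") counts bytes, which is what matters here,
    # not the number of displayed characters.
    is.character(col) && any(nchar(col, type = "bytes") > limit)
  }, logical(1))
  names(df)[is_long]
}

# Example with one long and one short string column.
example <- data.frame(
  long_text  = strrep("a", 256),
  short_text = "abc",
  stringsAsFactors = FALSE
)
long_string_columns(example)
#> [1] "long_text"
```

Columns flagged this way are the ones worth checking after reading the file back in.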

deschen1, Dec 21 '20 14:12

Reprex:

library(haven)
path <- tempfile()

# Dataset with >=2 string variables that are >255 bytes
x <- paste(rep('a', 256), collapse = '')
df <- data.frame(aaaaa1 = x, aaaaa2 = x)

write_sav(df, path)
names(read_sav(path))
#> New names:
#> * AAAAA1 -> AAAAA1...1
#> * AAAAA1 -> AAAAA1...2
#> [1] "AAAAA1...1" "AAAAA1...2" "aaaaa2"

Created on 2021-04-08 by the reprex package (v2.0.0)

@evanmiller looks like a ReadStat issue?

hadley, Apr 08 '21 21:04

Very likely a ReadStat issue - SPSS splits up long variables and names the components internally with a 5-character stem and a sequential number. This of course leads to clashes with real variable names.
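To illustrate the kind of clash described above, here is a toy model in R. It is not ReadStat's or SPSS's actual logic: the helper names `internal_names` and `find_clashes`, the upper-casing of the stem, and numbering the segments from 1 are all assumptions made for illustration.

```r
# Toy model only (assumption, not ReadStat's code): a long string variable
# is split into segments named <first 5 characters, upper-cased><counter>.
internal_names <- function(name, n_segments, stem_len = 5L) {
  stem <- toupper(substr(name, 1L, stem_len))
  paste0(stem, seq_len(n_segments))
}

# Report the real variable names (compared case-insensitively) that also
# appear among the generated segment names.
find_clashes <- function(var_names, n_segments = 2L) {
  generated <- unlist(lapply(var_names, internal_names, n_segments = n_segments))
  intersect(toupper(var_names), generated)
}

find_clashes(c("aaaaa1", "aaaaa2"))
#> [1] "AAAAA1" "AAAAA2"

find_clashes(c("aaaa1", "aaaa2"))
#> character(0)
```

Under these assumptions the 5-character-plus-digit names from the original report collide with the generated names, while the 4-character variants do not. The model covers only that simple case.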

The current logic for writing internal variable names can be found here:

https://github.com/WizardMac/ReadStat/blob/69f55186ae615a14a3367ad5cd08b7829aa8f308/src/spss/readstat_sav_write.c#L93-L104

This will likely need to be updated to match SPSS's behavior.

evanmiller, Apr 12 '21 16:04

Just wondering: has this bug been fixed in ReadStat, and has the fix made it into haven? I'm asking because I ran into this very unfortunate bug again, and it occurs not only with the naming pattern mentioned in the initial post. In my case the columns have these names:

  • Q16r996oe
  • Q17r996_17oe
  • Q30r996oe

So they a) don't end with a number and b) are not exactly 5 characters long.

I was able to work around the bug by renaming these variables to e.g. "Q16other_oe", "Q17other_oe" and so on.

However, some other naming options didn't solve the problem, e.g. "Q16r996_oe_rec" didn't work.

It also made a difference where these vars occur in the data set, so position also played a role.
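Because it is hard to predict which names and positions are safe, a round-trip check on the column names can at least catch the problem before the file is passed on. A minimal sketch, assuming haven is installed; `check_roundtrip_names` and its `writer`/`reader` arguments are hypothetical names introduced here, not part of haven:

```r
# Hypothetical helper (not part of haven): write a data frame to a temporary
# file, read it back, and compare the column names before and after.
# The defaults use haven; other writer/reader pairs can be passed in.
check_roundtrip_names <- function(df,
                                  writer = haven::write_sav,
                                  reader = haven::read_sav) {
  path <- tempfile(fileext = ".sav")
  on.exit(unlink(path), add = TRUE)

  writer(df, path)
  # Suppress the "New names:" messages; the comparison below reports the
  # same information in a form that can be checked programmatically.
  reread <- names(suppressMessages(reader(path)))

  list(
    ok       = identical(reread, names(df)),
    original = names(df),
    reread   = reread
  )
}

# Usage (requires haven):
# x <- strrep("a", 256)
# res <- check_roundtrip_names(data.frame(aaaaa1 = x, aaaaa2 = x))
# res$ok
# res$reread
```

If `ok` is `FALSE`, comparing `original` with `reread` shows which columns were renamed or split, which makes it easier to try alternative names one at a time.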

deschen1, Jul 23 '21 10:07