read_sas() bug with SAS formats
read_sas() fails to appropriately process SAS formats that include negative values. For example, if you have a .sas7bdat file with a single numeric variable named x with values -7, 1, and 2 and a SAS format catalog file that defines the format:
``` sas
proc format;
  value testf
    -7 = "Missing"
     1 = "Yes"
     2 = "No";
run;
```
Then the attributes of the imported column (`bweights$x` in the example below) show a label value of -0.625 instead of -7.
``` r
library(haven)
bweights <- read_sas('C:/Test/test.sas7bdat', catalog_file='C:/Test/formats.sas7bcat')
attributes(bweights$x)
#> $format.sas
#> [1] "TESTF"
#>
#> $class
#> [1] "haven_labelled" "vctrs_vctr"     "double"
#>
#> $labels
#> Missing     Yes      No
#>  -0.625   1.000   2.000
```
[test.zip](https://github.com/user-attachments/files/18101365/test.zip)
I'd be curious enough to try and look into this, but I don't have access to SAS on my personal machine. Would you be able to provide the relevant sas7bdat/sas7bcat files?
You can find a zip file with an example SAS data set and SAS catalog file here: https://github.com/user-attachments/files/18101365/test.zip
For what it's worth, Pyreadstat is able to correctly parse the format values and labels. I thought that Haven and Pyreadstat relied on the same underlying C code to read SAS data sets but perhaps they use separate code for parsing SAS catalog files.
This looks like a bug in the underlying ReadStat library. Calling the library directly from my own code, I see the same result:
``` cpp
extern "C" {
#include "readstat.h"
}
#include <cstdio>
#include <iostream>

// Called once per value label in the catalog; prints the label and its value.
int handle_var_label(const char *val_labels, readstat_value_t value, const char *label, void *ctx) {
    // ctx is the counter passed to readstat_parse_sas7bcat() below.
    ++*static_cast<int *>(ctx);
    std::cout << "---Var Label---" << std::endl;
    std::cout << "label = " << label << std::endl;
    std::cout << "value = ";
    readstat_type_t type = readstat_value_type(value);
    if (!readstat_value_is_system_missing(value)) {
        if (type == READSTAT_TYPE_STRING) {
            std::cout << readstat_string_value(value);
        } else if (type == READSTAT_TYPE_INT8) {
            // Cast so the value prints as a number rather than a character.
            std::cout << static_cast<int>(readstat_int8_value(value));
        } else if (type == READSTAT_TYPE_INT16) {
            std::cout << readstat_int16_value(value);
        } else if (type == READSTAT_TYPE_INT32) {
            std::cout << readstat_int32_value(value);
        } else if (type == READSTAT_TYPE_FLOAT) {
            std::cout << readstat_float_value(value);
        } else if (type == READSTAT_TYPE_DOUBLE) {
            std::cout << readstat_double_value(value);
        }
    }
    std::cout << "\n " << std::endl;
    return 0;
}

int main() {
    int my_count = 0;
    readstat_error_t error = READSTAT_OK;
    readstat_parser_t *parser = readstat_parser_init();
    readstat_set_value_label_handler(parser, &handle_var_label);
    error = readstat_parse_sas7bcat(parser, "data/xxx_formats.sas7bcat", &my_count);
    readstat_parser_free(parser);
    if (error != READSTAT_OK) {
        printf("Error processing\n");
        return 1;
    }
    printf("Found %d records\n", my_count);
    return 0;
}
```
```
---Var Label---
label = Missing
value = -0.625
---Var Label---
label = Yes
value = 1
---Var Label---
label = No
value = 2
```
Unfortunately this likely means you'll need to re-raise the issue at https://github.com/WizardMac/ReadStat
--EDIT--
> For what it's worth, Pyreadstat is able to correctly parse the format values and labels. I thought that Haven and Pyreadstat relied on the same underlying C code to read SAS data sets but perhaps they use separate code for parsing SAS catalog files.
Apologies, I had missed this. That's weird. I work with the developer of pyreadstat, so I will message them tomorrow to try and see how and why they are getting the correct values.
@dwilson98-kermit
> For what it's worth, Pyreadstat is able to correctly parse the format values and labels. I thought that Haven and Pyreadstat relied on the same underlying C code to read SAS data sets but perhaps they use separate code for parsing SAS catalog files.
Err, would you be able to double-check this? I just tried a toy example and I am seeing the exact same issue in pyreadstat as well:
``` sas
proc format library=mylib;
  value myformatA
    -7 = "Missing"
     1 = "Yes"
     2 = "No"
  ;
  value myformatB
    -7 = "is -7"
     1 = "is 1"
     2 = "is 2"
  ;
run;
```
``` python
>>> pyreadstat.pyreadstat.read_sas7bdat(
...     "./jdat.sas7bdat",
...     catalog_file="./formats.sas7bcat"
... )
(     X     Y
0   Yes  is 1
1    No  is 2
2    No  is 2
3  -7.0  -7.0
4   Yes  is 1
>>> pyreadstat.pyreadstat.read_sas7bcat("./formats.sas7bcat")[1].value_labels
{'MYFORMATA': {-0.6249999999999999: 'Missing', 1.0: 'Yes', 2.0: 'No'}, 'MYFORMATB': {-0.6249999999999999: 'is -7', 1.0: 'is 1', 2.0: 'is 2'}}
```
This looks like an upstream bug with readstat so I think you'd need to create an issue there.
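As a side note, the odd value is at least arithmetically consistent with one storage scheme. This is only an assumption on my part and I have not checked it against the ReadStat source: suppose the catalog stores each double in a sort-friendly form, with non-negative values having just the sign bit flipped and negative values having all 64 bits inverted. A reader that only negates the stored double would then recover 1 and 2 correctly but turn -7 into -0.6249999999999999, which is what shows up above. The helper names below (`encode_sortable`, `decode_negate_only`, `decode_full`) are made up for this sketch.

```python
import struct

MASK64 = 0xFFFFFFFFFFFFFFFF
SIGN_BIT = 0x8000000000000000


def double_to_bits(x: float) -> int:
    """Return the IEEE-754 bit pattern of a double as an unsigned 64-bit int."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]


def bits_to_double(bits: int) -> float:
    """Interpret an unsigned 64-bit int as an IEEE-754 double."""
    return struct.unpack(">d", struct.pack(">Q", bits))[0]


def encode_sortable(x: float) -> int:
    """ASSUMED storage scheme: invert all bits of negatives, flip the sign bit otherwise."""
    bits = double_to_bits(x)
    if bits & SIGN_BIT:
        return ~bits & MASK64
    return bits ^ SIGN_BIT


def decode_negate_only(stored: int) -> float:
    """Decoder that only negates the stored double (correct for non-negative inputs only)."""
    return -bits_to_double(stored)


def decode_full(stored: int) -> float:
    """Decoder that undoes both branches of encode_sortable()."""
    if stored & SIGN_BIT:
        # Sign bit set in storage means the original value was non-negative.
        return bits_to_double(stored ^ SIGN_BIT)
    # Sign bit clear in storage means the original value was negative.
    return bits_to_double(~stored & MASK64)


for value in (-7.0, 1.0, 2.0):
    stored = encode_sortable(value)
    print(value, hex(stored), decode_negate_only(stored), decode_full(stored))
```

Under that assumption, the -7 row prints -0.6249999999999999 for the negate-only decoder and -7.0 for the full one, while 1 and 2 come out the same either way. That would explain why only the negative code is affected.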
Unfortunately, I misread the output from Pyreadstat. You are correct, Pyreadstat is not able to parse the format values correctly. Thanks for looking into this.
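For anyone else checking their own files, a quick way to spot this kind of mismatch is to compare the codes present in the data with the keys of the value-label mapping. The sketch below uses plain Python containers filled in by hand from the toy example above. The function name `unmatched_codes` is hypothetical, and in practice the inputs would come from whatever reader you use.

```python
def unmatched_codes(values, value_labels):
    """Return (codes in the data with no label, label keys never seen in the data)."""
    seen = set(values)
    keys = set(value_labels)
    return sorted(seen - keys), sorted(keys - seen)


# Raw codes of column X and the MYFORMATA labels, copied from the toy example.
x_values = [1.0, 2.0, 2.0, -7.0, 1.0]
myformata = {-0.6249999999999999: "Missing", 1.0: "Yes", 2.0: "No"}

unlabelled, unused = unmatched_codes(x_values, myformata)
print("codes without a label:", unlabelled)
print("labels never used:", unused)
```

Here -7.0 is reported as a code without a label and -0.6249999999999999 as a label that never matches any data, which is the symptom described in this issue.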
This upstream commit may fix the issue: https://github.com/WizardMac/ReadStat/commit/974a3fe7d3047098a7d9c4d30a5f317be146479b
I think this was closed with #777
Confirming that this was fixed in #777!