read_sas() bug with SAS formats

Open dwilson98-kermit opened this issue 1 year ago • 6 comments

read_sas() fails to appropriately process SAS formats that include negative values. For example, if you have a .sas7bdat file with a single numeric variable named x with values -7, 1, and 2 and a SAS format catalog file that defines the format:

proc format; value testf -7="Missing" 1="Yes" 2="No" ; run;

Then the labels attribute of the x column in the resulting data frame shows a value of -0.625 instead of -7:

``` r
library(haven)
bweights <- read_sas('C:/Test/test.sas7bdat', catalog_file='C:/Test/formats.sas7bcat')
attributes(bweights$x)
#> $format.sas
#> [1] "TESTF"
#> 
#> $class
#> [1] "haven_labelled" "vctrs_vctr"     "double"        
#> 
#> $labels
#> Missing     Yes      No 
#>  -0.625   1.000   2.000
```

[test.zip](https://github.com/user-attachments/files/18101365/test.zip)

dwilson98-kermit avatar Dec 11 '24 19:12 dwilson98-kermit

I'd be curious enough to try and look into this, but I don't have access to SAS on my personal machine. Would you be able to provide the relevant bdat/bcat files?

gowerc avatar Mar 03 '25 17:03 gowerc

> I'd be curious enough to try and look into this, but I don't have access to SAS on my personal machine. Would you be able to provide the relevant bdat/bcat files?

You can find a zip file with an example SAS data set and SAS catalog file here: https://github.com/user-attachments/files/18101365/test.zip

For what it's worth, Pyreadstat is able to correctly parse the format values and labels. I thought that Haven and Pyreadstat relied on the same underlying C code to read SAS data sets but perhaps they use separate code for parsing SAS catalog files.

dwilson98-kermit avatar Mar 03 '25 18:03 dwilson98-kermit

Looks like it is a bug in the underlying readstat library. At least, calling the library directly from my own code, I am seeing the same result:

```cpp
extern "C" {
    #include "readstat.h"
}

#include <cstdio>
#include <iostream>

// Value-label callback: prints each label/value pair found in the catalog.
int handle_var_label(const char *val_labels, readstat_value_t value, const char *label, void *ctx) {
    std::cout << "---Var Label---" << std::endl;
    std::cout << "label = " << label << std::endl;
    std::cout << "value = ";

    readstat_type_t type = readstat_value_type(value);
    if (!readstat_value_is_system_missing(value)) {
        if (type == READSTAT_TYPE_STRING) {
            std::cout << readstat_string_value(value);
        } else if (type == READSTAT_TYPE_INT8) {
            // Cast so the value prints as a number rather than as a character.
            std::cout << static_cast<int>(readstat_int8_value(value));
        } else if (type == READSTAT_TYPE_INT16) {
            std::cout << readstat_int16_value(value);
        } else if (type == READSTAT_TYPE_INT32) {
            std::cout << readstat_int32_value(value);
        } else if (type == READSTAT_TYPE_FLOAT) {
            std::cout << readstat_float_value(value);
        } else if (type == READSTAT_TYPE_DOUBLE) {
            std::cout << readstat_double_value(value);
        }
    }
    std::cout << "\n " << std::endl;
    // ctx is the user pointer handed to readstat_parse_sas7bcat (the counter in main).
    ++*static_cast<int *>(ctx);
    return 0;
}

int main() {
    int my_count = 0;
    readstat_error_t error = READSTAT_OK;
    readstat_parser_t *parser = readstat_parser_init();
    readstat_set_value_label_handler(parser, &handle_var_label);
    error = readstat_parse_sas7bcat(parser, "data/xxx_formats.sas7bcat", &my_count);
    readstat_parser_free(parser);
    if (error != READSTAT_OK) {
        printf("Error processing\n");
        return 1;
    }
    printf("Found %d records\n", my_count);
    return 0;
}
```
```
---Var Label---
label = Missing
value = -0.625
 
---Var Label---
label = Yes
value = 1
 
---Var Label---
label = No
value = 2
```

Unfortunately, this likely means you'll need to re-raise the issue at https://github.com/WizardMac/ReadStat
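
As an aside, the -0.625 is at least numerically consistent with a sign-handling slip. This is an assumption, not something confirmed in this thread: if the catalog stored a negative value with every bit of its IEEE 754 double inverted, and the reader only negated what it read, then -7 would come back as a number just short of -0.625. The Python sketch below only checks that arithmetic; `misdecode` is a hypothetical helper and not part of readstat.

```python
import struct

MASK64 = 0xFFFFFFFFFFFFFFFF

def misdecode(true_value: float) -> float:
    """Hypothetical model of the symptom (an assumption, not readstat code):
    invert all 64 bits of the double, then recover it by negation alone."""
    # Reinterpret the double as a 64-bit unsigned integer.
    (bits,) = struct.unpack(">Q", struct.pack(">d", true_value))
    # Invert every bit and reinterpret the result as a double again.
    (stored,) = struct.unpack(">d", struct.pack(">Q", bits ^ MASK64))
    # Negating alone does not undo a full bit inversion.
    return -stored

print(misdecode(-7.0))  # -0.6249999999999999
```

At the default 6 significant digits of a C++ stream, that value prints as -0.625.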

--EDIT--

> For what it's worth, Pyreadstat is able to correctly parse the format values and labels. I thought that Haven and Pyreadstat relied on the same underlying C code to read SAS data sets but perhaps they use separate code for parsing SAS catalog files.

Apologies, I had missed this... that's weird. I work with the developer of pyreadstat, so I will message them tomorrow to try and see how/why they are getting the correct values.

gowerc avatar Mar 05 '25 21:03 gowerc

@dwilson98-kermit

> For what it's worth, Pyreadstat is able to correctly parse the format values and labels. I thought that Haven and Pyreadstat relied on the same underlying C code to read SAS data sets but perhaps they use separate code for parsing SAS catalog files.

Err, would you be able to double-check this? I just tried a toy example and I am seeing the exact same issue in pyreadstat as well:

```sas
proc format library=mylib;
    value myformatA
    -7="Missing"
    1="Yes"
    2="No"
    ;
    value myformatB
    -7="is -7"
    1="is 1"
    2="is 2"
    ;
run;
```

```python
>>> pyreadstat.pyreadstat.read_sas7bdat(
...     "./jdat.sas7bdat",
...     catalog_file="./formats.sas7bcat"
... )
(     X     Y
0  Yes  is 1
1   No  is 2
2   No  is 2
3 -7.0  -7.0
4  Yes  is 1

>>> pyreadstat.pyreadstat.read_sas7bcat("./formats.sas7bcat")[1].value_labels
{'MYFORMATA': {-0.6249999999999999: 'Missing', 1.0: 'Yes', 2.0: 'No'}, 'MYFORMATB': {-0.6249999999999999: 'is -7', 1.0: 'is 1', 2.0: 'is 2'}}
```

This looks like an upstream bug with readstat so I think you'd need to create an issue there.
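
One rough way to spot this symptom in other files is to compare the values actually present in a column with the keys of its value labels. A minimal Python sketch using the numbers from the toy example above; `unlabelled_values` is a hypothetical helper, not a pyreadstat function, and the check is only a heuristic, since a format may legitimately leave some values unlabelled or define labels that never occur.

```python
def unlabelled_values(observed, labels):
    """Return the distinct observed values that have no key in the labels mapping."""
    return sorted(set(observed) - set(labels))

# Underlying values of X and the labels of MYFORMATA, copied from the output above.
x_values = [1.0, 2.0, 2.0, -7.0, 1.0]
myformata = {-0.6249999999999999: "Missing", 1.0: "Yes", 2.0: "No"}

# Values in the data that no label covers.
missing = unlabelled_values(x_values, myformata)
# Label keys that never occur in the data.
unused = sorted(set(myformata) - set(x_values))

print(missing)  # [-7.0]
print(unused)   # [-0.6249999999999999]
```

A negative value without a label, paired with an unused negative label key, matches the pattern seen in this issue.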

gowerc avatar Mar 06 '25 11:03 gowerc

Unfortunately, I misread the output from Pyreadstat. You are correct: Pyreadstat is not able to parse the format values correctly either. Thanks for looking into this.

dwilson98-kermit avatar Mar 06 '25 14:03 dwilson98-kermit

This upstream commit may fix the issue: https://github.com/WizardMac/ReadStat/commit/974a3fe7d3047098a7d9c4d30a5f317be146479b

evanmiller avatar May 26 '25 13:05 evanmiller

I think this was closed with #777

szimmer avatar Jul 10 '25 20:07 szimmer

Confirming that this was fixed in #777!

gorcha avatar Nov 23 '25 13:11 gorcha