
Importing SAS formats catalogue with negative format values

Open Adamishere opened this issue 9 months ago • 4 comments

Passing along an issue presented in the R haven package, which appears to be an upstream issue with ReadStat that haven uses: https://github.com/tidyverse/haven/issues/768.

To summarize, in the attached example (test.zip), if you have a sas7bdat file (test.sas7bdat) with a single numeric variable named x with values -7, 1, and 2 and a SAS format catalog file that defines the format (format.sas7bcat):

proc format;
value testf
-7="Missing"
1="Yes"
2="No"
;
run;

The format value -7 = "Missing" gets imported by haven (using ReadStat) as -0.625 = "Missing". They also noted that they can reproduce this error in pyreadstat as well and suggested it may be an upstream issue with ReadStat.

Some additional investigation by me (not in the attached example) suggests a deterministic pattern between the original SAS format values and the transformed ReadStat values. I noticed that the lagged difference of the imported values stays constant over runs whose lengths double (1, 2, 4, ...), and each time the lagged difference changes, it decreases by a factor of 4 (e.g., 2.00 -> 0.50 -> 0.125).

SAS format value   Imported value   Lagged difference of imported value
-1                 -4.0000000       N/A
-2                 -2.0000000       2.000000000
-3                 -1.5000000       0.500000000
-4                 -1.0000000       0.500000000
-5                 -0.8750000       0.125000000
-6                 -0.7500000       0.125000000
-7                 -0.6250000       0.125000000
-8                 -0.5000000       0.125000000
...                ...              ...
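
One hypothesis that would reproduce the table, offered as a guess rather than anything confirmed in this thread: the catalog stores a negative value with all 64 bits of its IEEE 754 double inverted, and the reader then negates the result as it would for a positive value. The helper name `misread` below is made up for illustration and is not part of ReadStat, haven, or pyreadstat.

```python
import struct


def misread(value: float) -> float:
    """Hypothetical model of the faulty import: invert all 64 bits, then negate."""
    # Reinterpret the double as an unsigned 64-bit integer.
    (bits,) = struct.unpack("<Q", struct.pack("<d", value))
    # Assumption: this is how a negative value is stored in the catalog.
    flipped = bits ^ 0xFFFFFFFFFFFFFFFF
    (as_double,) = struct.unpack("<d", struct.pack("<Q", flipped))
    # Negation is the step that is assumed to be correct only for positive values.
    return -as_double


# Print the modelled imported value for -1 through -8, to seven decimals.
for v in range(-1, -9, -1):
    print(v, f"{misread(float(v)):.7f}")
```

Rounded to seven decimals, the modelled values agree with the eight rows listed above. Under this assumption the doubling run lengths would follow from the inverted mantissa bits being read at successively smaller exponents.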

Adamishere, Mar 31 '25 18:03

Interesting. The relevant code is here:

https://github.com/WizardMac/ReadStat/blob/3438f3431911899ba52566180f258f405a53b12e/src/sas/readstat_sas7bcat_read.c#L104-L113

Positive values are encoded as negative double-precision floating points, but it looks like negative values aren't just sign-flipped positive values.
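
A minimal sketch of one encoding that fits this description, assuming (not confirmed here) that the catalog uses the common order-preserving scheme for doubles: non-negative values have only the sign bit flipped, and negative values have every bit flipped. The function names are hypothetical and this is not ReadStat code; missing-value tags are ignored.

```python
import struct

SIGN_BIT = 1 << 63
ALL_BITS = (1 << 64) - 1


def to_bits(value: float) -> int:
    """Reinterpret a double as an unsigned 64-bit integer."""
    return struct.unpack("<Q", struct.pack("<d", value))[0]


def from_bits(bits: int) -> float:
    """Reinterpret an unsigned 64-bit integer as a double."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


def encode_sortable(value: float) -> int:
    """Assumed stored form: flip the sign bit if non-negative, all bits if negative."""
    bits = to_bits(value)
    return bits ^ ALL_BITS if bits & SIGN_BIT else bits ^ SIGN_BIT


def decode_sortable(stored: int) -> float:
    """Inverse of encode_sortable: the stored top bit says which branch was taken."""
    bits = stored ^ SIGN_BIT if stored & SIGN_BIT else stored ^ ALL_BITS
    return from_bits(bits)


values = [float(v) for v in range(-300, 3)]
stored = [encode_sortable(v) for v in values]
decoded = [decode_sortable(s) for s in stored]
```

With this scheme the stored integers sort in the same order as the original numbers, which is one plausible reason a format catalog would use it. A reader that only negates the reinterpreted double handles the first branch correctly but not the second.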

evanmiller, May 20 '25 14:05

Would be great if you could supply another test file with more negative values that I can inspect.

evanmiller, May 22 '25 18:05

Attached is an updated example with formats for values -300 to 2, where each format label is just the number as a character string, e.g., "-300".

test2.zip

Hope it helps!

Adamishere, May 22 '25 20:05

Great! I think this should do the job? https://github.com/WizardMac/ReadStat/commit/974a3fe7d3047098a7d9c4d30a5f317be146479b

evanmiller, May 22 '25 21:05