ENH: Add support for reading 110-format Stata dta files
- [x] closes #47176
- [x] Tests added and passed if fixing a bug or adding a new feature
- [x] All code checks passed.
- [ ] Added type annotations to new arguments/methods/functions.
- [ ] Added an entry in the latest `doc/source/whatsnew/vX.X.X.rst` file if fixing a bug or adding a new feature.
This change adds support for reading 110-format (Stata 7) dta files. A test data file is included in the same style as for the other supported versions.
I have now added the whatsnew line as requested.
cc @bashtage
Do we have documentation that there are no differences between 110 and 111? This seems to be the assumption here.
There is a difference between 110 and 111: the 110 format uses the older typlist codes, which limit string variables to a maximum of 80 characters. There is official documentation for the 110 format in the Stata 7 manual:
However, I have not been able to track down any documentation for the already supported 111 format to compare with.
Small comment. Is it really the case that there are no differences between 108 and 110 aside from the version number? Have you done a diff of the documentation to verify?
The differences between 108 and 110 are that the maximum variable name length was increased from 8 to 32 characters (https://www.stata.com/stata7/language.html#longnames) and the expansion record length field was increased from 2 to 4 bytes. The display format now also allows European decimals (https://www.stata.com/stata7/language.html#andmore), but that doesn't make a difference to the file structure.
My understanding is that the 110 and 111 formats are the same, other than different typlist encodings, which would make sense as they're both for Stata 7, but for editions with different limits. It appears that Stata 7/SE was released around a year after Stata 7/IC and Small Stata 7 (see 01feb2002 entry in https://www.stata.com/help.cgi?whatsnew7), which might explain the lack of documentation.
Assuming that the 111 format is implemented correctly, it seems that the 113 format is the same as 111, except that the values used to encode missing values were changed to allow 26 additional missing codes (see https://www.stata.com/help.cgi?whatsnew7to8).
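To illustrate the 108 to 110 change in the expansion record length field, here is a minimal sketch of skipping over expansion fields. It is not the reader code from this PR. The function name `skip_expansion_fields` is hypothetical, and the record layout (a 1-byte data type, then the length, then the payload, with a data type of zero ending the sequence) is my assumption about the classic dta layout rather than something stated in this thread.

```python
import io
import struct


def skip_expansion_fields(buf, version: int, byteorder: str = "<") -> int:
    """Skip all expansion fields and return the number of payload bytes skipped.

    Assumption: each record is a 1-byte data type followed by a length and
    then the payload, and a record with data type 0 terminates the sequence.
    """
    # Length field is a 2-byte integer up to format 108, 4-byte from 110 on.
    len_fmt = byteorder + ("i" if version >= 110 else "h")
    header_size = 1 + struct.calcsize(len_fmt)
    skipped = 0
    while True:
        header = buf.read(header_size)
        if len(header) != header_size:
            raise ValueError("truncated expansion field header")
        data_type = header[0]
        (data_len,) = struct.unpack(len_fmt, header[1:])
        if data_type == 0:
            return skipped
        if data_len < 0:
            raise ValueError("negative expansion field length")
        payload = buf.read(data_len)
        if len(payload) != data_len:
            raise ValueError("truncated expansion field payload")
        skipped += data_len
```

The same payload therefore occupies two more bytes per record in a 110-format file than in a 108-format file, which is why the reader has to branch on the version number here.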
In case it helps, here are the changes that I have determined between each format version (excluding changes to the display format codes) from looking at the available documentation:
102 (confirmed as undocumented but can be inferred from the next version, the Stata 1 manual and a "history of Stata" article)
- Data is stored in little-endian format
- Supports 2-byte and 4-byte integer variables
- Supports 4-byte and 8-byte floating-point variables
- Number of variables stored in 2-byte integer
- Number of observations stored in 2-byte integer
- Variable and value label names up to 9 characters (including null terminator)
- Data and variable labels up to 32 characters (including null terminator)
- Value labels up to 8 characters (null terminator is omitted if label is 8 characters)
- Variable format information up to 7 characters (including null terminator)
- Valid string characters are ASCII codes 1-127
- Single missing value supported for each variable type (.)
103 (documented in Stata 2 manual)
- Allow choice of little-endian or big-endian byte ordering
- Number of observations stored in 4-byte integer
- Added str1 to str80 string variable types
104 (documentation not yet located - probably in Stata 3 manual)
- Added byte variable type
105 (documented in Stata 4 and 5 manuals)
- Added 0 or 17 character time-stamp record stored in 18 characters (including null terminator)
- Added expansion fields with 2-byte integer records (used to store variable characteristics)
- Storage for variable format information increased to 12 characters (including null terminator)
108 (documented in Stata 6 manual)
- Valid string characters are ASCII codes 1-255
- Data and variable label length increased to 81 characters (including null terminator)
- Value label length is no longer fixed
- Underlying missing value code changed for double type
110 (documented in Stata 7 manual)
- Maximum variable and value label name length increased to 33 characters (including null terminator)
- Expansion record length field size increased from 2-byte to 4-byte integer
111 (documentation not found - maybe in Stata 7/SE manual if this exists)
- Variable type codes changed to increase maximum size string variable type to str244
113 (documented in on-line help)
- Maximum value range reduced for integer types
- Allow multiple missing value codes (., .a .. .z) to be supported
114 (documented in on-line help)
- Storage for variable format information increased to 49 characters (including null terminator)
115 (documented in on-line help)
- Same as 114 (version number increased due to introduction of %tb business date format)
117 (documented in on-line help)
- Tagged xml-style structure
- Timestamp and dataset labels store their length as first byte, and no longer include null terminator
- Stores 8-byte location map for each component of the file structure (after initial release the "varlabs" field was set to zero, but fixed in a subsequent update)
- Variable type codes changed to increase maximum fixed string variable type to str2045 and introduce strL types
- Removes generic expansion fields and replaces them with "characteristics" component
- GSO v field stored as 4-byte integer
- GSO o field stored as 4-byte integer
- (v,o) for strL variables packed as (4,4) bytes
118 (documented in on-line help)
- Number of observations stored in 8-byte integer
- Dataset label length stored in 2-byte integer
- Storage for variable format information increased to 57 characters (including null terminator)
- Strings are now stored as UTF-8, increasing allocation reserved for each character from one byte to four bytes (still null terminated with single character)
- GSO o field increased to 8-byte integer
- (v,o) for strL variables packed as (2,6) bytes
119 (documented in on-line help)
- Number of variables stored in 4-byte integer (as a consequence srtlist records increase from 2-byte to 4-byte integers)
- (v,o) for strL variables packed as (3,5) bytes
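The three (v,o) packings listed for formats 117, 118 and 119 can be sketched as a single lookup. This is only an illustration: the names `STRL_SPLIT` and `unpack_strl_ref` are hypothetical, and I am assuming that the v bytes come first in the 8-byte reference and that both fields use the file's byte order.

```python
# Bytes used for (v, o) in the 8-byte strL reference, per format version.
STRL_SPLIT = {117: (4, 4), 118: (2, 6), 119: (3, 5)}


def unpack_strl_ref(raw: bytes, version: int, byteorder: str = "little") -> tuple:
    """Split an 8-byte strL reference into its (v, o) pair.

    Assumption: v occupies the leading bytes and o the remaining ones.
    """
    v_size, o_size = STRL_SPLIT[version]
    if len(raw) != v_size + o_size:
        raise ValueError("strL reference must be exactly 8 bytes")
    v = int.from_bytes(raw[:v_size], byteorder)
    o = int.from_bytes(raw[v_size:], byteorder)
    return v, o
```

In each case the two sizes sum to 8, so the reference stays the same width in the data section while the balance between variable number and observation number shifts.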
120 (documented in on-line help)
- As 118, but adds alias variable type
121 (documented in on-line help)
- As 119, but adds alias variable type
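As a compact summary of the pre-117 entries above, the fixed field widths could be expressed as a function of the version number. This is a sketch derived only from the list in this comment, not from the reader code, and the names `FieldWidths`, `field_widths` and `OLD_FORMATS` are hypothetical. All lengths include the null terminator, as in the list.

```python
from typing import NamedTuple

# Versions covered by the list above that predate the tagged 117 structure.
OLD_FORMATS = (102, 103, 104, 105, 108, 110, 111, 113, 114, 115)


class FieldWidths(NamedTuple):
    nobs_bytes: int   # size of the "number of observations" integer
    name_len: int     # variable and value label names
    label_len: int    # data and variable labels
    format_len: int   # variable format information


def field_widths(version: int) -> FieldWidths:
    """Return the field widths implied by the list for an old-format version."""
    if version not in OLD_FORMATS:
        raise ValueError(f"not an old-format version: {version}")
    return FieldWidths(
        nobs_bytes=2 if version == 102 else 4,
        name_len=9 if version < 110 else 33,
        label_len=32 if version < 108 else 81,
        format_len=7 if version < 105 else (12 if version < 114 else 49),
    )
```

For example, `field_widths(110)` and `field_widths(111)` return the same widths, which matches the observation that those two formats differ only in their typlist encodings.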
Can you rebase and ping on green?
I have now rebased this, and all checks pass.
Thanks. LGTM.