BUG: Unable to open Stata 118 or 119 files saved in big-endian format that contain strL data
Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
# Both of the following lines fail (the data files are provided in the issue description)
df = pd.read_stata("stata12_be_118.dta")
df = pd.read_stata("stata12_be_119.dta")
Issue Description
If I attempt to open a 118 format file saved in big-endian format that contains strL data I get the following error:
>>> import pandas as pd
>>> df = pd.read_stata("stata12_be_118.dta")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Program Files\Python312\Lib\site-packages\pandas\io\stata.py", line 2109, in read_stata
    return reader.read()
           ^^^^^^^^^^^^^
  File "C:\Program Files\Python312\Lib\site-packages\pandas\io\stata.py", line 1775, in read
    data = self._insert_strls(data)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Program Files\Python312\Lib\site-packages\pandas\io\stata.py", line 1876, in _insert_strls
    data.isetitem(i, [self.GSO[str(k)] for k in data.iloc[:, i]])
                      ~~~~~~~~^^^^^^^^
KeyError: '844424930131969'
The same is true if I repeat this for a 119 format file:
>>> import pandas as pd
>>> df = pd.read_stata("stata12_be_119.dta")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Program Files\Python312\Lib\site-packages\pandas\io\stata.py", line 2109, in read_stata
    return reader.read()
           ^^^^^^^^^^^^^
  File "C:\Program Files\Python312\Lib\site-packages\pandas\io\stata.py", line 1775, in read
    data = self._insert_strls(data)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Program Files\Python312\Lib\site-packages\pandas\io\stata.py", line 1876, in _insert_strls
    data.isetitem(i, [self.GSO[str(k)] for k in data.iloc[:, i]])
                      ~~~~~~~~^^^^^^^^
KeyError: '3298534883329'
The equivalent 117 format file works fine:
>>> import pandas as pd
>>> df = pd.read_stata("stata12_be_117.dta")
>>> df
x y z
0 1.0 abc abcdefghi
1 3.0 cba qwertywertyqwerty
2 93.0 strl
This occurs because the lookup of the strL value fails: the key built from the strL records does not match the key stored in the main data. The reason is as follows:
strL content is stored separately from the main data, in records identified by a (v, o) pair (variable, observation). The main data references this pair to associate a particular string value with a position in the data. In format 117, v and o were each stored in 4 bytes, so the value stored in the main data matched the strL record exactly. Format 118 and later widened o to 8 bytes so that more observations can be held, but did not change the storage size of the reference in the main data. The reference is therefore packed: some of the high bytes of v and o are dropped so that both values fit in 8 bytes. Using letters to represent the bytes of v and digits to represent the bytes of o, this means that in little-endian format the (v, o) index:
(ABCD, 12345678) would be referenced in 118 by: AB123456 and in 119 by: ABC12345
In big-endian format: (DCBA, 87654321) would be referenced in 118 by: BA654321 and in 119 by: CBA54321
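To make the layout concrete, here is a minimal sketch that splits an 8-byte reference, as it would appear in the main data, back into its (v, o) parts. The helper name `split_strl_reference` is hypothetical and not part of pandas. It assumes, following the notation above, that the v bytes come first and the o bytes second in both byte orders.

```python
def split_strl_reference(raw: bytes, format_version: int, byteorder: str) -> tuple[int, int]:
    """Split an 8-byte packed strL reference into (v, o).

    Hypothetical helper for illustration only; not pandas API.
    byteorder is "<" (little-endian) or ">" (big-endian).
    """
    if len(raw) != 8:
        raise ValueError("a packed strL reference is 8 bytes")
    # 118 keeps 2 bytes of v and 6 of o; 119 keeps 3 bytes of v and 5 of o.
    v_size = 2 if format_version == 118 else 3
    order = "little" if byteorder == "<" else "big"
    v = int.from_bytes(raw[:v_size], order)
    o = int.from_bytes(raw[v_size:], order)
    return v, o


# The key from the 118 traceback, laid out as 8 big-endian bytes.
raw_118 = (844424930131969).to_bytes(8, "big")
print(split_strl_reference(raw_118, 118, ">"))  # (3, 1)
```

Under these assumptions, the key in the 118 traceback decodes to variable 3, observation 1.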
When looking up values, pandas converts the (v, o) in each strL record into the packed form and treats it as an 8-byte integer, rather than expanding the values in the main data into separate 4- and 8-byte integers. The current code branch for little-endian gives the expected result:
>>> buf = 'ABCD12345678'
>>> v_size = 2 # 118 format
>>> buf[0:v_size] + buf[4 : (12 - v_size)]
'AB123456'
>>> v_size = 3 # 119 format
>>> buf[0:v_size] + buf[4 : (12 - v_size)]
'ABC12345'
however the big-endian path is incorrect:
>>> buf = 'DCBA87654321'
>>> v_size = 2 # 118 format
>>> buf[0:v_size] + buf[(4 + v_size) :]
'DC654321'
>>> v_size = 3 # 119 format
>>> buf[0:v_size] + buf[(4 + v_size) :]
'DCB54321'
it should instead be:
>>> buf = 'DCBA87654321'
>>> v_size = 2 # 118 format
>>> buf[4 - v_size:4] + buf[(4 + v_size) :]
'BA654321'
>>> v_size = 3 # 119 format
>>> buf[4 - v_size:4] + buf[(4 + v_size) :]
'CBA54321'
Once the packed value has been determined, it should be interpreted as an integer using the file's byte order, as is already done for the main data.
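Putting the corrected slicing and the byte order together, a minimal sketch of the key calculation might look like the following. The function name `packed_strl_key` is hypothetical and this is not the pandas implementation; it only assumes that each strL record carries v in 4 bytes followed by o in 8 bytes, as described above.

```python
import struct


def packed_strl_key(buf: bytes, format_version: int, byteorder: str) -> int:
    """Pack a 12-byte (v, o) strL record header into the 8-byte lookup key.

    Hypothetical helper for illustration only; not pandas API.
    buf holds v (4 bytes) followed by o (8 bytes); byteorder is "<" or ">".
    """
    v_size = 2 if format_version == 118 else 3
    if byteorder == "<":
        # Little-endian: the low-order bytes come first in each field.
        packed = buf[0:v_size] + buf[4 : (12 - v_size)]
    else:
        # Big-endian: the low-order bytes come last in each field.
        packed = buf[4 - v_size : 4] + buf[(4 + v_size) :]
    # Interpret the packed bytes in the file's byte order, not the host's.
    return struct.unpack(f"{byteorder}Q", packed)[0]


# strL record header for (v=3, o=1) as it would appear in a big-endian file.
buf = struct.pack(">IQ", 3, 1)
print(packed_strl_key(buf, 118, ">"))  # 844424930131969
print(packed_strl_key(buf, 119, ">"))  # 3298534883329
```

These are the same keys that the tracebacks above report as missing, which is consistent with the lookup table having been built from differently packed keys.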
Expected Behavior
I would expect the file to load successfully, as it does in Stata:
. use "stata12_be_118.dta"
. list
+------------------------------+
| x y z |
|------------------------------|
1. | 1 abc abcdefghi |
2. | 3 cba qwertywertyqwerty |
3. | 93 strl |
+------------------------------+
. use "stata12_be_119.dta"
. list
+------------------------------+
| x y z |
|------------------------------|
1. | 1 abc abcdefghi |
2. | 3 cba qwertywertyqwerty |
3. | 93 strl |
+------------------------------+
Installed Versions
INSTALLED VERSIONS
commit : d9cdd2ee5a58015ef6f4d15c7226110c9aab8140
python : 3.12.3.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19045
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_United Kingdom.1252
pandas : 2.2.2
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : 69.5.1
pip : 24.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 5.2.1
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.8.4
numba : None
numexpr : 2.10.0
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.13.0
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None