
BUG: Unable to open Stata 118 or 119 files saved in big-endian format that contain strL data

cmjcharlton opened this issue

Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of pandas.

  • [X] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
# Both of the following lines fail (the data files are provided in the issue description)
df = pd.read_stata("stata12_be_118.dta")
df = pd.read_stata("stata12_be_119.dta")

Issue Description

If I attempt to open a format 118 file saved in big-endian byte order that contains strL data, I get the following error:

>>> import pandas as pd
>>> df = pd.read_stata("stata12_be_118.dta")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Program Files\Python312\Lib\site-packages\pandas\io\stata.py", line 2109, in read_stata
    return reader.read()
           ^^^^^^^^^^^^^
  File "C:\Program Files\Python312\Lib\site-packages\pandas\io\stata.py", line 1775, in read
    data = self._insert_strls(data)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Program Files\Python312\Lib\site-packages\pandas\io\stata.py", line 1876, in _insert_strls
    data.isetitem(i, [self.GSO[str(k)] for k in data.iloc[:, i]])
                      ~~~~~~~~^^^^^^^^
KeyError: '844424930131969'

The same is true if I repeat this with a format 119 file:

>>> import pandas as pd
>>> df = pd.read_stata("stata12_be_119.dta")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Program Files\Python312\Lib\site-packages\pandas\io\stata.py", line 2109, in read_stata
    return reader.read()
           ^^^^^^^^^^^^^
  File "C:\Program Files\Python312\Lib\site-packages\pandas\io\stata.py", line 1775, in read
    data = self._insert_strls(data)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Program Files\Python312\Lib\site-packages\pandas\io\stata.py", line 1876, in _insert_strls
    data.isetitem(i, [self.GSO[str(k)] for k in data.iloc[:, i]])
                      ~~~~~~~~^^^^^^^^
KeyError: '3298534883329'
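As a quick sanity check (my own decoding, not pandas output): assuming the expected key holds v in the high bytes and o in the low bytes when read big-endian, both missing keys decode to the same small (v, o) pair, which is what a three-row test file would produce:

```python
# Format 118 packs o into 6 bytes (48 bits):
print(divmod(844424930131969, 2 ** 48))  # (3, 1)
# Format 119 packs o into 5 bytes (40 bits):
print(divmod(3298534883329, 2 ** 40))    # (3, 1)
```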

The equivalent format 117 file works fine:

>>> import pandas as pd
>>> df = pd.read_stata("stata12_be_117.dta")
>>> df
      x    y                  z
0   1.0  abc          abcdefghi
1   3.0  cba  qwertywertyqwerty
2  93.0                    strl

This occurs because the strL lookup fails: the key computed from the strL records does not match the key stored in the main data, for the following reason:

strL content is stored separately from the main data, in records identified by a (v, o) (variable, observation) pair. The main data then references these records to associate a particular string value with a position in the data. In format 117, v and o were each stored in 4 bytes, and the reference stored in the main data exactly matched the value in the strL records. Formats 118 and later widened o to 8 bytes, allowing more observations to be held in the data, but did not widen the 8-byte reference in the main data. The reference is therefore a packed value in which some of the high bytes of v and o are dropped so that both fit in 8 bytes: in format 118, v keeps 2 bytes and o keeps 6; in format 119, v keeps 3 bytes and o keeps 5. Using letters to represent the bytes of v and digits to represent the bytes of o, the (v, o) index:

(ABCD, 12345678) would be referenced in 118 by: AB123456 and in 119 by: ABC12345

In big-endian format: (DCBA, 87654321) would be referenced in 118 by: BA654321 and in 119 by: CBA54321
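The packing above can be sketched with a hypothetical helper (not pandas code), where v is packed into 4 bytes and o into 8 bytes in the file's byte order before the low bytes of each are kept:

```python
import struct

def packed_key_bytes(v: int, o: int, v_size: int, byteorder: str) -> bytes:
    # v is a 4-byte variable index, o an 8-byte observation index; the
    # packed reference keeps only the low v_size bytes of v and the low
    # (8 - v_size) bytes of o.
    fmt = "<" if byteorder == "little" else ">"
    v_bytes = struct.pack(fmt + "I", v)  # 4 bytes
    o_bytes = struct.pack(fmt + "Q", o)  # 8 bytes
    if byteorder == "little":
        # low bytes come first, so keep the leading slices
        return v_bytes[:v_size] + o_bytes[: 8 - v_size]
    # big-endian: low bytes come last, so keep the trailing slices
    return v_bytes[4 - v_size :] + o_bytes[v_size:]

# Big-endian format 118 reference for (v, o) = (3, 1); read as a
# big-endian integer it reproduces the first KeyError value above:
key = int.from_bytes(packed_key_bytes(3, 1, 2, "big"), "big")
print(key)  # 844424930131969
```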

When looking up values, pandas takes the approach of converting the (v, o) pair in the strL records into the packed form and treating it as an 8-byte integer, rather than expanding the packed value in the main data into separate 4-byte and 8-byte integers. The current code branch for little-endian files gives the expected result:

>>> buf = 'ABCD12345678'
>>> v_size = 2 # 118 format
>>> buf[0:v_size] + buf[4 : (12 - v_size)]
'AB123456'
>>> v_size = 3 # 119 format
>>> buf[0:v_size] + buf[4 : (12 - v_size)]
'ABC12345'

However, the big-endian path is incorrect:

>>> buf = 'DCBA87654321'
>>> v_size = 2 # 118 format
>>> buf[0:v_size] + buf[(4 + v_size) :]
'DC654321'
>>> v_size = 3 # 119 format
>>> buf[0:v_size] + buf[(4 + v_size) :]
'DCB54321'

It should instead be:

>>> buf = 'DCBA87654321'
>>> v_size = 2 # 118 format
>>> buf[4 - v_size:4] + buf[(4 + v_size) :]
'BA654321'
>>> v_size = 3 # 119 format
>>> buf[4 - v_size:4] + buf[(4 + v_size) :]
'CBA54321'

Once the packed value has been determined, the file's byte order should also be applied to it, as is done for the main data.
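Putting both fixes together, a sketch of the corrected strL-side key computation (a hypothetical helper mirroring the slicing shown above, not the actual pandas implementation) might look like:

```python
import struct

def gso_key(vo_bytes: bytes, v_size: int, byteorder: str) -> str:
    # vo_bytes: the 12-byte (v, o) pair from a strL record,
    # stored in the file's byte order.
    if byteorder == "little":
        packed = vo_bytes[:v_size] + vo_bytes[4 : 12 - v_size]
    else:
        # keep the LOW bytes of v, which trail in big-endian order
        packed = vo_bytes[4 - v_size : 4] + vo_bytes[4 + v_size :]
    # apply the file's byte order, as is done for the main data
    fmt = ("<" if byteorder == "little" else ">") + "Q"
    return str(struct.unpack(fmt, packed)[0])

vo = struct.pack(">IQ", 3, 1)  # big-endian (v, o) = (3, 1)
print(gso_key(vo, 2, "big"))   # '844424930131969' -- now matches the data
```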

stata12_be.zip

Expected Behavior

I would expect the file to load successfully, as it does in Stata:

. use "stata12_be_118.dta"

. list

     +------------------------------+
     |  x     y                   z |
     |------------------------------|
  1. |  1   abc           abcdefghi |
  2. |  3   cba   qwertywertyqwerty |
  3. | 93                      strl |
     +------------------------------+
. use "stata12_be_119.dta"

. list

     +------------------------------+
     |  x     y                   z |
     |------------------------------|
  1. |  1   abc           abcdefghi |
  2. |  3   cba   qwertywertyqwerty |
  3. | 93                      strl |
     +------------------------------+

Installed Versions

INSTALLED VERSIONS

commit : d9cdd2ee5a58015ef6f4d15c7226110c9aab8140
python : 3.12.3.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19045
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_United Kingdom.1252

pandas : 2.2.2
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : 69.5.1
pip : 24.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 5.2.1
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader : None
adbc-driver-postgresql : None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.8.4
numba : None
numexpr : 2.10.0
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.13.0
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None

cmjcharlton — May 08 '24 15:05