String encoding with read_por()?
I'm trying to read an SPSS POR file which has an unknown encoding. Unfortunately it doesn't seem to make any difference in the metadata when I adjust the encoding I give to read_por()?
>>> df1, meta1 = pyreadstat.read_por('ukr95002.por')
>>> meta1.column_labels[0:10]
['1. ����� ������', '1. ��-�����, �� �� ���� �����, ������ � ����� ������ � �����', '2. �������, ���� �����, ��� ��_���� ���������� ����', '2. �������, ���� �����, ��� ��_�����������', '2. �������, ���� �����, ��� ��_���� �� ��������', '2. �������, ���� �����, ��� ��_���� �� ���������', '2. �������, ���� �����, ��� ��_���� �� ����', '2. �������, ���� �����, ��� ��_��������� �����- ��������� �', '2. �������, ���� �����, ��� ��_���������������� ������', '2. �������, ���� �����, ��� ��_���������� ������� � �������']
>>> df2, meta2 = pyreadstat.read_por('ukr95002.por', encoding='WINDOWS-1251')
>>> meta2.column_labels[0:10]
['1. ����� ������', '1. ��-�����, �� �� ���� �����, ������ � ����� ������ � �����', '2. �������, ���� �����, ��� ��_���� ���������� ����', '2. �������, ���� �����, ��� ��_�����������', '2. �������, ���� �����, ��� ��_���� �� ��������', '2. �������, ���� �����, ��� ��_���� �� ���������', '2. �������, ���� �����, ��� ��_���� �� ����', '2. �������, ���� �����, ��� ��_��������� �����- ��������� �', '2. �������, ���� �����, ��� ��_���������������� ������', '2. �������, ���� �����, ��� ��_���������� ������� � �������']
>>> meta1.column_labels == meta2.column_labels
True
I expected the labels to be different, even if I didn't guess the correct encoding? I've attached the file in question in a ZIP.
I installed pyreadstat v1.2.6 with pip in a pipenv environment (Python 3.10.1) on macOS.
I think the behavior is correct. The ? symbols mean the text could not be decoded. If you change the encoding and still see the ?, it simply means the text could not be decoded with that encoding either.
Ok, that's good to know that it appears to be working. I'll try cycling through all known encodings and see if one of them works I guess...
Yes, that sounds like a good plan
Hmm, I cycled through all of these and could not find any that works. Maybe the file is so old that ReadStat is not reading it correctly. Maybe you could read it in SPSS and export it in a newer format? I think reading it in SPSS would also be a good way to check that the file is not corrupt.
Apparently the file is from Ukraine in mid-to-late 90s so I wonder if SPSS could even open it. That would be impressive if it was the case!
Yep, I was just looking in the hex editor, and x-cyrillic-mac and IBM866 can partially make sense of it, so it must be something like that. I tested these two and a few other Cyrillic encodings, and unfortunately none of them worked with pyreadstat.
I found this list of the encodings supported by iconv, the underlying library doing the conversion. For the European languages there are only a few ... you could try them all. I think if none of those work, then you are probably out of luck: https://www.gnu.org/software/libiconv/
Another option, depending on how important this is and how much error you can tolerate, is to open the file with a text/hex editor that supports those old Ukrainian encodings. You will then see the labels and their associated values in clear text, and you will have to assume they are in the same order as the variables and build the map manually.
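The hex-editor approach can also be tried in plain Python, outside pyreadstat. This is only a rough sketch, assuming you have already copied the raw bytes of one label out of the file. The helper rank_encodings, the candidate list, and the scoring rule are all made up for illustration and are not part of pyreadstat or ReadStat. It decodes the bytes with several Cyrillic codecs from the Python standard library and ranks them by the share of characters that are Russian or Ukrainian letters.

```python
# Hypothetical helper (not part of pyreadstat): rank candidate encodings
# for a raw byte string by how "Cyrillic" the decoded text looks.
CANDIDATES = ["cp866", "mac_cyrillic", "cp1251", "koi8_u", "koi8_r", "iso8859_5"]

# Russian and Ukrainian letters; anything else counts against a candidate.
LOWER = "абвгґдеєёжзиіїйклмнопрстуфхцчшщъыьэюя"
ALPHABET = set(LOWER + LOWER.upper())


def cyrillic_share(text):
    """Fraction of non-whitespace characters that are Russian/Ukrainian letters."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c in ALPHABET for c in chars) / len(chars)


def rank_encodings(raw, candidates=CANDIDATES):
    """Return (encoding, score, decoded_text) tuples, best score first."""
    results = []
    for enc in candidates:
        # errors="replace" keeps going when a byte is undefined in a codec.
        text = raw.decode(enc, errors="replace")
        results.append((enc, cyrillic_share(text), text))
    return sorted(results, key=lambda r: r[1], reverse=True)


# Stand-in bytes; in practice, paste bytes copied from the hex editor.
sample = "Привет мир".encode("cp866")
for enc, score, _text in rank_encodings(sample):
    print(f"{enc:12} {score:.2f}")
```

The third element of each tuple holds the decoded text, so the top candidates can be inspected by eye. A high score is only a hint: short labels, punctuation, and digits can mislead such a simple heuristic.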
I was looking at the ReadStat source code and now I think you were right from the beginning, there is apparently no encoding conversion enabled for por files, while it is for other file types, sav for instance: https://github.com/WizardMac/ReadStat/blob/4926250c8d7d8793153d7d8552a96f130eb68937/src/spss/readstat_por_read.c
I guess this is because other encodings are rare for por files.
Unfortunately, this means that ReadStat, and therefore pyreadstat, does not support other encodings for POR files. You could file an issue in ReadStat asking for it to be implemented, but I guess it is going to take a long while. Yet another question is whether iconv would support this specific Ukrainian encoding.
Oh, actually, maybe it is possible to transcode the labels in Python, now that you know they are currently UTF-8 and need to be translated to some Cyrillic encoding.
I was wondering about that too! But I wasn't quite sure if reading as utf8 into strings munged the data in such a way that it was no longer possible to decode them?
probably you are right again =(
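The concern can be illustrated without pyreadstat. Here is a small sketch using made-up byte strings rather than bytes from the actual file: once bytes that are not valid UTF-8 are decoded with replacement, they become U+FFFD and the original values are gone, so no later re-encoding in Python can recover them. Whether ReadStat substitutes characters in exactly this way for POR files is an assumption; the snippet only shows the general effect.

```python
# Two different words, encoded in a legacy Cyrillic code page (cp866).
raw_a = "ДОМ".encode("cp866")
raw_b = "КОТ".encode("cp866")

# Decoding as UTF-8 with replacement turns each of these invalid bytes
# into U+FFFD, the replacement character.
lossy_a = raw_a.decode("utf-8", errors="replace")
lossy_b = raw_b.decode("utf-8", errors="replace")

# Both words collapse to the same string, so the originals are unrecoverable.
print(lossy_a == lossy_b)

# By contrast, a byte-preserving decode (latin-1) can be undone later.
kept = raw_a.decode("latin-1")
recovered = kept.encode("latin-1").decode("cp866")
print(recovered == "ДОМ")
```

So transcoding after the fact only works if the reader hands back the original bytes (or a lossless mapping of them), which does not seem to be the case here.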
Beyond the encoding issue, I think the labels are not read correctly by ReadStat: for all labels I always see the same two bytes, repeated over and over, which does not make sense. This is before converting them to UTF-8. That byte sequence is not to be found in the file. Also, as you expected, SPSS cannot make sense of the labels either.
I tried with ReadStat itself, with the same results. Debugging it with gdb, I can see a string at one point (I tried to check whether it made sense in a Cyrillic encoding, and it did not, but I am not sure I am doing it right), but later, at line 523, it gets converted and the string is messed up, probably because the encoding is set to UTF-8 by default and for POR files there is currently no way to change it. So nothing can be done from the Python side at the moment. One could request that encoding conversion be added to ReadStat to see if that helps, but it may still not solve the issue in this particular case, as the right encoding seems to be elusive.
By debugging I got to the point where it is reading the bytes from the file, then I see some code where these bytes are forced to be transformed to utf-8 (this is hardcoded, not using iconv), so it seems the whole program is really designed to read utf8 only.
I think on my side I will remove the encoding option from read_por as it is ineffective.
Thank you for taking the time to dig all the way into this. It was much appreciated! Do you think it is worth opening a ticket in ReadStat? It is such a niche corner case.
Good question, I am not sure myself, so up to you.
On the one hand, it looks like a fair request, or at least one that should be documented as an issue (so that others would know in the future), particularly since you have a test file, which I guess is difficult to find.
On the other hand, it seems to be a rare case. If the effort to implement it is low, it may be worth doing, but it seems to me that it is more involved (I could not understand very well what is going on, and there seems to be a function that assumes the data is in UTF-8), and in that case they may never solve it.
In the new version 1.2.7 I have removed the encoding parameter from the read_por function and explained in the function's documentation that only UTF-8 POR files are supported. That is what I can do for now. If you (or somebody else) open a request in ReadStat, let me know.