
String encoding with read_por()?

Open edsu opened this issue 1 year ago • 17 comments

I'm trying to read an SPSS POR file which has an unknown encoding. Unfortunately it doesn't seem to make any difference in the metadata when I adjust the encoding I give to read_por()?

>>> df1, meta1 = pyreadstat.read_por('ukr95002.por')
>>> meta1.column_labels[0:10]
['1. ����� ������', '1. ��-�����, �� �� ���� �����, ������ � ����� ������ � �����', '2. �������, ���� �����, ��� ��_���� ����������   ����', '2. �������, ���� �����, ��� ��_�����������', '2. �������, ���� �����, ��� ��_���� �� ��������', '2. �������, ���� �����, ��� ��_���� �� ���������', '2. �������, ���� �����, ��� ��_���� �� ����', '2. �������, ���� �����, ��� ��_��������� �����-  ��������� �', '2. �������, ���� �����, ��� ��_����������������  ������', '2. �������, ���� �����, ��� ��_���������� ������� � �������']
>>> df2, meta2 = pyreadstat.read_por('ukr95002.por', encoding='WINDOWS-1251')
>>> meta2.column_labels[0:10]
['1. ����� ������', '1. ��-�����, �� �� ���� �����, ������ � ����� ������ � �����', '2. �������, ���� �����, ��� ��_���� ����������   ����', '2. �������, ���� �����, ��� ��_�����������', '2. �������, ���� �����, ��� ��_���� �� ��������', '2. �������, ���� �����, ��� ��_���� �� ���������', '2. �������, ���� �����, ��� ��_���� �� ����', '2. �������, ���� �����, ��� ��_��������� �����-  ��������� �', '2. �������, ���� �����, ��� ��_����������������  ������', '2. �������, ���� �����, ��� ��_���������� ������� � �������']
>>> meta1.column_labels == meta2.column_labels
True

I expected the labels to be different, even if I didn't guess the correct encoding? I've attached the file in question in a ZIP.

I installed pyreadstat v1.2.6 with pip in a pipenv environment (Python 3.10.1) on macOS.

ukr95002.zip

edsu avatar Feb 06 '24 16:02 edsu

I think the behavior is correct. The ? symbols mean it could not make sense of the text. If you change the encoding and still see the ?, it simply means it could not make sense of it with that encoding either.

ofajardo avatar Feb 15 '24 19:02 ofajardo

OK, it's good to know that it appears to be working. I'll try cycling through all known encodings and see if one of them works, I guess...

edsu avatar Feb 15 '24 19:02 edsu

Yes, that sounds like a good plan

ofajardo avatar Feb 15 '24 19:02 ofajardo

Hmm, I cycled through all of these and could not find any that works. Maybe the file is so old that ReadStat is not reading it correctly. Maybe try reading it in SPSS and exporting it in a newer format? I think reading it in SPSS would also be a good way to check that the file is not corrupt.

ofajardo avatar Feb 16 '24 12:02 ofajardo

Apparently the file is from Ukraine in the mid-to-late 90s, so I wonder if SPSS could even open it. That would be impressive if it could!

edsu avatar Feb 16 '24 16:02 edsu

Yep, I was just looking at it in a hex editor, and x-cyrillic-mac and IBM866 can partially make sense of it, so it must be something like that. I tested these two and a few other Cyrillic encodings, and unfortunately none of them worked with pyreadstat.

ofajardo avatar Feb 16 '24 16:02 ofajardo

I found this list of the encodings supported by iconv, the underlying library doing the conversion. Among the European languages there are only a few ... you could try them all. If none of those work, then you are probably out of luck: https://www.gnu.org/software/libiconv/

ofajardo avatar Feb 16 '24 16:02 ofajardo

Another option, depending on how important this is and how much error you can tolerate, is to open the file with a text/hex editor that supports those old Ukrainian encodings. You will then see the labels in clear text along with their associated values, and you will have to assume they are in the same order as the variables and build the map manually.

ofajardo avatar Feb 16 '24 17:02 ofajardo
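The hex-editor approach above can also be tried from Python without pyreadstat. The following is only a minimal sketch: it assumes the label bytes are stored in some legacy single-byte Cyrillic code page (which this thread does not confirm), and the candidate list, the scoring heuristic, and the names `CANDIDATES`, `cyrillic_score`, and `rank_encodings` are all hypothetical choices made for illustration. It decodes a byte string with several codecs from Python's standard library and ranks them by the share of characters that land in the Cyrillic block.

```python
# Candidate single-byte Cyrillic codecs shipped with Python (an assumed shortlist).
CANDIDATES = ["cp866", "mac_cyrillic", "cp1251", "koi8_u", "iso8859_5"]


def cyrillic_score(text: str) -> float:
    """Return the fraction of characters in the Unicode Cyrillic block (U+0400..U+04FF)."""
    if not text:
        return 0.0
    hits = sum(1 for ch in text if "\u0400" <= ch <= "\u04ff")
    return hits / len(text)


def rank_encodings(raw: bytes, candidates=CANDIDATES):
    """Decode raw with each candidate codec; return (codec, score, text) tuples, best first."""
    results = []
    for enc in candidates:
        # errors="replace" keeps going when a byte is undefined in a code page.
        text = raw.decode(enc, errors="replace")
        results.append((enc, cyrillic_score(text), text))
    return sorted(results, key=lambda item: item[1], reverse=True)


# Stand-in bytes for illustration; real input would be a slice copied from the file.
sample = "Украина".encode("cp866")
for enc, score, text in rank_encodings(sample):
    print(f"{enc:14s} {score:.2f} {text!r}")
```

A high score only says the result looks like Cyrillic text, not that it is the right reading, so the top candidates still need to be checked by someone who can read the language.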

I was looking at the ReadStat source code and now I think you were right from the beginning, there is apparently no encoding conversion enabled for por files, while it is for other file types, sav for instance: https://github.com/WizardMac/ReadStat/blob/4926250c8d7d8793153d7d8552a96f130eb68937/src/spss/readstat_por_read.c

I guess this is because other encodings are rare for por files.

Unfortunately, this means that ReadStat, and therefore pyreadstat, does not support other encodings for por files. You could file an issue in ReadStat asking for it to be implemented in the future, but I guess it is going to take a long while. Yet another question is whether iconv would support this specific Ukrainian encoding.

ofajardo avatar Feb 16 '24 20:02 ofajardo

Oh, actually, maybe it is possible to transcode the labels in Python, now that you know they are currently read as UTF-8 and need to be converted from some Cyrillic encoding.

ofajardo avatar Feb 16 '24 22:02 ofajardo

I was wondering about that too! But I wasn't quite sure if reading as utf8 into strings munged the data in such a way that it was no longer possible to decode them?

edsu avatar Feb 16 '24 22:02 edsu

probably you are right again =(

ofajardo avatar Feb 16 '24 22:02 ofajardo
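The concern raised above, that a string already read as UTF-8 may no longer be recoverable, can be illustrated in plain Python. This is a sketch under assumptions: it mimics a forced UTF-8 read with Python's `errors="replace"`, which is not necessarily what ReadStat's POR reader does internally, and the sample words and the helper name `decode_like_utf8` are made up for the demonstration.

```python
REPLACEMENT = "\ufffd"  # U+FFFD, the character usually rendered as a question-mark box


def decode_like_utf8(raw: bytes) -> str:
    """Decode bytes as UTF-8, substituting U+FFFD for anything invalid (assumed behavior)."""
    return raw.decode("utf-8", errors="replace")


# Two different labels, stored in a legacy Cyrillic code page (cp1251 chosen as an example).
label_a = "Привет".encode("cp1251")
label_b = "Пример".encode("cp1251")

garbled_a = decode_like_utf8(label_a)
garbled_b = decode_like_utf8(label_b)

# Different inputs collapse to the same run of U+FFFD, so the original bytes are gone.
print(garbled_a == garbled_b)

# Contrast: latin-1 maps every byte to exactly one code point, so the bytes
# survive the round trip and can still be decoded with the right code page later.
recovered = label_a.decode("latin-1").encode("latin-1").decode("cp1251")
print(recovered)
```

Once the invalid bytes have been replaced, no amount of re-encoding on the Python side brings them back, so the fix would have to happen before or during the read.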

Beyond the encoding issue, I think the labels are not read correctly by ReadStat: for all labels I always see the same two bytes, repeated over and over, which does not make sense. This is before converting them to UTF-8. That byte sequence is not found anywhere in the file. Also, as you expected, SPSS cannot make sense of the labels either.

ofajardo avatar Feb 17 '24 22:02 ofajardo

I tried with ReadStat itself, with the same results. Debugging it with gdb, I can see a string here (I tried to check whether it made sense with a Cyrillic encoding and it did not, but I am not sure I am doing it right), but later, at line 523, it gets converted and the string is messed up, probably because the encoding is set to UTF-8 by default and for por files there is currently no way to change it. So nothing can be done from the Python side at the moment. One could request that encoding conversion be added to ReadStat to see if that helps, but it may still not solve the issue in this particular case, as the right encoding seems to be elusive.

ofajardo avatar Feb 18 '24 09:02 ofajardo

By debugging I got to the point where it is reading the bytes from the file, then I see some code where these bytes are forced to be transformed to utf-8 (this is hardcoded, not using iconv), so it seems the whole program is really designed to read utf8 only.

I think on my side I will remove the encoding option from read_por as it is ineffective.

ofajardo avatar Feb 18 '24 13:02 ofajardo

Thank you for taking the time to dig all the way into this. It was much appreciated! Do you think it is worth opening a ticket in ReadStat? It is such a niche corner case.

edsu avatar Feb 20 '24 14:02 edsu

Good question, I am not sure myself, so up to you.

On the one hand, it looks like a fair request, or at least one that should be documented as an issue (so that others would know in the future), particularly since you have a test file, which I guess is difficult to find.

On the other hand, it seems to be a rare case. If the effort to implement it is low, it may be worth doing, but it seems to me that it is more involved (I could not understand very well what is going on, and there seems to be a function that assumes the data is in UTF-8), and in that case they may never solve it.

ofajardo avatar Feb 21 '24 10:02 ofajardo

In the new version 1.2.7, I have removed the encoding parameter from the read_por function and explained in the function's documentation that only UTF-8 por files are supported. That is what I can do for now. In case you (or somebody else) open a request in ReadStat, let me know.

ofajardo avatar Mar 14 '24 14:03 ofajardo