librdata

Add ability to specify text encoding or disable transcoding

Open ofajardo opened this issue 4 years ago • 13 comments

hi

It was reported here in pyreadr that trying to open this file raises the following error:

Unable to convert string to the requested encoding (invalid byte sequence)

i.e., RDATA_ERROR_CONVERT_BAD_STRING

Looking at the first 30 bytes of the file, I got the impression that the file is in CP1252 (maybe I am looking at a completely wrong place, I actually don't know how this file is structured):

RDX3\nX\n\x00\x00\x00\x03\x00\x03\x06\x01\x00\x03\x05\x00\x00\x00\x00\x06CP1252\x00
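
For reference, these 30 bytes can be decoded field by field. The sketch below assumes the stream has already been decompressed and follows the version 3 XDR serialization layout as I understand it: the `RDX3` magic, the `X` format marker, three big-endian integers (format version, writer R version, minimum reader R version), then a length-prefixed native encoding name. `parse_rdx3_header` and `unpack_r_version` are hypothetical helper names for illustration, not part of librdata or pyreadr.

```python
import struct

# The 30 bytes quoted above, copied verbatim.
HEADER = (
    b"RDX3\nX\n\x00\x00\x00\x03\x00\x03\x06\x01"
    b"\x00\x03\x05\x00\x00\x00\x00\x06CP1252\x00"
)


def unpack_r_version(packed):
    """Split a packed R version integer into (major, minor, patch)."""
    return (packed >> 16, (packed >> 8) & 0xFF, packed & 0xFF)


def parse_rdx3_header(buf):
    """Parse the start of an uncompressed RDX3 stream (illustrative helper)."""
    if buf[:5] != b"RDX3\n":
        raise ValueError("not an RDX3 stream")
    if buf[5:7] != b"X\n":
        raise ValueError("only the XDR (big-endian) variant is handled here")
    # Three big-endian 32-bit integers follow the format marker.
    fmt_version, writer, min_reader = struct.unpack_from(">iii", buf, 7)
    if fmt_version != 3:
        raise ValueError("the encoding field is assumed to exist only in version 3")
    # Version 3 adds a length-prefixed native encoding name.
    (name_len,) = struct.unpack_from(">i", buf, 19)
    encoding = buf[23:23 + name_len].decode("ascii")
    return {
        "format_version": fmt_version,
        "writer_version": unpack_r_version(writer),
        "min_reader_version": unpack_r_version(min_reader),
        "native_encoding": encoding,
    }


info = parse_rdx3_header(HEADER)
print(info)
```

Under these assumptions the header describes a file written by R 3.6.1 with a declared native encoding of CP1252, which matches the impression above.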

Looking at the source code I was expecting to get RDATA_ERROR_UNSUPPORTED_CHARSET instead. Maybe librdata is not extracting the encoding correctly for this file?

And actually, would it be possible to support non-UTF-8 files?

thanks!

ofajardo avatar Feb 01 '21 09:02 ofajardo

Hi, I will need an updated link to the test file, as it appears to have been deleted from Dropbox.

evanmiller avatar Mar 27 '21 13:03 evanmiller

Asking the original reporter to upload the file again ...

ofajardo avatar Mar 27 '21 15:03 ofajardo

One possibility is that the file self-reports as CP1252, but contains strings in another encoding. This would produce the BAD_STRING error.

evanmiller avatar Mar 28 '21 13:03 evanmiller

Here is the file:

https://github.com/ofajardo/readstat_test_files/blob/master/tip2020.rda

ofajardo avatar Mar 29 '21 13:03 ofajardo

Debugging a bit I am seeing this 11-byte hex string stored in a string vector:

\x81\x84\xe3\x81\x84\xe3\x81\xad\x5e\x5f\x5e

Not sure what this is supposed to be, but \x81 is unused by Code Page 1252. As a workaround I can add //IGNORE to the iconv command to skip unrecognized characters, but this might produce unexpected output.
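
A minimal Python sketch of what happens with these 11 bytes, for illustration only. Python's `errors="ignore"` is used here as a rough analogue of iconv's `//IGNORE`, which is an assumption: the two are not guaranteed to behave identically. `try_decode` is a hypothetical helper name.

```python
# The 11 bytes quoted above, copied verbatim.
RAW = b"\x81\x84\xe3\x81\x84\xe3\x81\xad\x5e\x5f\x5e"


def try_decode(raw, encoding, errors="strict"):
    """Return the decoded string, or None if the bytes are invalid."""
    try:
        return raw.decode(encoding, errors)
    except UnicodeDecodeError:
        return None


# Strict CP1252 fails because 0x81 has no mapping in that code page.
strict_cp1252 = try_decode(RAW, "cp1252")

# Skipping bad bytes succeeds, but the result is mojibake.
lossy_cp1252 = try_decode(RAW, "cp1252", errors="ignore")

# Strict UTF-8 also fails: the string starts with two continuation bytes.
strict_utf8 = try_decode(RAW, "utf-8")

# Dropping those two leading bytes leaves valid UTF-8.
tail_utf8 = try_decode(RAW[2:], "utf-8")

print(repr(strict_cp1252), repr(lossy_cp1252), repr(strict_utf8), repr(tail_utf8))
```

In this sketch the last nine bytes are valid UTF-8 (two hiragana characters followed by `^_^`), and the two leading bytes look like the tail of a truncated three-byte sequence. That would be consistent with UTF-8 text that was cut mid-character, though it is only one possible reading.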

evanmiller avatar Mar 29 '21 14:03 evanmiller

Looking through the file, the strings look like nonsense, so I am wondering if the real encoding is something non-ASCII-based. It would help to have more information about where this file came from.

evanmiller avatar Mar 29 '21 14:03 evanmiller

@69hed could you please provide more information on how this file was generated/where it comes from?

Looking at it in R, it looks OK. Interestingly, R reports the encoding as "unknown" for most character values, but some of them are marked as UTF-8 (see the arrow in the image). There are also a few nonsense values.

image

Looking at the content, my guess would be that it comes from an online survey/feedback webpage where the user is allowed to type or copy-paste anything, which gives you inconsistent encodings within the same field (I have seen such situations before) ...

ofajardo avatar Mar 29 '21 14:03 ofajardo

More examples of values in the "text" column with international characters. Some values appear to have only ASCII characters:

[954] "$7 Saké Wed Nights"
[955] "Two visits, two phenomenal sandwiches. The seasonal jalapeño with corn crema and the egg roll were perfect. Love this place!"
[956] "Does mot spécialisé in iced tea"
[957] "Sitzplatzempfehlung für freien Blick zur Bühne Tisch 12 Platz 1&2"
[958] "Je hungré fo some frieeesss!"
[959] "Coupon in VEGAS2GO® guide offers free ticket with the purchase of one. There is also an active Yelp Deal that offers a (not as good) discount if you cannot locate a guide. - E"
[960] "服务员是真的傻逼,加个座位会挡住你走路,怕是挡住你的棺材路哦,纯几把傻"

ofajardo avatar Mar 29 '21 14:03 ofajardo

@ofajardo The additional context helps - I guess it will be mostly UTF-8 even though the file header indicates CP1252. I'm not sure what the correct behavior is on the librdata side. Maybe provide an encoding override or the ability to request no recoding (similar to the ReadStat API).

evanmiller avatar Mar 29 '21 14:03 evanmiller

I think that makes sense

ofajardo avatar Mar 29 '21 15:03 ofajardo

@ofajardo All right - I will change this issue to an "enhancement" and leave it open since the library is currently behaving as expected for the provided file.

evanmiller avatar Mar 29 '21 15:03 evanmiller

thanks!

ofajardo avatar Mar 29 '21 15:03 ofajardo

My personal preference would be to allow specifying the encoding (I think that's what ReadStat does?), because on the Python side I am expecting UTF-8. The user could then loop through a bunch of encodings to see which one does the job.
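
A rough sketch of that loop, operating on raw bytes in plain Python. It does not use any pyreadr or librdata API: `decode_with_candidates` is a hypothetical helper, and the candidate list is an assumption chosen for this file.

```python
def decode_with_candidates(raw, candidates=("utf-8", "cp1252")):
    """Return (text, encoding) for the first candidate that decodes cleanly.

    Returns (None, None) when no candidate accepts the bytes.
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    return None, None


# Valid UTF-8 is accepted by the first candidate.
print(decode_with_candidates("jalapeño".encode("utf-8")))

# A lone 0xE9 is invalid UTF-8, so the CP1252 candidate is used instead.
print(decode_with_candidates(b"Sak\xe9"))

# The 11-byte string from earlier in this thread is rejected by both.
print(decode_with_candidates(b"\x81\x84\xe3\x81\x84\xe3\x81\xad\x5e\x5f\x5e"))
```

Order matters here: UTF-8 goes first because it rejects most byte sequences that are not really UTF-8, whereas single-byte code pages such as CP1252 accept almost anything and would mask the real encoding if tried first.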

ofajardo avatar Mar 29 '21 15:03 ofajardo