Handle invalid multi-byte sequences in iconv encoding conversions
Hi @evanmiller,
This PR is another go at #252 (cc tidyverse/haven#615).
As suggested I've reworked the bad byte handler so it's called from surrounding code when a conversion fails instead of processing bytes inside readstat_convert(). I've implemented the handler in the string conversion code in sav_process_row() as an example, and there are a handful of bad byte handlers in readstat_convert.c.
The handler signature includes the src and dst variables from readstat_convert() as well as the observation index and variable object for error logging.
Thanks!
Hi @evanmiller! Just wondering if you've had a chance to look at this?
Would you prefer that I finalise the full PR before reviewing? As noted I've only implemented the handler in the code for processing sav string values but I can chuck it in to the others as well and non-draft the PR if that's easier.
Thanks!
Hi, I like the overall approach. Maybe rename to invalid_string instead of bad_byte? My thinking is that the handlers are processing the entire strings rather than individual bytes.
Thanks! Good call, shall do 🙂
I'll make that change then add the handler in to the other reader functions and update the PR.
I've renamed the handler to invalid_string and added the handler into the row processing code for each of the readers, so should be good to go (i.e. ready for checking :stuck_out_tongue:)!
Ok, looking good. Since this introduces a new API I will save it for the next minor version bump (ReadStat 1.2). Can you add some documentation (at least to the header file) about how C clients can use the new machinery? I don't think it's really obvious.
I'm also not sure that USER_ABORT is the right return code here since most of the handlers are system-provided rather than user-provided. Maybe a failed conversion should return BAD_STRING or perhaps a new error code.
No worries, I'll add a comment in the header. Given the weirdness it probably warrants a write-up with an example in the "Reading files" section of the README for some context. Do you think it needs a full example or is that unnecessary bloat?
Fair call re: the error message. I think READSTAT_ERROR_CONVERT_BAD_STRING is still appropriate given that it's still essentially an invalid byte sequence error, but I can see value in signalling the error differently when it has failed after going through the handler. What about a READSTAT_ERROR_INVALID_STRING_HANDLERwith an error message like "Unable to convert string to the requested encoding (could not recover from invalid byte sequence in the invalid string handler)"?
Hi @evanmiller, just wondering if you had any feedback on the last comment?
This is the way I'm leaning:
- As well as the header comment, also add a new subsection to the README with a brief example
- Just return
READSTAT_ERROR_CONVERT_BAD_STRINGinstead ofUSER_ABORTif the handler returnsREADSTAT_HANDLER_ABORTinstead of adding a new error type.
Sorry for nudging...just curious about this bug and its fix, since it's been quite for some time now. I got hit again by the "invalid byte sequence" bug for another of my .sav files where currently the only solution is to switch back to R package haven with version <= 2.3.1.