ReadStat icon indicating copy to clipboard operation
ReadStat copied to clipboard

Handle invalid multi-byte sequences in iconv encoding conversions

Open gorcha opened this issue 3 years ago • 9 comments

Hi @evanmiller,

This PR is another go at #252 (cc tidyverse/haven#615).

As suggested I've reworked the bad byte handler so it's called from surrounding code when a conversion fails instead of processing bytes inside readstat_convert(). I've implemented the handler in the string conversion code in sav_process_row() as an example, and there are a handful of bad byte handlers in readstat_convert.c.

The handler signature includes the src and dst variables from readstat_convert() as well as the observation index and variable object for error logging.

Thanks!

gorcha avatar Feb 06 '22 10:02 gorcha

Hi @evanmiller! Just wondering if you've had a chance to look at this?

Would you prefer that I finalise the full PR before reviewing? As noted I've only implemented the handler in the code for processing sav string values but I can chuck it in to the others as well and non-draft the PR if that's easier.

Thanks!

gorcha avatar Mar 21 '22 12:03 gorcha

Hi, I like the overall approach. Maybe rename to invalid_string instead of bad_byte? My thinking is that the handlers are processing the entire strings rather than individual bytes.

evanmiller avatar Mar 21 '22 14:03 evanmiller

Thanks! Good call, shall do 🙂

I'll make that change then add the handler in to the other reader functions and update the PR.

gorcha avatar Mar 21 '22 23:03 gorcha

I've renamed the handler to invalid_string and added the handler into the row processing code for each of the readers, so should be good to go (i.e. ready for checking :stuck_out_tongue:)!

gorcha avatar Mar 22 '22 11:03 gorcha

Ok, looking good. Since this introduces a new API I will save it for the next minor version bump (ReadStat 1.2). Can you add some documentation (at least to the header file) about how C clients can use the new machinery? I don't think it's really obvious.

I'm also not sure that USER_ABORT is the right return code here since most of the handlers are system-provided rather than user-provided. Maybe a failed conversion should return BAD_STRING or perhaps a new error code.

evanmiller avatar Mar 22 '22 12:03 evanmiller

No worries, I'll add a comment in the header. Given the weirdness it probably warrants a write-up with an example in the "Reading files" section of the README for some context. Do you think it needs a full example or is that unnecessary bloat?

Fair call re: the error message. I think READSTAT_ERROR_CONVERT_BAD_STRING is still appropriate given that it's still essentially an invalid byte sequence error, but I can see value in signalling the error differently when it has failed after going through the handler. What about a READSTAT_ERROR_INVALID_STRING_HANDLERwith an error message like "Unable to convert string to the requested encoding (could not recover from invalid byte sequence in the invalid string handler)"?

gorcha avatar Mar 22 '22 12:03 gorcha

Hi @evanmiller, just wondering if you had any feedback on the last comment?

This is the way I'm leaning:

  • As well as the header comment, also add a new subsection to the README with a brief example
  • Just return READSTAT_ERROR_CONVERT_BAD_STRING instead of USER_ABORT if the handler returns READSTAT_HANDLER_ABORT instead of adding a new error type.

gorcha avatar May 16 '22 05:05 gorcha

Sorry for nudging...just curious about this bug and its fix, since it's been quite for some time now. I got hit again by the "invalid byte sequence" bug for another of my .sav files where currently the only solution is to switch back to R package haven with version <= 2.3.1.

deschen1 avatar Sep 13 '22 18:09 deschen1