odbc icon indicating copy to clipboard operation
odbc copied to clipboard

Encode non-ASCII Column Names to UTF-8

Open shrektan opened this issue 4 years ago • 4 comments

Closes #430 Closes #432

Code

x <- DBI::dbGetQuery(con, "select '汉语' as '中文'")
x
Encoding(x[[1]])
Encoding(colnames(x))

DBI::dbGetQuery(con, "select '汉语' from '中文'”)

Before the PR

image

After the PR

image

shrektan avatar Jan 14 '21 08:01 shrektan

Looks good -

Should we add some documentation in odbc_result.h? Character enconding always confuses the hell out of me.

detule avatar Jan 14 '21 20:01 detule

@shrektan Thanks for writing up the documentation; I was able to follow along better. Nice work again.

This is very minor, but I was wondering about this:

Rather than introducing to_utf8 where we guard the usage of output_encoder_ with that if c_->encoding() == "" statement; how about we always call output_encoder_, unconditionally - both in assign_string, as well as in column_names?

As far as I can tell in Iconv.cpp, both output_encoder_.makeSEXP and output_encoder_.makeString are pretty smart / should just pass through to Rf_mkCharLenCE(...., CE_UTF8) if no conversion is needed (UTF8->UTF8). The latter one, output_encoder_.makeString might be easiest to use in column_names and would perhaps allow you to stay with the current data types.

We would probably need something like this in the constructor:

output_encoder_(Iconv(c_->encoding() == "" ? "UTF-8" : c_->encoding(), "UTF-8"))

The way I see it, this would save us from having character encoding helper methods both in odbc_result as well as the object in Iconv.

What do you think?

detule avatar Jan 16 '21 20:01 detule

@shrektan any thought on @detule's last comment above? I have to do a odbc release to fix a compilation warning for CRAN, it would be nice to get this in there as well, but it has to be submitted by tomorrow. If we can't get it done by then it can wait until the next CRAN release, so it is not a huge deal if you don't have time to work on it right now.

jimhester avatar Feb 25 '21 20:02 jimhester

@jimhester Unfortunately, I don't have time for this right now. Let's merge this in the next CRAN release.

shrektan avatar Feb 26 '21 16:02 shrektan

Hey @shrektan

Sorry this has been out here for so long; you did some excellent work here.

Hope you don't mind - I tried to see what the changes would look like if, rather than adding a to_utf8 routine in odbc_result, we just unconditionally funneled the strings to output_encoder, and let it decide if any conversion is needed.

What do you think?

Nice work again.

detule avatar Jan 14 '23 04:01 detule

@detule It's great to see that these issues are starting to be addressed gradually. I've briefly reviewed your approach, and it looks good to me. Thanks.

shrektan avatar Jan 14 '23 07:01 shrektan