odbc
odbc copied to clipboard
Encode non-ASCII Column Names to UTF-8
Closes #430 Closes #432
Code
x <- DBI::dbGetQuery(con, "select '汉语' as '中文'")
x
Encoding(x[[1]])
Encoding(colnames(x))
DBI::dbGetQuery(con, "select '汉语' from '中文'”)
Before the PR

After the PR

Looks good -
Should we add some documentation in odbc_result.h
? Character enconding always confuses the hell out of me.
@shrektan Thanks for writing up the documentation; I was able to follow along better. Nice work again.
This is very minor, but I was wondering about this:
Rather than introducing to_utf8
where we guard the usage of output_encoder_
with that if c_->encoding() == ""
statement; how about we always call output_encoder_
, unconditionally - both in assign_string
, as well as in column_names
?
As far as I can tell in Iconv.cpp
, both output_encoder_.makeSEXP
and output_encoder_.makeString
are pretty smart / should just pass through to Rf_mkCharLenCE(...., CE_UTF8)
if no conversion is needed (UTF8->UTF8). The latter one, output_encoder_.makeString
might be easiest to use in column_names
and would perhaps allow you to stay with the current data types.
We would probably need something like this in the constructor:
output_encoder_(Iconv(c_->encoding() == "" ? "UTF-8" : c_->encoding(), "UTF-8"))
The way I see it, this would save us from having character encoding helper methods both in odbc_result
as well as the object in Iconv
.
What do you think?
@shrektan any thought on @detule's last comment above? I have to do a odbc release to fix a compilation warning for CRAN, it would be nice to get this in there as well, but it has to be submitted by tomorrow. If we can't get it done by then it can wait until the next CRAN release, so it is not a huge deal if you don't have time to work on it right now.
@jimhester Unfortunately, I don't have time for this right now. Let's merge this in the next CRAN release.
Hey @shrektan
Sorry this has been out here for so long; you did some excellent work here.
Hope you don't mind - I tried to see what the changes would look like if, rather than adding a to_utf8
routine in odbc_result
, we just unconditionally funneled the strings to output_encoder
, and let it decide if any conversion is needed.
What do you think?
Nice work again.
@detule It's great to see that these issues are starting to be addressed gradually. I've briefly reviewed your approach, and it looks good to me. Thanks.