duckdb-r icon indicating copy to clipboard operation
duckdb-r copied to clipboard

feat: Throw exception when non-utf8 characters are in a data.frame

Open Tmonster opened this issue 1 year ago • 6 comments

The duckdb engine assumes all strings are valid utf-8. In the R-client, we forgot to check if strings were in fact utf8. Here we check them when the scanning the df when we register it.

Closes #12.

Tmonster avatar Sep 22 '23 13:09 Tmonster

Welcome to Codecov :tada:

Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests.

Thanks for integrating Codecov - We've got you covered :open_umbrella:

codecov-commenter avatar Sep 22 '23 13:09 codecov-commenter

Thanks. Are there any other locations where we need to care about this? Thinking mainly about parameter binding via dbBind() or dbSendQuery(params = ...) .

krlmlr avatar Nov 08 '23 12:11 krlmlr

Coming back to this.

What are the costs of this preemptive check? I assume that the code must read the entire string? What are the consequences of avoiding this check?

Example for constructing a broken string:

bork <- iconv("börk", to = "latin1")
Encoding(bork) <- "UTF-8"
bork
#> [1] "b\xf6rk"

Created on 2024-03-11 with reprex v2.1.0

krlmlr avatar Mar 11 '24 04:03 krlmlr

There's a faster way:

x <- "Est\xe2ncia"
Encoding(x)
#> [1] "unknown"
x <- "Estância"
Encoding(x)
#> [1] "UTF-8"
x <- enc2utf8("Estância")
Encoding(x)
#> [1] "UTF-8"
Encoding(iconv("Est\xe2ncia", from = "latin1", to = "UTF-8"))
#> [1] "UTF-8"

Created on 2024-03-21 with reprex v2.1.0

This shows different results when ran line by line in the RStudio IDE: the second example also yields "unknown". I propose to do the fast check first, and only use utf8proc if that returns "unknown".

krlmlr avatar Mar 21 '24 07:03 krlmlr

I agree, I think the fast check first is better. Otherwise we have to copy the whole string. Will work on this now

Tmonster avatar Mar 26 '24 12:03 Tmonster

I actually wonder now if the issue is in the function enc2utf8. In register.R we encode the values of the data frame in the function encode_values(). enc2utf8 is applied to all character columns in the data frame that is passed. Playing around with the enc2utf8 function, there seem to be some inconsistencies when the encoding of the passed string is unknown.

> enc2utf8("Est\xe2ncia")
# [1] "Est\xe2ncia"
> iconv(enc2utf8("Est\xe2ncia"), from="latin1", to="UTF-8")
# [1] "Estância"
> iconv(enc2utf8("Est\xe2ncia"), from="UTF-8", to="UTF-8")
# [1] NA
> iconv(enc2utf8("hello"), from="UTF-8", to="UTF-8")
# [1] "hello"
> iconv(enc2utf8("hello"), from="latin1", to="UTF-8")
# [1] "hello"

Now I just check that Encoding(x) is valid for every value of a varcher column. This seems like a lot of overhead though, so open to other ideas.

EDIT: changes for readability

Tmonster avatar Mar 26 '24 14:03 Tmonster