RPostgreSQL
RPostgreSQL copied to clipboard
Encoding of strings is not set properly by dbGetQuery()
What steps will reproduce the problem?
1. Configure PostgreSQL client and server to use UTF-8 encoding (this is the
default). To verify,
dbGetQuery(con_master, "SHOW server_encoding");
# "UTF-8"
dbGetQuery(con_master, "SHOW client_encoding");
# "UTF-8"
2. Create a UTF-8 string. One method is to start with a latin1 string created
in Microsoft Word, then converted to UTF-8 via iconv():
a <- "Test UTF-8: ±€£¥©®™≠≤≥÷×∞µαβπΩ∑"
aa <- iconv(a, from="latin1", to="UTF-8");
Encoding(aa);
# "UTF-8"
3. Write this string to a PostgreSQL table using dbGetQuery or dbWriteTable.
For example:
b <- data.frame(col1=c("simple string", aa), col2=c(1,2), stringsAsFactors=FALSE);
Encoding(b$col1)
# [1] "unknown" "UTF-8"
dbWriteTable(con, "junk", b, overwrite=TRUE, append=FALSE);
4. Read the table back using dbGetQuery for example:
c <- dbGetQuery(con, "SELECT * FROM junk");
If you check the encoding you will see that it is set to "unknown"
Encoding(c);
# [1] "unknown" "unknown"
Note that this is incorrect - the second string should have been "UTF-8"
To fix this, you have to explicitly set the encoding on every string returned
by every PostgreSQL command that returns strings from the database. In the
above example you must follow the dbGetQuery() with:
Encoding(c$col1) <- "UTF-8"
Now the data.frame will be processed correctly.
What version of the product are you using? On what operating system?
Windows 7, R 2.9.2, RPostgreSQL 0.4, working inside Eclipse IDE 4.2.1 with
StatET
Please provide any additional information below.
Original issue reported on code.google.com by [email protected]
on 24 May 2013 at 12:00
What is the status of this? This is still an issue.
library(RPostgreSQL)
#> Lade nötiges Paket: DBI
(x_df <- data.frame(x = c("ä", "ö", "ü"), stringsAsFactors = FALSE))
#> x
#> 1 ä
#> 2 ö
#> 3 ü
pg_con <-
dbConnect(
RPostgreSQL::PostgreSQL(),
# my credentials
)
dbGetQuery(pg_con, "SHOW server_encoding")
#> server_encoding
#> 1 UTF8
dbWriteTable(pg_con, "x_df", x_df, overwrite = TRUE)
#> [1] TRUE
(x_db <- dbReadTable(pg_con, "x_df"))
#> x
#> 1 ä
#> 2 ö
#> 3 ü
Encoding(x_db$x)
#> [1] "unknown" "unknown" "unknown"
Encoding(x_db$x) <- "UTF-8"
x_db
#> x
#> 1 ä
#> 2 ö
#> 3 ü
dbDisconnect(pg_con)
#> [1] TRUE
Same problem applies to dbGetQuery()
See also https://github.com/rstats-db/DBI/issues/116
Session info
devtools::session_info()
#> Session info -------------------------------------------------------------
#> setting value
#> version R version 3.4.1 (2017-06-30)
#> system x86_64, mingw32
#> ui RTerm
#> language (EN)
#> collate German_Germany.1252
#> tz Europe/Berlin
#> date 2017-09-11
#> Packages -----------------------------------------------------------------
#> package * version date source
#> backports 1.1.0 2017-05-22 CRAN (R 3.4.0)
#> base * 3.4.1 2017-06-30 local
#> compiler 3.4.1 2017-06-30 local
#> datasets * 3.4.1 2017-06-30 local
#> DBI * 0.7 2017-06-18 CRAN (R 3.4.0)
#> devtools 1.13.3 2017-08-02 CRAN (R 3.4.1)
#> digest 0.6.12 2017-01-27 CRAN (R 3.3.2)
#> evaluate 0.10.1 2017-06-24 CRAN (R 3.4.0)
#> graphics * 3.4.1 2017-06-30 local
#> grDevices * 3.4.1 2017-06-30 local
#> htmltools 0.3.6 2017-04-28 CRAN (R 3.4.0)
#> knitr 1.17 2017-08-10 CRAN (R 3.4.1)
#> magrittr 1.5 2014-11-22 CRAN (R 3.3.0)
#> memoise 1.1.0 2017-05-29 Github (hadley/memoise@e372cde)
#> methods * 3.4.1 2017-06-30 local
#> Rcpp 0.12.12 2017-07-15 CRAN (R 3.4.1)
#> rmarkdown 1.6 2017-06-15 CRAN (R 3.4.0)
#> RPostgreSQL * 0.6-2 2017-06-24 CRAN (R 3.4.0)
#> rprojroot 1.2 2017-01-16 CRAN (R 3.3.2)
#> stats * 3.4.1 2017-06-30 local
#> stringi 1.1.5 2017-04-07 CRAN (R 3.3.3)
#> stringr 1.2.0 2017-02-18 CRAN (R 3.3.3)
#> tools 3.4.1 2017-06-30 local
#> utils * 3.4.1 2017-06-30 local
#> withr 2.0.0 2017-09-07 Github (jimhester/withr@eff4818)
#> yaml 2.1.14 2016-11-12 CRAN (R 3.3.2)
It is possible of course, that the server encoding is not UTF-8. A somewhat hacky solution to account for this would be to force the client encoding in R
to UTF-8, with postgresqlExecStatement("SET client_encoding = 'UTF-8'")
for example. Since R
may not support the the server encoding, "UTF-8" would be the common denominator.
Server encoding have nothing to do with this issue. Commit ca5182770aeab3950ef21d93029e3cac1be65e1a sets the encoding to the received text data according to the client encoding at the transfer.
I am testing since yesterday, looking good so far.
Still looking good, this can be closed, I think. Do you have plans to release an updated version to CRAN in the near future?