RPostgreSQL icon indicating copy to clipboard operation
RPostgreSQL copied to clipboard

Encoding of strings is not set properly by dbGetQuery()

Open GoogleCodeExporter opened this issue 9 years ago • 5 comments

What steps will reproduce the problem?
1. Configure PostgreSQL client and server to use UTF-8 encoding (this is the 
default).  To verify, 
    dbGetQuery(con_master, "SHOW server_encoding");
       # "UTF-8"
    dbGetQuery(con_master, "SHOW client_encoding");
       # "UTF-8"

2. Create a UTF-8 string.  One method is to start with a latin1 string created 
in Microsoft Word, then converted to UTF-8 via iconv():
    a <- "Test UTF-8: ±€£¥©®™≠≤≥÷×∞µαβπΩ∑"
    aa <- iconv(a, from="latin1", to="UTF-8");
    Encoding(aa);
        # "UTF-8"

3. Write this string to a PostgreSQL table using dbGetQuery or dbWriteTable.  
For example:
    b <- data.frame(col1=c("simple string", aa), col2=c(1,2), stringsAsFactors=FALSE);
    Encoding(b$col1)
        # [1] "unknown" "UTF-8"  
    dbWriteTable(con, "junk", b, overwrite=TRUE, append=FALSE);

4. Read the table back using dbGetQuery for example:
    c <- dbGetQuery(con, "SELECT * FROM junk");

If you check the encoding you will see that it is set to "unknown"
    Encoding(c);
        # [1] "unknown" "unknown"

Note that this is incorrect - the second string should have been "UTF-8"

To fix this, you have to explicitly set the encoding on every string returned 
by every PostgreSQL command that returns strings from the database.  In the 
above example you must follow the dbGetQuery() with:

    Encoding(c$col1) <- "UTF-8"

Now the data.frame will be processed correctly.


What version of the product are you using? On what operating system?

Windows 7, R 2.9.2, RPostgreSQL 0.4, working inside Eclipse IDE 4.2.1 with 
StatET

Please provide any additional information below.

Original issue reported on code.google.com by [email protected] on 24 May 2013 at 12:00

GoogleCodeExporter avatar Mar 19 '15 12:03 GoogleCodeExporter

What is the status of this? This is still an issue.

library(RPostgreSQL)
#> Lade nötiges Paket: DBI

(x_df <- data.frame(x = c("ä", "ö", "ü"), stringsAsFactors = FALSE))
#>   x
#> 1 ä
#> 2 ö
#> 3 ü

pg_con <- 
  dbConnect(
    RPostgreSQL::PostgreSQL(),
    # my credentials
  )

dbGetQuery(pg_con, "SHOW server_encoding")
#>   server_encoding
#> 1            UTF8
dbWriteTable(pg_con, "x_df", x_df, overwrite = TRUE)
#> [1] TRUE
(x_db <- dbReadTable(pg_con, "x_df"))
#>    x
#> 1 ä
#> 2 ö
#> 3 ü
Encoding(x_db$x)
#> [1] "unknown" "unknown" "unknown"
Encoding(x_db$x) <- "UTF-8"
x_db
#>   x
#> 1 ä
#> 2 ö
#> 3 ü

dbDisconnect(pg_con)
#> [1] TRUE

Same problem applies to dbGetQuery()

See also https://github.com/rstats-db/DBI/issues/116

Session info
devtools::session_info()
#> Session info -------------------------------------------------------------
#>  setting  value                       
#>  version  R version 3.4.1 (2017-06-30)
#>  system   x86_64, mingw32             
#>  ui       RTerm                       
#>  language (EN)                        
#>  collate  German_Germany.1252         
#>  tz       Europe/Berlin               
#>  date     2017-09-11
#> Packages -----------------------------------------------------------------
#>  package     * version date       source                          
#>  backports     1.1.0   2017-05-22 CRAN (R 3.4.0)                  
#>  base        * 3.4.1   2017-06-30 local                           
#>  compiler      3.4.1   2017-06-30 local                           
#>  datasets    * 3.4.1   2017-06-30 local                           
#>  DBI         * 0.7     2017-06-18 CRAN (R 3.4.0)                  
#>  devtools      1.13.3  2017-08-02 CRAN (R 3.4.1)                  
#>  digest        0.6.12  2017-01-27 CRAN (R 3.3.2)                  
#>  evaluate      0.10.1  2017-06-24 CRAN (R 3.4.0)                  
#>  graphics    * 3.4.1   2017-06-30 local                           
#>  grDevices   * 3.4.1   2017-06-30 local                           
#>  htmltools     0.3.6   2017-04-28 CRAN (R 3.4.0)                  
#>  knitr         1.17    2017-08-10 CRAN (R 3.4.1)                  
#>  magrittr      1.5     2014-11-22 CRAN (R 3.3.0)                  
#>  memoise       1.1.0   2017-05-29 Github (hadley/memoise@e372cde) 
#>  methods     * 3.4.1   2017-06-30 local                           
#>  Rcpp          0.12.12 2017-07-15 CRAN (R 3.4.1)                  
#>  rmarkdown     1.6     2017-06-15 CRAN (R 3.4.0)                  
#>  RPostgreSQL * 0.6-2   2017-06-24 CRAN (R 3.4.0)                  
#>  rprojroot     1.2     2017-01-16 CRAN (R 3.3.2)                  
#>  stats       * 3.4.1   2017-06-30 local                           
#>  stringi       1.1.5   2017-04-07 CRAN (R 3.3.3)                  
#>  stringr       1.2.0   2017-02-18 CRAN (R 3.3.3)                  
#>  tools         3.4.1   2017-06-30 local                           
#>  utils       * 3.4.1   2017-06-30 local                           
#>  withr         2.0.0   2017-09-07 Github (jimhester/withr@eff4818)
#>  yaml          2.1.14  2016-11-12 CRAN (R 3.3.2)

dpprdan avatar Sep 11 '17 11:09 dpprdan

It is possible of course, that the server encoding is not UTF-8. A somewhat hacky solution to account for this would be to force the client encoding in R to UTF-8, with postgresqlExecStatement("SET client_encoding = 'UTF-8'") for example. Since R may not support the the server encoding, "UTF-8" would be the common denominator.

dpprdan avatar Nov 14 '17 17:11 dpprdan

Server encoding have nothing to do with this issue. Commit ca5182770aeab3950ef21d93029e3cac1be65e1a sets the encoding to the received text data according to the client encoding at the transfer.

tomoakin avatar Nov 16 '17 08:11 tomoakin

I am testing since yesterday, looking good so far.

dpprdan avatar Nov 17 '17 12:11 dpprdan

Still looking good, this can be closed, I think. Do you have plans to release an updated version to CRAN in the near future?

dpprdan avatar Apr 11 '19 15:04 dpprdan