
UTF-8 Encoding Info Is Lost If The First String Is ASCII

shrektan opened this issue 7 years ago • 15 comments

Hi, there's a bug in the dev version: the UTF-8 encoding doesn't get preserved if the first string is ASCII. See the reproducible example below, thanks.

library(fst)
library(data.table)
dt1 <- data.table(
  a = enc2utf8(c("english", "中文"))
)
dt2 <- data.table(
  a = enc2utf8(c("中文", "english"))
)
file1 <- tempfile(fileext = ".fst")
file2 <- tempfile(fileext = ".fst")
write_fst(dt1, file1)
write_fst(dt2, file2)
res1 <- read_fst(file1)
res2 <- read_fst(file2)
print(res1)
#>         a
#> 1 english
#> 2   涓枃
print(res2)
#>         a
#> 1    中文
#> 2 english
Encoding(res1$a)
#> [1] "unknown" "unknown"
Encoding(res2$a)
#> [1] "UTF-8"   "unknown"
Encoding(res1$a) <- "UTF-8"
print(res1)
#>         a
#> 1 english
#> 2    中文

Created on 2018-05-22 by the reprex package (v0.2.0).

Session info
devtools::session_info()
#> ─ Session info ──────────────────────────────────────────────────────────
#>  setting  value                                              
#>  version  R version 3.4.4 (2018-03-15)                       
#>  os       Windows 7 x64 SP 1                                 
#>  system   x86_64, mingw32                                    
#>  ui       RTerm                                              
#>  language (EN)                                               
#>  collate  Chinese (Simplified)_People's Republic of China.936
#>  tz       Asia/Taipei                                        
#>  date     2018-05-22                                         
#> 
#> ─ Packages ──────────────────────────────────────────────────────────────
#>  package     * version     date       source                            
#>  assertthat    0.2.0       2017-04-11 CRAN (R 3.4.4)                    
#>  backports     1.1.2       2017-12-13 CRAN (R 3.4.3)                    
#>  cli           1.0.0       2017-11-05 CRAN (R 3.4.4)                    
#>  clisymbols    1.2.0       2017-05-21 CRAN (R 3.4.4)                    
#>  crayon        1.3.4       2017-09-16 CRAN (R 3.4.4)                    
#>  data.table  * 1.11.0      2018-05-01 CRAN (R 3.4.4)                    
#>  desc          1.2.0       2018-05-01 CRAN (R 3.4.4)                    
#>  devtools      1.13.5.9000 2018-05-10 local                             
#>  digest        0.6.15      2018-01-28 CRAN (R 3.4.3)                    
#>  evaluate      0.10.1      2017-06-24 CRAN (R 3.4.1)                    
#>  fst         * 0.8.6       2018-05-22 Github (fstpackage/fst@6eeb3e3)   
#>  htmltools     0.3.6.9000  2018-05-07 local                             
#>  knitr         1.20        2018-02-20 CRAN (R 3.4.4)                    
#>  magrittr      1.5         2014-11-22 CRAN (R 3.4.4)                    
#>  memoise       1.1.0       2017-04-21 CRAN (R 3.4.4)                    
#>  pkgbuild      0.0.0.9000  2017-12-06 Github (r-lib/pkgbuild@ce7f6d1)   
#>  pkgload       0.0.0.9000  2017-12-06 Github (r-lib/pkgload@70eaef8)    
#>  R6            2.2.2       2017-06-17 CRAN (R 3.4.4)                    
#>  Rcpp          0.12.16     2018-03-13 CRAN (R 3.4.4)                    
#>  rlang         0.2.0       2018-02-20 CRAN (R 3.4.3)                    
#>  rmarkdown     1.9         2018-03-01 CRAN (R 3.4.4)                    
#>  rprojroot     1.3-2       2018-01-03 CRAN (R 3.4.3)                    
#>  sessioninfo   1.0.1.9000  2017-12-06 Github (r-lib/sessioninfo@c871d01)
#>  stringi       1.1.7       2018-03-12 CRAN (R 3.4.4)                    
#>  stringr       1.3.0       2018-02-19 CRAN (R 3.4.3)                    
#>  testthat      2.0.0       2017-12-13 CRAN (R 3.4.4)                    
#>  usethis       1.3.0       2018-02-24 CRAN (R 3.4.4)                    
#>  withr         2.1.2       2018-03-15 CRAN (R 3.4.4)                    
#>  yaml          2.1.19      2018-05-01 CRAN (R 3.4.4)

shrektan · May 22 '18 06:05

Hi @shrektan, thanks for reporting!

The current behavior of fst is that it checks the encoding of the first element of a column. That element's encoding is stored, and the remainder of the column is assumed to use the identical encoding.

So your example actually shows expected behavior but I can see how that can be very confusing (and perhaps a sub-optimal solution).
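
To make that concrete, here is a minimal R sketch of the strategy (illustrative only; fst does this at the C level, and column_encoding is a made-up helper name):

column_encoding <- function(x) {
  # the encoding mark of the first element is assumed to hold
  # for the entire vector
  Encoding(x[1])
}

column_encoding(enc2utf8(c("english", "中文")))
#> [1] "unknown"
column_encoding(enc2utf8(c("中文", "english")))
#> [1] "UTF-8"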

Also, there is an argument uniform_encoding that can be used to override this behavior. If set to FALSE, no uniform encoding is assumed, but in your case that will probably lead to an error:

dt1 <- data.table::data.table(
  a = enc2utf8(c("english", "中文"))
)

file1 <- tempfile(fileext = ".fst")

fst::write_fst(dt1, file1, uniform_encoding = FALSE)

#> Error in fst::write_fst(dt1, file1, uniform_encoding = FALSE) : 
#>   Character vectors with mixed encodings are currently not supported

so that currently forces the user to make sure that each element in the vector has an identical encoding. Perhaps it would be clearer to have an argument force_UTF8_encoding that can be set to TRUE when the default behavior needs an override?

thanks

MarcusKlik · May 22 '18 09:05

Yes, having ASCII strings mixed with UTF-8 strings is not uncommon, and it's very difficult to ensure that a UTF-8 string always ends up in the first element. An argument force_UTF8_encoding looks good to me. Thanks.
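
For reference, the root cause of why that is hard to guarantee: R never marks pure-ASCII strings with an encoding, so even enc2utf8() leaves them as "unknown":

Encoding(enc2utf8(c("english", "中文")))
#> [1] "unknown" "UTF-8"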

shrektan · May 22 '18 11:05

Hi @shrektan, thanks, so you're saying that the current strategy is not very effective, because unmarked (plain ASCII) strings often end up as the first element of columns that also contain UTF-8 strings. Then forcing the encoding is indeed a better option, although it will take a pretty heavy performance hit:

# generate sample table (first element non-UTF-8)
x <- c("english", sample(c("english", "<U+4E2D><U+6587>"), 10e6, replace = TRUE))
ft <- data.frame(X = x, stringsAsFactors = FALSE)

# read / write cycle
fst::write.fst(ft, "1.fst")
ft2 <- fst::read_fst("1.fst")
#> Loading required namespace: data.table
y <- ft2[[1]]

# incorrectly encoded
print(y[1:10])
#>  [1] "english" "中文"  "english" "中文"  "中文"  "中文"  "中文"  "中文"  "中文"  "中文" 

# re-encode result
timings <- microbenchmark::microbenchmark(
  Encoding(y) <- "UTF-8",
  times = 1
)

# speed in GB/s
as.integer(object.size(y)) / as.numeric(median(timings$time))
#> [1] 0.06249857

# correctly encoded
print(y[1:10])
#>  [1] "english" "中文"    "english" "中文"    "中文"    "中文"    "中文"    "中文"    "中文"    "中文"

So this vector can be re-encoded at around ~60MB/s, much slower than all other IO processes during the file read. We'll have to see how fast the encoding will be at the C level, but it will definitely slow things down significantly.

(but correctness is more important than performance in this case :-))

MarcusKlik · May 22 '18 21:05

Hi @MarcusKlik, I made a simple C version test for you (about 4x faster than the R version). Hope it's helpful. (Note: I learned the trick of using LEVELS to tell whether a string is UTF-8 from data.table.)

Rcpp::cppFunction("void enc2utf8(Rcpp::StringVector x) {
  const int n = x.size();
  for (int i = 0; i < n; ++i) {
    // overwrite each element with a UTF-8-marked version where needed
    x[i] = MARKUTF8(x[i]);
  }
}", includes = "
// the encoding flags live in the general purpose bits of the sxpinfo header
#define IS_UTF8(x)  (LEVELS(x) & 8)
#define IS_ASCII(x) (LEVELS(x) & 64)
// ASCII, NA and already-UTF-8 strings need no conversion
#define NEED2UTF8(s) !(IS_ASCII(s) || (s)==NA_STRING || IS_UTF8(s))
// re-create the CHARSXP with an explicit UTF-8 mark only where needed
#define MARKUTF8(s) (!NEED2UTF8(s) ? (s) : Rf_mkCharCE(s, CE_UTF8))")



x <- c("english", sample(c("english", "\u4e2d\u6587"), 10e6, replace = TRUE))
ft <- data.frame(X = x, stringsAsFactors = FALSE)

# read / write cycle
tmpfile <- tempfile(fileext = ".fst")
fst::write.fst(ft, tmpfile)
ft2 <- fst::read_fst(tmpfile)
#> Loading required namespace: data.table
y <- ft2[[1]]
y2 <- data.table::copy(y)

# incorrectly encoded
print(y[1:10])
#>  [1] "english" "english" "english" "涓枃"   "english" "涓枃"   "涓枃"  
#>  [8] "涓枃"   "涓枃"   "涓枃"
print(y2[1:10])
#>  [1] "english" "english" "english" "涓枃"   "english" "涓枃"   "涓枃"  
#>  [8] "涓枃"   "涓枃"   "涓枃"

# re-encode result
timings <- microbenchmark::microbenchmark(
  Encoding(y) <- "UTF-8",
  enc2utf8(y2),
  times = 5,
  control = list(order = "inorder")
)

# speed in GB/s
as.integer(object.size(y)) / as.numeric(median(timings$time[seq.int(1, 10, by = 2)]))
#> [1] 0.1221907
as.integer(object.size(y2)) / as.numeric(median(timings$time[seq.int(2, 10, by = 2)]))
#> [1] 0.5242667

# correctly encoded
print(y[1:10])
#>  [1] "english" "english" "english" "中文"    "english" "中文"    "中文"   
#>  [8] "中文"    "中文"    "中文"
print(y2[1:10])
#>  [1] "english" "english" "english" "中文"    "english" "中文"    "中文"   
#>  [8] "中文"    "中文"    "中文"

Created on 2018-05-23 by the reprex package (v0.2.0).

shrektan · May 23 '18 02:05

Hi @shrektan, nice work! Interesting that you directly use the general purpose bits from the sxpinfo header (didn't realize the encoding was stored there), and a factor 4 speed improvement is great.

Because the code is accessing the sxpinfo header and Rf_mkCharCE() is (possibly) called, the conversions will have to be done on the main thread. But they can be done in parallel with loading of the next chunk of column data from the fst file.

Thanks a lot, I'll make sure the next release features the force_UTF8_encoding parameter!

MarcusKlik · May 24 '18 21:05

I've run into this issue previously. It's also possible to use a (slightly hacky) R-based solution: when writing an fst file to disk, add an extra first row in which every character column contains a string with the desired encoding. Then, after reading the fst file back into R, simply remove that first row.

Suppose we have a data frame where the first element of a character column is plain ASCII and therefore carries no UTF-8 mark:

library(tidyverse)
library(fst)

df <- tibble(country = enc2utf8(c('nederland', 'italië')),
             random_variable = c(3, 2))

Encoding(df$country)
#> [1] "unknown" "UTF-8"

Writing and reading will cause encoding trouble:

fst::write_fst(df, 'enc_test.fst')
df_new <- fst::read_fst('enc_test.fst')

Encoding(df_new$country)
#> [1] "unknown" "unknown"

However, adding an extra row with the desired encoding will solve this:

write_fst_custom <- function(x, path){
  
  extra_first_row <- dplyr::slice(x, 1) %>%
    dplyr::mutate_if(is.character,
                     list(~replace(., TRUE, enc2utf8('belgië'))))
  
  x <- dplyr::bind_rows(extra_first_row, x)
  
  fst::write_fst(x, path)
  
}

read_fst_custom <- function(path){
  
  fst::read_fst(path)[-1, ]
  
}


write_fst_custom(df, 'enc_test.fst')
df_b <- read_fst_custom('enc_test.fst')

Encoding(df_b$country)
#> [1] "unknown" "UTF-8"

Created on 2019-06-26 by the reprex package (v0.3.0)

This solution could be made more general, for example by:

  1. detecting the encoding types present, for example with table(Encoding(column)),
  2. checking whether, besides 'unknown', only a single encoding type is used (for example UTF-8), and
  3. making the row that is added when writing to disk dependent on the detected encoding (see the sketch below).
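
A minimal sketch of steps 1 and 2 (detect_column_encoding is a made-up helper name; it assumes at most one marked encoding per column):

detect_column_encoding <- function(x) {
  # encodings explicitly marked on the elements; ASCII and NA report "unknown"
  encs <- setdiff(unique(Encoding(x)), "unknown")
  if (length(encs) > 1) stop("mixed encodings: ", paste(encs, collapse = ", "))
  if (length(encs) == 0) "unknown" else encs
}

detect_column_encoding(df$country)
#> [1] "UTF-8"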

steffengreup · Jun 26 '19 13:06

Hi @steffengreup, thanks for reporting and for providing your code examples!

In the current implementation, fst scans for the first non-NA element in a character vector and determines the encoding of that element. That encoding is set as the encoding for the entire vector.

The reason it is done like that is performance: it takes a relatively long time to determine the encoding of a full vector, as each element has to be retrieved through the (slow) R API (speed is around 60 MB/s, as shown in the experiment above). In most cases the encoding will be the same for every element of a vector, which is why this method was chosen.

These problems can be caught early by using uniform_encoding = FALSE; for your example:

library(dplyr)

df <- tibble(
  country = enc2utf8(c('nederland', 'italië')),
  random_variable = c(3, 2)
)

# first element has ASCII encoding
stringi::stri_enc_mark(df$country)
#> [1] "ASCII" "UTF-8"

# fst checks for and finds mixed encodings
df %>%
  fst::write_fst('enc_test.fst', uniform_encoding = FALSE)
#> Error in fst::write_fst(., "enc_test.fst", uniform_encoding = FALSE): Character vectors with mixed encodings are currently not supported

that makes the user aware of the mixed encodings in the source data, but it doesn't provide a way to handle that potential problem correctly.

Your solution forces the encoding by 'manually' setting the first element to the desired encoding; that way the encoding scan finds that element first and sets the correct encoding for the whole column.

I still think that storing the encoding of each element in a character vector in the fst format is a waste of space and speed, but perhaps we need a setting to force all character columns to UTF-8 if the user requires that. @shrektan provided a nice example of how that could be done from C at high speed...

thanks!

MarcusKlik · Jun 27 '19 09:06

I think this is causing issues here https://github.com/xiaodaigh/disk.frame/issues/315 as well.

xiaodaigh · Oct 13 '20 11:10

Hi @xiaodaigh, thanks for the link, the following code example illustrates why you are having these problems:

library(dplyr)
library(fst)

# a table with mixed encodings
tbl <- tibble(
  country1 = enc2utf8(c("Nederland", "Italië")),
  country2 = enc2utf8(c("Italië", "Nederland"))
)

# show mixed encodings
stringi::stri_enc_mark(tbl$country1)
#> [1] "ASCII" "UTF-8"

stringi::stri_enc_mark(tbl$country2)
#> [1] "UTF-8" "ASCII"

path <- tempfile(fileext = ".fst")

# fst generates error to warn the user
tbl %>%
  write_fst(path, uniform_encoding = FALSE)
#> Error in write_fst(., path, uniform_encoding = FALSE) : 
#>   Character vectors with mixed encodings are currently not supported

# but if you assume uniform encoding
tbl %>%
  write_fst(path)

# things get messy
read_fst(path)
#>    country1  country2
#> 1 Nederland    Italië
#> 2   Italië Nederland

So, unless you set uniform_encoding = FALSE when calling write_fst(), fst will not warn you about mixed encodings. At the moment, that means the user is responsible for ensuring a uniform encoding.

Checking or enforcing the encoding takes a bite out of performance, which is why fst leaves that to the user. But I agree that the current behavior can produce unexpected results for users who are not aware of the encoding strategy used, and that's worse than reduced performance. Perhaps the default should be to force all encodings to UTF-8 unless uniform_encoding = TRUE.
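
Until something like that exists, a small wrapper can enforce UTF-8 on write (a minimal sketch; write_fst_utf8 is a made-up name, and it assumes that elements marked "unknown" are valid in the native encoding, which is what base R's enc2utf8() expects):

write_fst_utf8 <- function(x, path, ...) {
  # re-encode every character column to UTF-8 before handing it to fst
  chr <- vapply(x, is.character, logical(1))
  x[chr] <- lapply(x[chr], enc2utf8)
  fst::write_fst(x, path, ...)
}

write_fst_utf8(tbl, path)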

MarcusKlik · Oct 15 '20 07:10

Perhaps the default should be to force all encodings to UTF-8 unless uniform_encoding = TRUE.

Not a bad idea.

Simply fst::write_fst(a, "ok.fst", uniform_encoding = FALSE) throws an error

Error in fst::write_fst(a, "ok.fst", uniform_encoding = FALSE) : 
  Character vectors with mixed encodings are currently not supported

xiaodaigh · Oct 15 '20 13:10

I still think that storing the encoding of each element in a character vector in the fst format is a waste of space and speed, but perhaps we need a setting to force all character columns to UTF-8 if the user requires that.

@MarcusKlik, that's what I'm looking forward to. I think it is a great idea and it would avoid a lot of headaches. Ideally, the user should also be able to choose another encoding like "Latin-1" (as data.table's fread and fwrite allow), right? This is the encoding that works for my language.

Right now I cannot properly use the disk.frame package for my work because of this very issue. https://github.com/xiaodaigh/disk.frame/issues/315

Thank you for the package. I'm very impressed by fst's speed and I hope it becomes a more universal solution.

GitHunter0 · Feb 07 '21 22:02

Hi @GitHunter0, thanks for your suggestion. And I agree, the current implementation will definitely cause you a lot of headaches if your language requires reencoding all character vectors before writing to disk using fst::write_fst()!

Also, fst assuming a uniform encoding, while beneficial for speed, reduces overall stability, so that's not ideal.

The easiest solution would be to allow for a parameter force_encoding to be set to a specific encoding. The same option could be added to the global options to force the encoding to a default setting with each call to write_fst().
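
Something like the sketch below, where both the argument and the option are imagined and do not exist in fst today:

# hypothetical per-call override (force_encoding is not a real argument yet)
fst::write_fst(df, path, force_encoding = "Latin-1")

# hypothetical global default picked up by every write_fst() call
options(fst.force_encoding = "UTF-8")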

MarcusKlik · Mar 04 '21 14:03

Hey @MarcusKlik , that would be great! Thanks for the feedback

GitHunter0 · Mar 04 '21 18:03

Any progress on this?

StaffanBetner · Feb 07 '22 21:02

Is there any update yet? Especially with UTF-8 data, this issue is very annoying. A workaround without a force_encoding argument could be to look for the first non-NA element whose encoding is not "unknown" (a sketch follows the example below). Or just auto-mark ASCII strings as UTF-8, since that cannot introduce encoding problems: ASCII is a subset of UTF-8.

Encoding(c("a", "ä"))
#> [1] "unknown" "UTF-8"  

AshesITR · Aug 28 '23 09:08