
write_csv creates an unexpected CSV file if a data frame has columns with mixed encoding attributes.

Open ujtwr opened this issue 2 years ago • 0 comments

If a data frame has columns with mixed encoding attributes, write_csv generates an unexpected CSV file.

  1. Create a tibble with UTF-8 (Japanese) column names.
library(tidyverse)

d1 <- tibble(
  id = seq(1, 1000)
) %>% 
  mutate(
    性別 = sample(x = c("男性","女性"), size = 1000, replace = TRUE),
    第1回 = rnorm(1000),
    第2回 = rnorm(1000),
    第3回 = rnorm(1000),
    第4回 = rnorm(1000),
    第5回 = rnorm(1000),
    第6回 = rnorm(1000),
    第7回 = rnorm(1000),
    )
  2. Check the encoding attribute of the column names and of the column contents.
print(stringi::stri_enc_mark(colnames(d1)))
#> "ASCII"  "native" "native" "native" "native" "native" "native" "native" "native"

Encoding(colnames(d1))
#> [1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"

## check the encoding attribute of column contents that include Japanese characters
Encoding(d1$性別)
#>  [1] "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8" ...
  3. Convert the Japanese column names, which have the "native" encoding attribute, into rows with pivot_longer.
d2 <- d1 %>% 
  pivot_longer(cols=!c(id, 性別), names_to = "実施回", values_to = "スコア")
  4. Check the encoding attribute of the column names and of the column contents again.
print(stringi::stri_enc_mark(colnames(d2)))
#> [1] "ASCII"  "native" "UTF-8"  "UTF-8" 

# second column created by mutate step
print(stringi::stri_enc_mark(d2$性別))
#>  [1] "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8"...

# third column converted by pivot_longer from column name
print(stringi::stri_enc_mark(d2$実施回))
#>  [1] "native" "native" "native" "native" "native" "native" "native" "native"...

The contents of the second column ("性別") have the "UTF-8" encoding attribute because they were created in the mutate step. On the other hand, the contents of the third column ("実施回") have the "native" encoding attribute because they were converted from column names that have the "native" encoding attribute.

So the data frame "d2" has one column with the "UTF-8" encoding attribute and one column with the "native" encoding attribute. When I write "d2" to a CSV file with "write_csv", an unexpected CSV file is generated.
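The mixed-encoding condition described above could be checked before writing. The following is a minimal base-R sketch, not part of readr: the helper names `encoding_marks()` and `has_mixed_encoding()` are hypothetical, and the "ASCII"/"native" labels only imitate what `stringi::stri_enc_mark()` reports. It inspects column contents only, not column names.

```r
# Hypothetical helper: label each string as "ASCII", "native", or with its
# declared encoding mark ("UTF-8", "latin1", "bytes").
encoding_marks <- function(x) {
  marks <- Encoding(x)
  # A string is pure ASCII when every byte is below 0x80.
  ascii <- vapply(
    x,
    function(s) all(as.integer(charToRaw(s)) < 128L),
    logical(1),
    USE.NAMES = FALSE
  )
  marks[marks == "unknown" & ascii] <- "ASCII"
  # Anything still unmarked is non-ASCII text in the native encoding.
  marks[marks == "unknown"] <- "native"
  marks
}

# Hypothetical helper: TRUE when the character columns of `df` carry more
# than one non-ASCII encoding mark (for example "UTF-8" and "native").
has_mixed_encoding <- function(df) {
  char_cols <- df[vapply(df, is.character, logical(1))]
  marks <- unlist(lapply(char_cols, encoding_marks), use.names = FALSE)
  length(setdiff(unique(marks), "ASCII")) > 1
}
```

Under these assumptions, `has_mixed_encoding(d2)` would be expected to return TRUE for the data frame in this report, since "性別" is marked "UTF-8" and "実施回" is "native".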

  5. Write "d2" to CSV repeatedly.
write_csv(x = d2, file = "file01.csv")
write_csv(x = d2, file = "file02.csv")
write_csv(x = d2, file = "file03.csv")
write_csv(x = d2, file = "file04.csv")
write_csv(x = d2, file = "file05.csv")

I expect all the files to be identical, but they differ from each other.

  6. In a terminal, diff the files.
$ diff file01.csv file02.csv 
#> 16c16
#> < 3,女性,第7回,-0.6832698890819322
#> ---
#> > 3,女性,第1回,-0.6832698890819322
#> 147c147
#> < 21,女性,第6回,0.7099581154824097
#> ...

$ diff file03.csv file04.csv 
#> 28c28
#> < 4,男性,第5回,-0.19047686780971046
#> ---
#> > 4,男性,第6回,-0.19047686780971046
#> 646c646
#> < 93,女性,第1回,0.6614739457424788
#> ...
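The same comparison can also be made from within R instead of the terminal. This is a small sketch using `tools::md5sum()`, which ships with base R; the helper name `files_identical()` is an assumption introduced here for illustration.

```r
# Hypothetical helper: TRUE when all files in `paths` have the same MD5 hash,
# i.e. their contents are byte-for-byte identical (barring hash collisions).
files_identical <- function(paths) {
  stopifnot(length(paths) >= 1L, all(file.exists(paths)))
  hashes <- tools::md5sum(paths)
  length(unique(unname(hashes))) == 1L
}
```

For the files written above, the call would be `files_identical(sprintf("file%02d.csv", 1:5))`; given the diffs shown, it would be expected to return FALSE.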
  7. As a workaround, convert all columns to "UTF-8".
d3 <- d2 %>% 
  mutate(across(everything(), ~stringi::stri_encode(.x, to = "UTF-8"))) 

write_csv(x = d3, file = "file01.csv")
write_csv(x = d3, file = "file02.csv")
write_csv(x = d3, file = "file03.csv")
write_csv(x = d3, file = "file04.csv")
write_csv(x = d3, file = "file05.csv")

All the files are identical.
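A narrower variant of this workaround is sketched below, assuming that only character columns and column names need converting and that numeric columns should keep their type. It uses base R's `enc2utf8()` rather than `stringi::stri_encode()`; the helper name `to_utf8_df()` is hypothetical, and whether this avoids the inconsistent output as reliably as the conversion above has not been verified here.

```r
# Hypothetical helper: re-encode character columns and column names to UTF-8,
# leaving all other column types untouched.
to_utf8_df <- function(df) {
  is_chr <- vapply(df, is.character, logical(1))
  # enc2utf8() converts each string according to its declared encoding mark
  # (unmarked strings are treated as being in the native encoding).
  df[is_chr] <- lapply(df[is_chr], enc2utf8)
  names(df) <- enc2utf8(names(df))
  df
}
```

It would be applied as `d3 <- to_utf8_df(d2)` before calling `write_csv`.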

The problem is that, at first glance, the files appear to be generated correctly: "write_csv" shows no error or warning messages. Is it possible for write_csv to write the file correctly without converting the data to UTF-8 by hand, or at least to detect the problem?

Or is this a problem in dplyr (mutate) or tidyr (pivot_longer)?

ujtwr, Jul 18 '22 05:07