readr
write_csv creates an unexpected CSV file if the data frame has columns with mixed encoding attributes.
If a data frame has columns with mixed encoding attributes, write_csv generates an unexpected CSV file.
- Create a tibble with UTF-8 (Japanese) column names.
library(tidyverse)
d1 <- tibble(
  id = seq(1, 1000)
) %>%
  mutate(
    性別 = sample(x = c("男性", "女性"), size = 1000, replace = TRUE),
    第1回 = rnorm(1000),
    第2回 = rnorm(1000),
    第3回 = rnorm(1000),
    第4回 = rnorm(1000),
    第5回 = rnorm(1000),
    第6回 = rnorm(1000),
    第7回 = rnorm(1000)
  )
- Check the encoding attribute of the column names and of the column contents.
print(stringi::stri_enc_mark(colnames(d1)))
#> "ASCII" "native" "native" "native" "native" "native" "native" "native" "native"
Encoding(colnames(d1))
#> [1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
## Check the encoding attribute of column contents that include Japanese characters
Encoding(d1$性別)
#> [1] "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8" ...
- Convert the Japanese column names, which have the "native" encoding attribute, into rows with pivot_longer.
d2 <- d1 %>%
  pivot_longer(cols = !c(id, 性別), names_to = "実施回", values_to = "スコア")
- Check the encoding attribute of the column names and of the column contents again.
print(stringi::stri_enc_mark(colnames(d2)))
#> [1] "ASCII" "native" "UTF-8" "UTF-8"
# The second column was created by the mutate step
print(stringi::stri_enc_mark(d2$性別))
#> [1] "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8"...
# The third column was converted from column names by pivot_longer
print(stringi::stri_enc_mark(d2$実施回))
#> [1] "native" "native" "native" "native" "native" "native" "native" "native"...
The contents of the second column ("性別") have the "UTF-8" encoding attribute because they were created in the mutate step. On the other hand, the contents of the third column ("実施回") have the "native" encoding attribute because they were converted from column names that have the "native" encoding attribute.
So the data frame "d2" has both a column with the "UTF-8" encoding attribute and a column with the "native" encoding attribute. When I write "d2" to a CSV file with "write_csv", an unexpected CSV file is generated.
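The mixed marks described above can also be listed for every character column at once. The following is a minimal sketch in base R; `encoding_marks` is a hypothetical helper name, not part of readr or stringi. Note that base `Encoding()` reports "unknown" for strings that `stringi::stri_enc_mark()` reports as "native" or "ASCII".

```r
# Hypothetical helper (not part of readr): list the distinct encoding marks
# found in each character column of a data frame.
encoding_marks <- function(df) {
  is_chr <- vapply(df, is.character, logical(1))
  lapply(df[is_chr], function(col) unique(Encoding(col)))
}

# Small example: the same text, once marked "UTF-8" and once with the mark removed.
marked <- "\u7537\u6027"
unmarked <- marked
Encoding(unmarked) <- "unknown"

example <- data.frame(a = marked, b = unmarked, n = 1, stringsAsFactors = FALSE)
marks <- encoding_marks(example)
```

Calling `encoding_marks(d2)` on the data frame from this report should show different marks for "性別" and "実施回" if the columns are mixed as described.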
- Write "d2" to CSV repeatedly.
write_csv(x = d2, file = "file01.csv")
write_csv(x = d2, file = "file02.csv")
write_csv(x = d2, file = "file03.csv")
write_csv(x = d2, file = "file04.csv")
write_csv(x = d2, file = "file05.csv")
I expect all files to be the same, but they differ from each other: the "実施回" value on some rows changes from file to file.
- In a terminal, diff the files.
$ diff file01.csv file02.csv
#> 16c16
#> < 3,女性,第7回,-0.6832698890819322
#> ---
#> > 3,女性,第1回,-0.6832698890819322
#> 147c147
#> < 21,女性,第6回,0.7099581154824097
#> ...
$ diff file03.csv file04.csv
#> 28c28
#> < 4,男性,第5回,-0.19047686780971046
#> ---
#> > 4,男性,第6回,-0.19047686780971046
#> 646c646
#> < 93,女性,第1回,0.6614739457424788
#> ...
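The same comparison can be made from within R by comparing checksums. This is a small sketch using `tools::md5sum()` from base R; `files_identical` is a hypothetical helper name introduced here for illustration.

```r
# Hypothetical helper: TRUE when every file in `paths` has the same MD5 checksum.
files_identical <- function(paths) {
  length(unique(unname(tools::md5sum(paths)))) == 1
}

# Small example with temporary files: two with equal contents, one that differs.
f1 <- tempfile(fileext = ".csv")
f2 <- tempfile(fileext = ".csv")
f3 <- tempfile(fileext = ".csv")
writeLines(c("id,score", "1,0.5"), f1)
writeLines(c("id,score", "1,0.5"), f2)
writeLines(c("id,score", "1,0.7"), f3)

same <- files_identical(c(f1, f2))
differ <- files_identical(c(f1, f3))
```

For the files in this report, the call would be `files_identical(sprintf("file%02d.csv", 1:5))`.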
- To fix this, encode all columns to "UTF-8".
d3 <- d2 %>%
  mutate(across(everything(), ~stringi::stri_encode(.x, to = "UTF-8")))
write_csv(x = d3, file = "file01.csv")
write_csv(x = d3, file = "file02.csv")
write_csv(x = d3, file = "file03.csv")
write_csv(x = d3, file = "file04.csv")
write_csv(x = d3, file = "file05.csv")
All files are the same.
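A narrower variant of this workaround would touch only the column names and the character columns, so that numeric columns such as "id" and "スコア" keep their type. The sketch below uses base R's `enc2utf8()`; `to_utf8` is a hypothetical helper name, and it rests on the assumption, suggested by the workaround above, that unifying the encoding marks is what avoids the problem.

```r
# Hypothetical helper: convert column names and character columns to UTF-8,
# leaving numeric and other columns untouched.
to_utf8 <- function(df) {
  names(df) <- enc2utf8(names(df))
  is_chr <- vapply(df, is.character, logical(1))
  df[is_chr] <- lapply(df[is_chr], enc2utf8)
  df
}

# Small example: a character column whose UTF-8 mark has been removed.
unmarked <- "\u7537\u6027"
Encoding(unmarked) <- "unknown"
example <- data.frame(id = 1L, label = unmarked, stringsAsFactors = FALSE)
fixed <- to_utf8(example)
```

With the data from this report, it could stand in for the mutate step as `d3 <- to_utf8(d2)`.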
The problem is that, at first glance, the files appear to be generated correctly: "write_csv" shows no error or warning messages. Is it possible for write_csv to write the file correctly without converting to UTF-8 by hand, or to detect the error?
Or is this a problem in dplyr (mutate) or tidyr (pivot_longer)?