readr
read_fwf() incorrectly parses fixed-width files with multi-byte UTF-8 characters
Problem description:
When reading fixed-width files containing multi-byte UTF-8 characters (such as ° or accented letters), read_fwf() appears to count byte positions instead of character positions, so any field at or after the multi-byte character is misaligned.
library(readr)

# Create a minimal reproducible example
# Each line is exactly 10 characters long
test_lines <- c(
  "AAAA123456", # Line 1: 10 ASCII characters (10 bytes)
  "BBBB12°456", # Line 2: 10 characters, but 11 bytes because ° is 2 bytes in UTF-8
  "CCCC123456"  # Line 3: 10 ASCII characters (10 bytes)
)
# Write to temporary file
temp_file <- tempfile(fileext = ".txt")
writeLines(test_lines, temp_file, sep = "\n")
# Define column positions for a 10-character fixed-width format
# Field 1: positions 1-4
# Field 2: positions 5-6
# Field 3: positions 7-10
# Read with read_fwf
result <- read_fwf(
  temp_file,
  fwf_cols(
    field1 = c(1, 4),
    field2 = c(5, 6),
    field3 = c(7, 10)
  ),
  col_types = cols(.default = col_character())
)
print(result)
# Expected output:
# field1 field2 field3
# AAAA 12 3456
# BBBB 12 °456
# CCCC 12 3456
# Actual output:
# field1 field2 field3
# AAAA 12 3456
# BBBB 12 °45 # Wrong! field3 should be "°456"
# CCCC 12 3456
# The issue:
# Line 2 contains "°", which is encoded as 2 bytes in UTF-8 (0xC2 0xB0).
# read_fwf() counts it as 2 positions instead of 1, so every field boundary
# at or after it shifts by 1 position.
# Verification - check actual byte lengths:
cat("Character lengths:\n")
for (i in seq_along(test_lines)) {
  cat(sprintf("Line %d: %d chars, %d bytes\n",
              i, nchar(test_lines[i]), nchar(test_lines[i], type = "bytes")))
}
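
As a possible interim workaround, the fields can be sliced by character position in base R. The sketch below first contrasts a byte slice with a character slice for positions 7-10 of line 2, then wraps the character-based approach in a small helper. `read_fwf_chars()` and its arguments are hypothetical names made up for illustration; they are not part of readr, and the sketch assumes the file is valid UTF-8.

```r
# Sketch only: read_fwf_chars() is a hypothetical helper, not a readr function.
x <- enc2utf8("BBBB12\u00b0456")  # "\u00b0" is the degree sign

# Slicing bytes 7-10 picks up both bytes of the degree sign plus "4" and "5".
byte_slice <- rawToChar(charToRaw(x)[7:10])
Encoding(byte_slice) <- "UTF-8"

# Slicing characters 7-10 gives the intended four-character field.
char_slice <- substring(x, 7, 10)

read_fwf_chars <- function(path, starts, ends, col_names) {
  # Assumption: the file is UTF-8; readLines() marks the strings accordingly.
  lines <- readLines(path, encoding = "UTF-8", warn = FALSE)
  # substring() indexes by character, not by byte, for UTF-8 strings.
  fields <- lapply(seq_along(starts), function(j) {
    substring(lines, starts[j], ends[j])
  })
  names(fields) <- col_names
  as.data.frame(fields, stringsAsFactors = FALSE)
}

# Same three lines as the reprex, written out as UTF-8 bytes.
tmp <- tempfile(fileext = ".txt")
writeLines(
  enc2utf8(c("AAAA123456", "BBBB12\u00b0456", "CCCC123456")),
  tmp,
  useBytes = TRUE
)

fixed <- read_fwf_chars(
  tmp,
  starts = c(1, 5, 7),
  ends = c(4, 6, 10),
  col_names = c("field1", "field2", "field3")
)
print(fixed)
```

Every column comes back as character here, which matches the `col_character()` default used in the reprex; type conversion would need to be applied separately.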
sessioninfo::session_info()
─ Session info ─────────────────────────────────────────────────────────────────────────────────────────────
setting value
version R version 4.4.2 (2024-10-31)
os macOS Sequoia 15.5
system aarch64, darwin20
ui RStudio
language (EN)
collate en_US.UTF-8
ctype en_US.UTF-8
tz Europe/Rome
date 2025-07-16
rstudio 2025.05.0+496 Mariposa Orchid (desktop)
pandoc 3.2.1 @ /opt/homebrew/bin/pandoc
quarto 1.6.40 @ /usr/local/bin/quarto
─ Packages ─────────────────────────────────────────────────────────────────────────────────────────────────
! package * version date (UTC) lib source
P arrow 20.0.0.2 2025-05-26 [?] CRAN (R 4.4.1)
P assertthat 0.2.1 2019-03-21 [?] CRAN (R 4.4.0)
bit 4.6.0 2025-03-06 [1] CRAN (R 4.4.1)
bit64 4.6.0-1 2025-01-16 [1] CRAN (R 4.4.1)
cli 3.6.4 2025-02-13 [1] CRAN (R 4.4.1)
crayon 1.5.3 2024-06-20 [1] CRAN (R 4.4.1)
P dplyr * 1.1.4 2023-11-17 [?] RSPM (R 4.4.0)
P farver 2.1.2 2024-05-13 [?] RSPM (R 4.4.0)
P forcats * 1.0.0 2023-01-29 [?] RSPM (R 4.4.0)
P generics 0.1.4 2025-05-09 [?] CRAN (R 4.4.1)
ggplot2 * 3.5.0 2024-02-23 [1] CRAN (R 4.3.1)
glue 1.8.0 2024-09-30 [1] CRAN (R 4.4.1)
P gtable 0.3.6 2024-10-25 [?] RSPM (R 4.4.0)
P hms 1.1.3 2023-03-21 [?] CRAN (R 4.4.0)
P lifecycle 1.0.4 2023-11-07 [?] RSPM (R 4.4.0)
lubridate * 1.9.4 2024-12-08 [1] CRAN (R 4.4.1)
P magrittr 2.0.3 2022-03-30 [?] CRAN (R 4.4.0)
pillar 1.10.1 2025-01-07 [1] CRAN (R 4.4.1)
P pkgconfig 2.0.3 2019-09-22 [?] RSPM (R 4.4.0)
purrr * 1.0.4 2025-02-05 [1] CRAN (R 4.4.1)
R6 2.6.1 2025-02-15 [1] CRAN (R 4.4.1)
P RColorBrewer 1.1-3 2022-04-03 [?] CRAN (R 4.4.0)
readr * 2.1.5 2024-01-10 [1] CRAN (R 4.4.0)
renv 1.0.0 2023-07-07 [1] CRAN (R 4.3.2)
rlang 1.1.5 2025-01-17 [1] CRAN (R 4.4.1)
P rstudioapi 0.17.1 2024-10-22 [?] CRAN (R 4.4.1)
P scales 1.4.0 2025-04-24 [?] CRAN (R 4.4.1)
P sessioninfo 1.2.3 2025-02-05 [?] CRAN (R 4.4.1)
stringi 1.8.4 2024-05-06 [1] CRAN (R 4.4.1)
P stringr * 1.5.1 2023-11-14 [?] CRAN (R 4.4.0)
P tibble * 3.3.0 2025-06-08 [?] CRAN (R 4.4.1)
tidyr * 1.3.1 2024-01-24 [1] CRAN (R 4.4.1)
tidyselect 1.2.1 2024-03-11 [1] CRAN (R 4.4.0)
P tidyverse * 2.0.0 2023-02-22 [?] CRAN (R 4.4.0)
timechange 0.3.0 2024-01-18 [1] CRAN (R 4.4.1)
tzdb 0.5.0 2025-03-15 [1] CRAN (R 4.4.1)
P vctrs 0.6.5 2023-12-01 [?] CRAN (R 4.4.0)
P vroom 1.6.5 2023-12-05 [?] CRAN (R 4.4.0)
withr 3.0.2 2024-10-28 [1] CRAN (R 4.4.1)
[1] /Users/alessandroarrigo/Documents/GitHub/VedaWare_Policlinico/renv/library/R-4.4/aarch64-apple-darwin20
[2] /Users/alessandroarrigo/Library/Caches/org.R-project.R/R/renv/sandbox/R-4.4/aarch64-apple-darwin20/84ba8b13
* ── Packages attached to the search path.
P ── Loaded and on-disk path mismatch.