
read_fwf() incorrectly parses fixed-width files with multi-byte UTF-8 characters

Open alearrigo opened this issue 5 months ago • 0 comments

Problem description:

When reading fixed-width files that contain multi-byte UTF-8 characters (such as ° or accented letters), read_fwf() counts byte positions instead of character positions, so any field at or after a multi-byte character is misaligned.
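To make the byte-versus-character distinction concrete, here is a minimal base R sketch. It does not call readr at all; it only assumes that slicing the raw bytes at positions 7-10 is a fair stand-in for byte-based field extraction, and the sample line is the same one used in the reprex.

```r
# Sample line: 10 characters, but 11 bytes in UTF-8 because of the degree sign
line <- "BBBB12\u00b0456"

n_chars <- nchar(line, type = "chars")
n_bytes <- nchar(line, type = "bytes")

# Character-based slice of positions 7-10
by_char <- substr(line, 7, 10)

# Byte-based slice of positions 7-10, decoded back to a UTF-8 string
bytes <- charToRaw(enc2utf8(line))
by_byte <- rawToChar(bytes[7:10])
Encoding(by_byte) <- "UTF-8"

n_chars  # 10
n_bytes  # 11
by_char  # "°456"
by_byte  # "°45"
```

The byte-based slice matches the truncated value that read_fwf() returns for field3, which is consistent with positions being interpreted as byte offsets.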

library(readr)

# Create a minimal reproducible example
# Each line should be exactly 10 characters long
test_lines <- c(
  "AAAA123456",           # Line 1: 10 ASCII characters (10 bytes)
  "BBBB12°456",           # Line 2: 10 characters, but 11 bytes because ° is 2 bytes in UTF-8
  "CCCC123456"            # Line 3: 10 ASCII characters (10 bytes)
)

# Write to temporary file
temp_file <- tempfile(fileext = ".txt")
writeLines(test_lines, temp_file, sep = "\n")

# Define column positions for a 10-character fixed-width format
# Field 1: positions 1-4
# Field 2: positions 5-6 
# Field 3: positions 7-10

# Read with read_fwf
result <- read_fwf(
  temp_file,
  fwf_cols(
    field1 = c(1, 4),
    field2 = c(5, 6),
    field3 = c(7, 10)
  ),
  col_types = cols(.default = col_character())
)

print(result)

# Expected output:
# field1 field2 field3
# AAAA   12     3456
# BBBB   12     °456  
# CCCC   12     3456

# Actual output:
# field1 field2 field3
# AAAA   12     3456
# BBBB   12     °45    # Wrong! field3 should be "°456"
# CCCC   12     3456

# The issue:
# Line 2 contains "°", which is encoded as 2 bytes in UTF-8 (0xC2 0xB0).
# read_fwf() counts it as 2 positions instead of 1, so every field boundary
# after it shifts by 1 position and field3 loses its final "6".

# Verification - check actual byte lengths:
cat("Character and byte lengths per line:\n")
for (i in seq_along(test_lines)) {
  cat(sprintf("Line %d: %d chars, %d bytes\n", 
              i, nchar(test_lines[i]), nchar(test_lines[i], type = "bytes")))
}
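Until this is resolved, one possible workaround is to split the lines by character position in base R. The sketch below is only an illustration: `read_fwf_chars()` is a hypothetical helper, not part of readr, and it assumes the file is UTF-8 encoded and that every column should be returned as character.

```r
# Hypothetical helper (not a readr function): split each line by
# character positions rather than byte positions
read_fwf_chars <- function(path, starts, ends, col_names) {
  # Mark the lines as UTF-8 so substring() counts characters, not bytes
  lines <- readLines(path, encoding = "UTF-8", warn = FALSE)
  fields <- lapply(seq_along(starts), function(i) {
    trimws(substring(lines, starts[i], ends[i]))
  })
  names(fields) <- col_names
  as.data.frame(fields, stringsAsFactors = FALSE)
}

# Same three lines as in the reprex, written as UTF-8 bytes
temp_file <- tempfile(fileext = ".txt")
writeLines(
  enc2utf8(c("AAAA123456", "BBBB12\u00b0456", "CCCC123456")),
  temp_file,
  useBytes = TRUE
)

result_chars <- read_fwf_chars(
  temp_file,
  starts = c(1, 5, 7),
  ends = c(4, 6, 10),
  col_names = c("field1", "field2", "field3")
)

result_chars$field3  # "3456" "°456" "3456"
```

This returns the expected "°456" for line 2, at the cost of losing readr's type guessing and streaming, so it is probably only suitable for small files.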


sessioninfo::session_info()
─ Session info ─────────────────────────────────────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.4.2 (2024-10-31)
 os       macOS Sequoia 15.5
 system   aarch64, darwin20
 ui       RStudio
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       Europe/Rome
 date     2025-07-16
 rstudio  2025.05.0+496 Mariposa Orchid (desktop)
 pandoc   3.2.1 @ /opt/homebrew/bin/pandoc
 quarto   1.6.40 @ /usr/local/bin/quarto

─ Packages ─────────────────────────────────────────────────────────────────────────────────────────────────
 ! package      * version  date (UTC) lib source
 P arrow          20.0.0.2 2025-05-26 [?] CRAN (R 4.4.1)
 P assertthat     0.2.1    2019-03-21 [?] CRAN (R 4.4.0)
   bit            4.6.0    2025-03-06 [1] CRAN (R 4.4.1)
   bit64          4.6.0-1  2025-01-16 [1] CRAN (R 4.4.1)
   cli            3.6.4    2025-02-13 [1] CRAN (R 4.4.1)
   crayon         1.5.3    2024-06-20 [1] CRAN (R 4.4.1)
 P dplyr        * 1.1.4    2023-11-17 [?] RSPM (R 4.4.0)
 P farver         2.1.2    2024-05-13 [?] RSPM (R 4.4.0)
 P forcats      * 1.0.0    2023-01-29 [?] RSPM (R 4.4.0)
 P generics       0.1.4    2025-05-09 [?] CRAN (R 4.4.1)
   ggplot2      * 3.5.0    2024-02-23 [1] CRAN (R 4.3.1)
   glue           1.8.0    2024-09-30 [1] CRAN (R 4.4.1)
 P gtable         0.3.6    2024-10-25 [?] RSPM (R 4.4.0)
 P hms            1.1.3    2023-03-21 [?] CRAN (R 4.4.0)
 P lifecycle      1.0.4    2023-11-07 [?] RSPM (R 4.4.0)
   lubridate    * 1.9.4    2024-12-08 [1] CRAN (R 4.4.1)
 P magrittr       2.0.3    2022-03-30 [?] CRAN (R 4.4.0)
   pillar         1.10.1   2025-01-07 [1] CRAN (R 4.4.1)
 P pkgconfig      2.0.3    2019-09-22 [?] RSPM (R 4.4.0)
   purrr        * 1.0.4    2025-02-05 [1] CRAN (R 4.4.1)
   R6             2.6.1    2025-02-15 [1] CRAN (R 4.4.1)
 P RColorBrewer   1.1-3    2022-04-03 [?] CRAN (R 4.4.0)
   readr        * 2.1.5    2024-01-10 [1] CRAN (R 4.4.0)
   renv           1.0.0    2023-07-07 [1] CRAN (R 4.3.2)
   rlang          1.1.5    2025-01-17 [1] CRAN (R 4.4.1)
 P rstudioapi     0.17.1   2024-10-22 [?] CRAN (R 4.4.1)
 P scales         1.4.0    2025-04-24 [?] CRAN (R 4.4.1)
 P sessioninfo    1.2.3    2025-02-05 [?] CRAN (R 4.4.1)
   stringi        1.8.4    2024-05-06 [1] CRAN (R 4.4.1)
 P stringr      * 1.5.1    2023-11-14 [?] CRAN (R 4.4.0)
 P tibble       * 3.3.0    2025-06-08 [?] CRAN (R 4.4.1)
   tidyr        * 1.3.1    2024-01-24 [1] CRAN (R 4.4.1)
   tidyselect     1.2.1    2024-03-11 [1] CRAN (R 4.4.0)
 P tidyverse    * 2.0.0    2023-02-22 [?] CRAN (R 4.4.0)
   timechange     0.3.0    2024-01-18 [1] CRAN (R 4.4.1)
   tzdb           0.5.0    2025-03-15 [1] CRAN (R 4.4.1)
 P vctrs          0.6.5    2023-12-01 [?] CRAN (R 4.4.0)
 P vroom          1.6.5    2023-12-05 [?] CRAN (R 4.4.0)
   withr          3.0.2    2024-10-28 [1] CRAN (R 4.4.1)

 [1] /Users/alessandroarrigo/Documents/GitHub/VedaWare_Policlinico/renv/library/R-4.4/aarch64-apple-darwin20
 [2] /Users/alessandroarrigo/Library/Caches/org.R-project.R/R/renv/sandbox/R-4.4/aarch64-apple-darwin20/84ba8b13

 * ── Packages attached to the search path.
 P ── Loaded and on-disk path mismatch.

alearrigo, Jul 16 '25 17:07