Pass through non-UTF8 bytes in lines preprocessor
Attempting to `CSV.decode` a stream that contains non-UTF8 bytes raises a `FunctionClauseError`:
```
** (FunctionClauseError) no function clause matching in CSV.Decoding.Preprocessing.Lines.starts_sequence?/5

The following arguments were given to CSV.Decoding.Preprocessing.Lines.starts_sequence?/5:

    # 1
    <<225, 110, 100, 101, 122>>

    # 2
    "n"

    # 3
    false

    # 4
    44

    # 5
    ""

Attempted function clauses (showing 5 out of 5):

    defp starts_sequence?(<<34::utf8(), tail::binary()>>, last_token, false, separator, _) when last_token == <<separator::utf8()>>
    defp starts_sequence?(<<34::utf8(), tail::binary()>>, "", false, separator, _)
    defp starts_sequence?(<<34::utf8(), tail::binary()>>, _, quoted, separator, sequence_start)
    defp starts_sequence?(<<head::utf8(), tail::binary()>>, _, quoted, separator, sequence_start)
    defp starts_sequence?("", _, quoted, _, sequence_start)

code: result = CSV.decode(stream) |> Enum.to_list()
stacktrace:
  (csv 2.4.1) CSV.Decoding.Preprocessing.Lines.starts_sequence?/5
  (csv 2.4.1) lib/csv/decoding/preprocessing/lines.ex:85: CSV.Decoding.Preprocessing.Lines.start_sequence/3
  (elixir 1.13.0) lib/stream.ex:902: Stream.do_transform_user/6
```
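For reference, something like the following reproduces the crash (a sketch: the byte values are taken from the first argument printed in the error above, and any stream whose lines contain bytes that are not valid UTF-8 will do):

```elixir
# A single "line" containing Latin-1 bytes (<<225>> is "á" in Latin-1, which
# is not a valid UTF-8 sequence), taken from the error above.
stream = [<<225, 110, 100, 101, 122>> <> ",more,fields"] |> Stream.map(& &1)

stream
|> CSV.decode()
|> Enum.to_list()
# ** (FunctionClauseError) no function clause matching in
#    CSV.Decoding.Preprocessing.Lines.starts_sequence?/5
```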
This makes it impossible to handle encoding errors per-line or to use machinery like `Decoder`'s `replacement` option.
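For context, this is roughly the kind of per-line handling the fix is meant to enable (a sketch, not part of this PR; `latin1.csv` is a hypothetical file, and it assumes the `:replacement` option substitutes invalid sequences as documented):

```elixir
"latin1.csv"                     # hypothetical file with occasional bad bytes
|> File.stream!()
|> CSV.decode(replacement: "�")  # assumption: :replacement swaps out invalid sequences
|> Enum.each(fn
  {:ok, row}        -> IO.inspect(row)
  {:error, message} -> IO.warn("skipping bad row: #{message}")
end)
```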
The code that would prevent this crash was accidentally deleted in https://github.com/beatrichartz/csv/commit/4f5069b99b8c0e4387c9e31798aed508b3f9998f because it was "unused" for files that contain only valid UTF-8.
This PR restores the deleted clause and adds a high-level test; the existing tests cover `Decoder` and `Lexer` but not the complete pipeline.
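For illustration, a pipeline-level test along these lines would have caught the regression (a sketch; the module name is made up and the test actually added in this PR may be shaped differently):

```elixir
defmodule NonUtf8PipelineTest do
  use ExUnit.Case

  test "decoding a stream with non-UTF8 bytes does not crash the pipeline" do
    # One clean line and one line containing Latin-1 bytes (not valid UTF-8).
    stream = ["a,b,c", <<225, 110, 100, 101, 122>> <> ",y,z"] |> Stream.map(& &1)

    result = stream |> CSV.decode() |> Enum.to_list()

    # The whole stream should be consumed; the bad line surfaces as a
    # per-row tuple instead of raising a FunctionClauseError.
    assert length(result) == 2
    assert {:ok, ["a", "b", "c"]} = hd(result)
  end
end
```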
We are running into this same error with similar data.
Same here, I experienced the issue with non-UTF-8 characters.
