Pass through non-UTF8 bytes in lines preprocessor

Open al2o3cr opened this issue 2 years ago • 2 comments

Attempting to CSV.decode a stream that contains non-UTF8 bytes raises a FunctionClauseError:

     ** (FunctionClauseError) no function clause matching in CSV.Decoding.Preprocessing.Lines.starts_sequence?/5

     The following arguments were given to CSV.Decoding.Preprocessing.Lines.starts_sequence?/5:

         # 1
         <<225, 110, 100, 101, 122>>

         # 2
         "n"

         # 3
         false

         # 4
         44

         # 5
         ""

     Attempted function clauses (showing 5 out of 5):

         defp starts_sequence?(<<34::utf8(), tail::binary()>>, last_token, false, separator, _) when last_token == <<separator::utf8()>>
         defp starts_sequence?(<<34::utf8(), tail::binary()>>, "", false, separator, _)
         defp starts_sequence?(<<34::utf8(), tail::binary()>>, _, quoted, separator, sequence_start)
         defp starts_sequence?(<<head::utf8(), tail::binary()>>, _, quoted, separator, sequence_start)
         defp starts_sequence?("", _, quoted, _, sequence_start)

     code: result = CSV.decode(stream) |> Enum.to_list()
     stacktrace:
       (csv 2.4.1) CSV.Decoding.Preprocessing.Lines.starts_sequence?/5
       (csv 2.4.1) lib/csv/decoding/preprocessing/lines.ex:85: CSV.Decoding.Preprocessing.Lines.start_sequence/3
       (elixir 1.13.0) lib/stream.ex:902: Stream.do_transform_user/6

Because the crash happens in the lines preprocessor, before any row reaches the decoder, it is impossible to handle encoding errors per-line or to use machinery like Decoder's replacement option.

The code that would prevent this crash was accidentally deleted in https://github.com/beatrichartz/csv/commit/4f5069b99b8c0e4387c9e31798aed508b3f9998f because it is "unused" for files that only contain valid UTF8.

This PR restores the deleted clause and adds a high-level test; existing tests cover Decoder and Lexer but not the complete pipeline.
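To illustrate why the trace above fails, here is a minimal sketch of the pattern-matching behaviour involved. It is not the library's code: the module `Utf8FallbackSketch` and the function `first_token/1` are hypothetical names made up for this example. A `::utf8` segment only matches a validly encoded codepoint, so a binary starting with byte 225 followed by non-continuation bytes matches none of the `::utf8` clauses. A clause that takes one raw byte gives those bytes somewhere to go.

```elixir
defmodule Utf8FallbackSketch do
  # Hypothetical module for illustration only; not part of the csv library.

  # Matches only when the binary starts with a valid UTF-8 encoded codepoint.
  def first_token(<<head::utf8, tail::binary>>), do: {:utf8, <<head::utf8>>, tail}

  # Fallback clause: consume one raw byte so that invalid sequences are
  # passed through instead of raising FunctionClauseError.
  def first_token(<<head, tail::binary>>), do: {:raw, <<head>>, tail}

  # End of input.
  def first_token(""), do: :done
end

# The same bytes as argument #1 in the trace above.
input = <<225, 110, 100, 101, 122>>

IO.inspect(String.valid?(input))
IO.inspect(Utf8FallbackSketch.first_token(input))
```

Without the second clause, the last call would raise a `FunctionClauseError` much like the one reported here. With it, the invalid byte is handed on unchanged, and a later stage can presumably decide whether to replace it or report an error for that line.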

al2o3cr avatar May 15 '22 15:05 al2o3cr

We are running into this same error with similar data.

aezell avatar Jun 07 '22 22:06 aezell

Same here, we experienced the issue with non-UTF-8 characters.

nwai90 avatar Jun 17 '22 10:06 nwai90