Fail to parse UTF-8 file with BOM

Open abailly opened this issue 9 years ago • 10 comments

When trying to decode a CSV file with a BOM (U+FEFF at the beginning of the file), decoding fails with the following error:

*** Exception: parse error (Failed reading: satisfy) at " D1 "," Account Number "," Value Date "," Date "," Time "," Description "," Your Reference "," Our  (truncated)

A possible solution would be to allow the user to pass a Text instead of a ByteString?

abailly avatar Dec 12 '15 17:12 abailly

Just hit this as well. Very annoying!

3noch avatar Oct 09 '17 17:10 3noch

I also faced this issue :( And I spent a lot of time discovering that the problem was actually the BOM... A better error message would be appreciated!

chshersh avatar Nov 18 '17 15:11 chshersh

I fell into the same trap: https://github.com/haskell-hvr/cassava/issues/160. The error message one gets in that case is really not particularly helpful. I guess that's the downside of using attoparsec for performance.

peti avatar Mar 17 '18 13:03 peti

A possible solution would be to allow user to pass a Text instead of a ByteString ?

That would be an interesting idea (it would be interesting to see if this can be done w/o duplicating cassava's API), but even then you'd have to deal with a BOM somewhere (as the BOM code-point would still be part of the Text).
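To illustrate that last point, here is a minimal sketch assuming the text package's decodeUtf8. The names stripTextBom, sample and decoded are made up for this example and are not part of cassava or text. It shows that the BOM survives decoding as the first character and has to be dropped explicitly:

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString as BS
import           Data.Maybe (fromMaybe)
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Hypothetical helper (not part of cassava or text): drop one leading
-- U+FEFF code point if present, otherwise return the input unchanged.
stripTextBom :: T.Text -> T.Text
stripTextBom t = fromMaybe t (T.stripPrefix "\xFEFF" t)

-- The three UTF-8 BOM bytes followed by the ASCII bytes of "a,b".
sample :: BS.ByteString
sample = BS.pack [0xEF, 0xBB, 0xBF, 0x61, 0x2C, 0x62]

-- Decoding does not remove the BOM; it becomes the first Char of the Text.
decoded :: T.Text
decoded = TE.decodeUtf8 sample
```

So a Text-based API would move the problem rather than remove it: the caller would still need something like stripTextBom before parsing.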

What we can do in any case is improve the documentation to warn about this, and give a simple recipe for filtering out a BOM if the user expects one. Personally, I consider a BOM in UTF-8 data a sign that something's wrong, as UTF-8 BOMs cause a lot of interoperability issues all over the place; that's why I wouldn't want cassava to silently strip them out by default.

As for the recipe, bytestring offers, for example, the following function:

stripPrefix :: ByteString -> ByteString -> Maybe ByteString 

So the recipe would simply be something like

stripUtf8Bom :: BS.ByteString -> BS.ByteString
stripUtf8Bom bs = fromMaybe bs (BS.stripPrefix "\239\187\191" bs)

EDIT: fixed stripUtf8Bom as pointed out by https://github.com/haskell-hvr/cassava/issues/106#issuecomment-379228397
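For copy-paste purposes, a self-contained variant of the recipe might look like the sketch below. It spells the BOM out as explicit bytes so that it needs neither OverloadedStrings nor string escapes; the name utf8Bom is introduced here only for illustration.

```haskell
import qualified Data.ByteString as BS
import           Data.Maybe (fromMaybe)

-- The UTF-8 encoding of U+FEFF, written as explicit bytes so that no
-- string-literal escape rules or OverloadedStrings are involved.
utf8Bom :: BS.ByteString
utf8Bom = BS.pack [0xEF, 0xBB, 0xBF]

-- Same behaviour as the recipe above: strip one leading BOM if present,
-- otherwise return the input unchanged.
stripUtf8Bom :: BS.ByteString -> BS.ByteString
stripUtf8Bom bs = fromMaybe bs (BS.stripPrefix utf8Bom bs)
```

Since cassava's decode works on lazy ByteStrings, the stripped result would presumably be converted with Data.ByteString.Lazy.fromStrict before being handed over.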

hvr avatar Mar 18 '18 10:03 hvr

I had to hack around a BOM code point in Text, as alluded to above.

ghost avatar Apr 05 '18 04:04 ghost

I asked about a solution on SO when I first encountered this problem:

  • https://stackoverflow.com/questions/47367728/simplest-way-to-remove-bom-from-haskell-bytestring

Maybe the solution from SO will be faster than stripPrefix, since I expect the take and drop functions not to do any copying at all (just slicing).

chshersh avatar Apr 05 '18 09:04 chshersh

@ChShersh stripPrefix doesn't do any copying either
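For comparison, a take/drop version along the lines discussed above could be sketched as follows (stripBomTakeDrop and utf8Bom are illustrative names, and this is not necessarily the exact code from the SO answer). On strict ByteStrings both take and drop return slices of the original buffer, so neither this nor stripPrefix should copy the payload:

```haskell
import qualified Data.ByteString as BS

-- The UTF-8 BOM as explicit bytes.
utf8Bom :: BS.ByteString
utf8Bom = BS.pack [0xEF, 0xBB, 0xBF]

-- take/drop variant: compare the first three bytes against the BOM and,
-- on a match, return the rest of the input as a slice.
stripBomTakeDrop :: BS.ByteString -> BS.ByteString
stripBomTakeDrop bs
  | BS.take 3 bs == utf8Bom = BS.drop 3 bs
  | otherwise               = bs
```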

hvr avatar Apr 05 '18 09:04 hvr

First of all: thanks a bunch to @hvr for this elegant and simple solution.

However, unless I'm mistaken, \357\273\277 is the UTF-8 BOM expressed in octal, and trying to strip with this string doesn't work on my setup; I suspect it typically wouldn't. It works if you express it in decimal, using \239\187\191, or if you make the use of octal explicit: \o357\o273\o277.
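A small sketch of the difference, using only Data.Char.ord (the binding names are made up for illustration): a bare numeric escape in a Haskell string literal is read as decimal, while \o and \x select octal and hexadecimal.

```haskell
import Data.Char (ord)

-- A bare numeric escape is decimal, so these are the code points
-- 357, 273 and 277, not the BOM bytes.
bareEscapes :: [Int]
bareEscapes = map ord "\357\273\277"

-- Explicit octal escapes: the three BOM byte values.
octalEscapes :: [Int]
octalEscapes = map ord "\o357\o273\o277"

-- Decimal escapes for the same three byte values.
decimalEscapes :: [Int]
decimalEscapes = map ord "\239\187\191"

-- Hexadecimal escapes, matching the usual EF BB BF notation.
hexEscapes :: [Int]
hexEscapes = map ord "\xEF\xBB\xBF"
```

The last three lists all come out as 239, 187, 191. As far as I know, the IsString instance for ByteString truncates each Char to 8 bits, so the bare version would silently turn into three unrelated bytes rather than fail to compile.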

adfretlink avatar Apr 06 '18 11:04 adfretlink

@adfretlink good catch!

hvr avatar Apr 07 '18 09:04 hvr

I understand the decision to not directly address this, as it isn't strictly in scope for CSV, even if I personally think it's easier to fix directly on pareto-y grounds, particularly since it seems like plenty of extant programs continue to put out BOMs.

With that said, better error messages would be very helpful here, as this was unnecessarily painful to debug.
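Until then, a user-side pre-check can at least produce a readable message. The following is only a sketch of that idea; rejectUtf8Bom is a hypothetical helper, not something cassava provides:

```haskell
import qualified Data.ByteString as BS

-- Hypothetical pre-check, not part of cassava: report a leading UTF-8 BOM
-- explicitly instead of letting the CSV parser fail on it.
rejectUtf8Bom :: BS.ByteString -> Either String BS.ByteString
rejectUtf8Bom bs
  | BS.pack [0xEF, 0xBB, 0xBF] `BS.isPrefixOf` bs =
      Left "input starts with a UTF-8 BOM (EF BB BF); strip it before decoding"
  | otherwise = Right bs
```

Because it returns Either String, it could be chained in front of a decoding step that reports errors the same way.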

tysonzero avatar Oct 05 '21 19:10 tysonzero