
The BOM is read as a value in the content

Open Chimrod opened this issue 3 years ago • 4 comments

Excel saves CSV files encoded in UTF-8 with a Byte Order Mark (BOM) at the very beginning of the file.

The library reads those bytes and treats them as part of the content. This can raise strange errors when the first cell is quoted, as in this minimal example:

bom.csv

# Csv.load "bom.csv" ~separator:';';;
Exception: Csv.Failure (2, 1, "Bad '\"' in quoted field").

I am opening this issue because I can’t tell whether those bytes are present when I stream a channel: I could discard the first 3 bytes if a BOM is present, but I have no way to check for it and then restore the channel to its initial state if there is no BOM. So I think this should be handled in the library itself.
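To illustrate the "check, then revert" problem described above, here is a minimal sketch that uses only the OCaml standard library. The helper name `skip_utf8_bom` is hypothetical and not part of ocaml-csv. It works because a regular file supports `seek_in`; a pipe or socket does not, and that is the case a caller cannot handle from outside the library.

```ocaml
(* Hypothetical helper, not part of ocaml-csv: skips a UTF-8 BOM at the
   current position of a standard-library input channel.
   Assumption: the channel is seekable (a regular file). *)
let utf8_bom = "\xEF\xBB\xBF"

let skip_utf8_bom (ic : in_channel) : bool =
  let start = pos_in ic in
  let found =
    match really_input_string ic (String.length utf8_bom) with
    | prefix -> String.equal prefix utf8_bom
    | exception End_of_file -> false (* fewer than 3 bytes available *)
  in
  (* No BOM: restore the channel to where it was before the check. *)
  if not found then seek_in ic start;
  found
```

After the call, the channel is positioned either just past the BOM or at its original position, so it can be handed to the CSV reader unchanged.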

Chimrod avatar Jun 09 '22 12:06 Chimrod

Thanks for your report. Could you open a PR adding an optional argument to check for that?

Chris00 avatar Jun 09 '22 20:06 Chris00

Yes, I would be happy to contribute :)

Just give me some time to go through the code and I will open a PR.

Chimrod avatar Jun 10 '22 08:06 Chimrod

OK, I have something that works, but I have a few questions before opening a PR.

Here is my code:

let bom = 
    [ [| '\xEF' ; '\xBB'; '\xBF' |]  (* UTF-8 *)
    ; [| '\xFE' ; '\xFF'  |]         (* UTF-16 BE *)
    ; [| '\xFF' ; '\xFE'  |]         (* UTF-16 LE *)
    ]

(** [check_bom ic bom_list] returns [None] if the stream does not start with
    a BOM, and [Some n] otherwise, where [n] is the number of bytes to skip.
 *)
let check_bom
 : in_channel -> (char array) list -> int option 
 = fun ic bom -> 
     let matched = 
         List.fold_left (fun matched bom -> 
           match matched with 
           | Some _ -> matched (* already found one *)
           | None -> 
             let length = Array.length bom in
             if (ic.in1 - ic.in0) < length then None
             else
               (* We are supposed to be at the very beginning of the file,
                  and in0 should be equal to 0. *)
               let idx = ref (ic.in0) in
               let equal = Array.for_all 
                 (fun c -> 
                     let res = Char.equal c (Bytes.get ic.in_buf !idx) in 
                     incr idx;
                     res
                 ) 
                 bom in 
               match equal with 
               | true -> Some length
               | false -> None
               ) 
         None bom in 
     matched


let fill_in_buf_or_Eof ic =
                 …
                 match check_bom ic bom with 
                 | None -> ()
                 | Some l -> ic.in0 <- ic.in0 + l


However, this code will not work if the channel is blocking and delivers the input byte by byte (I expect to be able to read the full BOM in a single step). Do you have a helper function that could do that?

Also, I see that I need to write the same code on the Lwt side of the library. Is there anything I need to take care of before starting the implementation?
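One possible way around the byte-by-byte problem is to make the check three-valued, so the caller knows when to refill the buffer and try again. The sketch below is only an illustration under assumptions: the type `bom_check` and this standalone `check_bom` over a `Bytes.t` buffer are hypothetical names, and it is not how the library or the eventual fix works.

```ocaml
(* Hypothetical sketch, not the library's code: classify the bytes
   currently available in a buffer against the known BOMs. *)
type bom_check =
  | Bom of int   (* a complete BOM is present; number of bytes to skip *)
  | Need_more    (* available bytes are a strict prefix of some BOM *)
  | No_bom       (* no BOM can match the available bytes *)

let boms = [ "\xEF\xBB\xBF" (* UTF-8 *); "\xFE\xFF" (* UTF-16 BE *);
             "\xFF\xFE" (* UTF-16 LE *) ]

(* [check_bom buf pos len] inspects the [len] bytes of [buf] starting at
   [pos]. An empty range yields [Need_more]; on end of input the caller
   should treat [Need_more] as "no BOM". *)
let check_bom (buf : Bytes.t) (pos : int) (len : int) : bom_check =
  let classify bom =
    let n = String.length bom in
    let k = min n len in
    let rec prefix_ok i =
      i >= k || (Bytes.get buf (pos + i) = bom.[i] && prefix_ok (i + 1))
    in
    if not (prefix_ok 0) then No_bom
    else if len >= n then Bom n
    else Need_more
  in
  List.fold_left
    (fun acc bom ->
       match acc, classify bom with
       | Bom _, _ -> acc
       | _, (Bom _ as found) -> found
       | Need_more, _ | _, Need_more -> Need_more
       | No_bom, No_bom -> No_bom)
    No_bom boms
```

With this shape, the reading loop would keep filling the buffer while the result is `Need_more`, skip the given number of bytes on `Bom n`, and proceed normally on `No_bom`.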

Chimrod avatar Jun 10 '22 13:06 Chimrod

I’ve started the branch here: https://github.com/Chimrod/ocaml-csv/tree/bom

I’ll add a test before requesting a merge.

Chimrod avatar Jun 13 '22 06:06 Chimrod

Fixed in #41

SGrondin avatar Oct 07 '23 17:10 SGrondin