ocaml-csv
The BOM is read as a value in the content
Excel saves CSV files encoded in UTF-8 with a Byte Order Mark at the very beginning of the file.
The library reads those bytes and treats them as part of the content. This can raise strange errors when the first cell is quoted, as in this minimal example:
# Csv.load "bom.csv" ~separator:';';;
Exception: Csv.Failure (2, 1, "Bad '\"' in quoted field").
I am opening this issue because I cannot tell whether those bytes are present when I stream a channel: I could discard the first 3 bytes if a BOM is present, but I have no way to check for it and then restore the channel to its initial state if there is no BOM. So I think this should be handled in the library itself.
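To illustrate the limitation, here is a minimal sketch of the workaround that is available only for seekable channels (regular files). The names `utf8_bom` and `skip_utf8_bom` are hypothetical and not part of ocaml-csv; the sketch assumes a standard `in_channel` opened in binary mode. It cannot work on a pipe or a socket, because `seek_in` is not defined there, which is exactly the streaming case described above.

```ocaml
(* Hypothetical helper, not part of ocaml-csv. *)
let utf8_bom = "\xEF\xBB\xBF"

(* Try to consume a UTF-8 BOM at the current position of a seekable
   channel. Returns [true] if a BOM was found and skipped; otherwise
   rewinds the channel to where it was and returns [false]. *)
let skip_utf8_bom (ic : in_channel) : bool =
  let start = pos_in ic in
  match really_input_string ic (String.length utf8_bom) with
  | s when String.equal s utf8_bom -> true
  | _ -> seek_in ic start; false
  | exception End_of_file ->
    (* Fewer than 3 bytes were available: no BOM, rewind. *)
    seek_in ic start;
    false
```

With a file on disk, one could call `skip_utf8_bom ic` before handing `ic` to the parser; for a non-seekable stream the check has to happen inside the reader's own buffer.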
Thanks for your report. Could you open a PR adding an optional argument to check for that?
Yes, I would be happy to contribute :)
Just give me some time to go through the code and I will open a PR.
OK, I have something that works, but I have a few questions before opening a PR.
Here is my code:
let bom =
  [ [| '\xEF' ; '\xBB'; '\xBF' |] (* UTF-8 *)
  ; [| '\xFE' ; '\xFF' |] (* UTF-16 BE *)
  ; [| '\xFF' ; '\xFE' |] (* UTF-16 LE *)
  ]

(** [check_bom ic bom_list] returns [None] if the stream does not start with a BOM,
    and the number of bytes to ignore otherwise.
*)
let check_bom
  : in_channel -> (char array) list -> int option
  = fun ic bom ->
    let matched =
      List.fold_left (fun matched bom ->
          match matched with
          | Some _ -> matched (* already found one *)
          | None ->
            let length = Array.length bom in
            if (ic.in1 - ic.in0) < length then None
            else
              (* We are supposed to be at the very beginning of the file,
                 and in0 should be equal to 0. *)
              let idx = ref (ic.in0) in
              let equal = Array.for_all
                  (fun c ->
                     let res = Char.equal c (Bytes.get ic.in_buf !idx) in
                     incr idx;
                     res
                  )
                  bom in
              match equal with
              | true -> Some length
              | false -> None
        )
        None bom in
    matched

let fill_in_buf_or_Eof ic =
  …
  match check_bom ic bom with
  | None -> ()
  | Some l -> ic.in0 <- ic.in0 + l
However, this code will not work if the channel is blocking and delivers the input byte by byte (I expect to be able to read the full BOM in a single step). Do you have a helper function that could do that?
Also, I see that I need to write the code for the Lwt side of the library. Is there anything I need to take care of before starting the implementation?
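One possible shape for such a helper, as a rough sketch only: it works on a standard library `in_channel` rather than on the library's own channel record, and the names `read_up_to`, `boms`, `starts_with` and `bom_length` are hypothetical. The idea is to keep calling `input` until enough bytes have arrived (or the end of file is reached), since a single call may return fewer bytes than requested.

```ocaml
(* Hypothetical helpers, not part of ocaml-csv. *)

(* Read up to [n] bytes from [ic]. [input] may return fewer bytes than
   requested, so loop until [n] bytes are read or end of file is hit. *)
let read_up_to (ic : in_channel) (n : int) : string =
  let buf = Bytes.create n in
  let rec loop filled =
    if filled >= n then filled
    else
      match input ic buf filled (n - filled) with
      | 0 -> filled (* end of file *)
      | k -> loop (filled + k)
  in
  let filled = loop 0 in
  Bytes.sub_string buf 0 filled

(* Same byte sequences as in the code above, written as strings. *)
let boms = [ "\xEF\xBB\xBF"; "\xFE\xFF"; "\xFF\xFE" ]

let starts_with ~prefix s =
  String.length s >= String.length prefix
  && String.sub s 0 (String.length prefix) = prefix

(* Length of the BOM at the start of [head], if there is one. *)
let bom_length (head : string) : int option =
  match List.find_opt (fun bom -> starts_with ~prefix:bom head) boms with
  | Some bom -> Some (String.length bom)
  | None -> None
```

Calling `bom_length (read_up_to ic 3)` gives the number of bytes to skip. Any bytes read beyond the BOM (or all of them, when there is no BOM) are real content and would still have to be handed back to the parser's buffer.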
I’ve started the branch here: https://github.com/Chimrod/ocaml-csv/tree/bom
I’ll add a test before requesting a merge.
Fixed in #41