JSON parser breaks in the presence of a UTF BOM
I just tried reading a JSON file that I wrote using PowerShell and got the following failure:
```
(Error: (Corrupted_Format (File_Like.Value (File foo.json)) 'Parse error in parsing JSON: Unexpected character (\'\' (code 65279 / 0xfeff)): expected a valid value (JSON String, Number, Array, Object or token \'null\', \'true\' or \'false\') at position [line: 1, column: 2].' (Invalid_JSON.Error 'Unexpected character (\'\' (code 65279 / 0xfeff)): expected a valid value (JSON String, Number, Array, Object or token \'null\', \'true\' or \'false\') at position [line: 1, column: 2]')))
```
The 0xFEFF character, the Unicode byte order mark (BOM), seems to be the problem. I imagine we need to strip it, along with any leading whitespace, before parsing.
A minimal repro of the bug is:

```
'\ufeff{}'.parse_json
```
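For what it's worth, the wording of the error looks like Jackson's, so here is a minimal Java sketch of the workaround (Jackson as the underlying parser is an assumption on my part, and the class name is made up):

```java
import com.fasterxml.jackson.databind.ObjectMapper;

public class BomStripRepro {
    public static void main(String[] args) throws Exception {
        String input = "\uFEFF{}"; // same payload as the Enso repro above

        // Parsing `input` directly fails with the same "Unexpected character
        // (code 65279 / 0xfeff)" message; dropping a leading U+FEFF first
        // makes the document parse cleanly.
        if (!input.isEmpty() && input.charAt(0) == '\uFEFF') {
            input = input.substring(1);
        }
        System.out.println(new ObjectMapper().readTree(input)); // prints {}
    }
}
```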
I think Enso should be able to handle files encoded with a BOM.
We may want to revise how our CSV parser handles such cases too.
## Changes
By default, we should read and consume the BOM when reading text in from a stream. Specifically:

- Our default "charset" should be cleverer: when a BOM is present, use it to determine which Unicode charset to decode with (see the sketch below).
- If the default "charset" then encounters an invalid character, we should fall back to Windows-1252.
- The same behaviour should apply when reading plain text, delimited tables, JSON, and XML.
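In case it helps, here is a rough Java sketch of that detect-then-fall-back idea (all names are invented for illustration, not Enso's actual API; the fallback step leans on Java's Windows-1252 decoder accepting any byte sequence):

```java
import java.io.IOException;
import java.io.PushbackInputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class SmartCharset {

    // Inspect up to four bytes for a known BOM. A single read is enough for
    // a sketch; production code would loop until 4 bytes or EOF. UTF-32
    // patterns are tested before UTF-16 because they share leading bytes.
    // The UTF-32 charsets are available in standard JDKs.
    static Charset detect(PushbackInputStream in) throws IOException {
        byte[] b = new byte[4];
        int n = in.read(b, 0, 4);
        if (n <= 0) return StandardCharsets.UTF_8;
        if (n >= 4 && b[0] == 0 && b[1] == 0 && b[2] == (byte) 0xFE && b[3] == (byte) 0xFF)
            return Charset.forName("UTF-32BE");
        if (n >= 4 && b[0] == (byte) 0xFF && b[1] == (byte) 0xFE && b[2] == 0 && b[3] == 0)
            return Charset.forName("UTF-32LE");
        if (n >= 3 && b[0] == (byte) 0xEF && b[1] == (byte) 0xBB && b[2] == (byte) 0xBF) {
            in.unread(b, 3, n - 3); // consume the BOM, push back the rest
            return StandardCharsets.UTF_8;
        }
        if (n >= 2 && b[0] == (byte) 0xFE && b[1] == (byte) 0xFF) {
            in.unread(b, 2, n - 2);
            return StandardCharsets.UTF_16BE;
        }
        if (n >= 2 && b[0] == (byte) 0xFF && b[1] == (byte) 0xFE) {
            in.unread(b, 2, n - 2);
            return StandardCharsets.UTF_16LE;
        }
        in.unread(b, 0, n); // no BOM: push everything back, assume UTF-8
        return StandardCharsets.UTF_8;
    }

    // Strict decode in the detected charset; on any invalid sequence,
    // retry as Windows-1252.
    static String decode(byte[] bytes, Charset detected) {
        try {
            return detected.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
        } catch (CharacterCodingException e) {
            return new String(bytes, Charset.forName("windows-1252"));
        }
    }
}
```

The caller would wrap the raw stream as `new PushbackInputStream(raw, 4)` so that non-BOM bytes can be pushed back before decoding begins.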
The error reporting should include only the first few failing indexes (three, say) plus a count of the total number of failures.
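Something along these lines, for example (the helper's name and signature are invented):

```java
import java.util.List;
import java.util.stream.Collectors;

final class FailureSummary {
    // Show at most the first three failing indexes, then a count of the rest.
    static String summarize(List<Integer> failingIndexes) {
        int shown = Math.min(3, failingIndexes.size());
        String head = failingIndexes.stream()
            .limit(shown)
            .map(String::valueOf)
            .collect(Collectors.joining(", "));
        int rest = failingIndexes.size() - shown;
        return rest > 0 ? head + " and " + rest + " more" : head;
    }
}
```

So `FailureSummary.summarize(List.of(2, 5, 9, 14, 21))` would render as `2, 5, 9 and 2 more`.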