Default (Tabular) Data Resource encoding UTF-8: with or without BOM?
The Data Resource and Tabular Data Resource specs say that if no encoding is given, it is UTF-8. They don't further specify whether this is UTF-8 with or without a BOM.
I know a BOM is not recommended for UTF-8, but we've got Excel on Windows spewing out UTF-8 with a BOM (the byte sequence EF BB BF, which decodes to U+FEFF), so it's not uncommon in the wild.
So I guess there are two questions:
- Should the datapackage default encoding say anything about with or without BOM?
- How should (datapackage) software handle BOMs in UTF-8 files?
  - Should it preserve or strip the BOM in the decoded string you use in your code?
  - Should it preserve the BOM (keep the exact encoding) or strip it (accept many inputs, give only one - canonical - output) when writing back to an encoded file?
Preserve or strip BOM in the decoded string you use in your code?
I've seen many JavaScript CSV readers automatically remove the BOM when reading a UTF-8 file. If they didn't, there would be a hidden/non-printable character in the first column header. That can lead to some hard-to-find bugs.
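For illustration, a minimal sketch of that failure mode, assuming a naive reader that keeps whatever the decoder returns:

```js
// The decoded string still starts with U+FEFF, so the first
// header is "\ufeffid" rather than "id".
const text = '\ufeffid,name\n1,alice\n';
const headers = text.split('\n')[0].split(',');

console.log(headers[0] === 'id');      // false - hidden BOM prefix
console.log(headers[0].charCodeAt(0)); // 65279 (0xFEFF)
```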
But, for example, Node.js' fs module doesn't strip it. If fs.readFileSync() stripped the BOM automatically, then after

```js
var text = fs.readFileSync('foo.txt', 'utf8');
fs.writeFileSync('foo.txt', text, 'utf8');
```

the BOM would be lost...
Both are good arguments, but I lean towards stripping. Preserving seems more of a low-level library thing to do.
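For what it's worth, stripping is also trivial to do on top of fs; a minimal sketch (just an illustration, not any particular library's API):

```js
const fs = require('fs');

// Strip a single leading U+FEFF from the decoded string, if present.
function stripBom(text) {
  return text.charCodeAt(0) === 0xFEFF ? text.slice(1) : text;
}

const text = stripBom(fs.readFileSync('foo.txt', 'utf8'));
```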
Preserve or strip BOM when writing to a file?
Always stripping the BOM and writing UTF-8 without BOM would follow the robustness principle: you have to accept UTF-8 with BOM as expected input, but at least you're helping make the world a little more uniform.
But, changing an encoding as a side-effect of your program can be a "confusing" feature. Also, some software (e.g. Excel) relies on the BOM for correct encoding identification. So preserving it is not a bad idea either.
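To illustrate the write side: prepending U+FEFF before encoding is all it takes to produce a file Excel will recognise as UTF-8 (a sketch, not a recommendation either way):

```js
const fs = require('fs');

// Prepending U+FEFF writes the three BOM bytes (EF BB BF),
// which Excel uses to detect UTF-8.
const csv = 'id,name\n1,Zoë\n';
fs.writeFileSync('out.csv', '\ufeff' + csv, 'utf8');
```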
My gut says: don't include the BOM in decoded strings, and strip the BOM when writing. What do you think?
@jheeffer first of all, thanks for opening this - it's really good to get these details ironed out.
I generally agree with your intuition and actually came across this issue myself, I think, with some World Bank data. I guess my question here is whether this should be a MUST or a SHOULD. Basically, a MUST puts a burden on publishers whilst a SHOULD puts a burden on consumers (and library maintainers). My general approach has been to try to strike a balance: generally publishers are less expert than tool developers, but I also want to keep things simpler for tool developers.
As I read it, your proposal strikes this balance and implies a SHOULD, though with the added requirement that tools have to handle reading UTF-8 with the BOM and be ready to strip it.
wdyt?
What should be in the spec?
My idealist and simplicity-loving gut says data resource files MUST be UTF-8 without BOM. Software MUST accept input with BOM and MUST write without BOM.
The Unicode Consortium's recommendation is that files SHOULD be without BOM (and it says nothing about stripping):
> 2.6 Encoding Schemes ... Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature. See the “Byte Order Mark” subsection in Section 16.8, Specials, for more information.
Furthermore, it says that if encoding is otherwise signaled, as it is in datapackages, the BOM SHOULD NOT be used:
> Unicode Signature. An initial BOM may also serve as an implicit marker to identify a file as containing Unicode text. [...] Data streams (or files) that begin with the U+FEFF byte order mark are likely to contain Unicode characters. It is recommended that applications sending or receiving untyped data streams of coded characters use this signature. If other signaling methods are used, signatures should not be employed.
So "SHOULD be UTF-8 without BOM" and "SHOULD strip the BOM when saving as a datapackage" would follow the Unicode standard. Being stricter, like my gut wants, could get stakeholders into trouble because their other UTF-8 tools rely on the BOM.
Reality bites
However, I'm afraid of unaware users - who are probably the vast majority - downloading datasets with non-ASCII UTF-8 characters, trying to open them in Excel, getting gibberish, and thus dismissing the datasets as broken. From that perspective you might even argue for MUST encode as UTF-8 with BOM.
I.e. the burden on consumers isn't the spec saying SHOULD instead of MUST; it's UTF-8 without BOM itself, which makes files unreadable in the world's most used spreadsheet application.
Yes, you can still open them through the Data tab using "From Text" and choosing the right encoding, but I wonder how many data consumers know about that. From my experience: few.
I feel it's a debate between idealism and pragmatism. A world without (mainly Microsoft's) use of the BOM as a UTF-8 signature would not need the BOM (encoding sniffing is a pretty well developed and widely applied alternative). Sadly, we don't live in that world, and we don't want dataset consumers to have to learn about encodings. If we want open data to be widely accepted, we have to set the bar as low as possible.
I'm torn on this issue.
One way out is that "SHOULD be without BOM" in the spec gives enough leeway to save with a BOM for Excel compatibility. However, if your aim is truly frictionless data, and given that Excel is so ubiquitous, "SHOULD be with BOM" would be a more appropriate choice for the spec, even though it goes against the standard.
@jheeffer it sounds like it's clear that consumer libraries SHOULD / MUST strip the BOM, given that some publishers may include it, especially if they want Excel to consume the data. Does that sound like a clear start (with the question of what publishers should do still to be worked out)?
Yes, I'd be in favour of consumer libraries stripping the BOM. It would be easiest for users of these libraries; the BOM is a low-level detail very few people are interested in when reading data. Additionally, I propose adding that if a library supports returning metadata, BOM presence SHOULD be part of it (e.g. https://www.papaparse.com/docs#meta, and the config object in https://www.papaparse.com/docs#json-to-csv - note: that's where it would go; papaparse doesn't support BOM metadata atm).
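For illustration, a hypothetical sketch of what surfacing BOM presence as metadata could look like (readText and its meta.bom field are invented for this example; papaparse's actual API differs):

```js
// Hypothetical API: the reader strips the BOM from the data
// but records that it was present in the returned metadata.
function readText(raw) {
  const hadBom = raw.charCodeAt(0) === 0xFEFF;
  return {
    text: hadBom ? raw.slice(1) : raw,
    meta: { bom: hadBom }, // BOM presence exposed as metadata
  };
}

const result = readText('\ufeffid,name\n1,alice\n');
console.log(result.meta.bom);         // true
console.log(result.text.slice(0, 2)); // "id" - clean header
```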