archieml.org
Should keys be case-sensitive?
I wonder whether non-programmers intuitively distinguish case-sensitivity — especially editors and copy editors. Having both "headline" and "Headline" doesn't seem useful, and treating them the same (e.g. forcing all keys to become lowercase in the result) may reduce a headache or two. Have you considered the tradeoffs?
Definitely worth considering. I'll check with @archietse and @mericson on how it's gotten used over the past few months, and see if there have been any cases where case sensitivity would matter. I'm guessing not, but worth a look.
The tricky part that case insensitivity would introduce is in data access, I think. If we downcase all keys and someone writes "Headline" in the document, I'd think it's reasonable for them to expect they could access it as "Headline" in the output. And having the parsers output objects where access to "headline" or "Headline" is case-insensitive would introduce a lot of overhead. I like that the current parsers just return normal JS objects and Ruby Hashes, both of which are case sensitive.
If this were just key/values, I would say parsers should do it after the "official" processing purely as an option. However, the problematic case is in arrays:
[array]
headline: First
subhed: Anything
Headline: Second
subhed: Something else
As it stands, array would only wind up with one object here, because the initial key headline is never encountered again. And this seems very likely to occur at some point. To consider:
- is this an issue we want to address in the spec? (I'd say yes, but want others to weigh in)
- would this be implemented as an option? what should it default to? (this would also be the first parsing option we allow, so how should options be implemented?)
- does possible unicode support for keys (#3) complicate this?
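To make the array problem above concrete, here is a minimal sketch of delimiter-based array parsing with a case-sensitive comparison. It is not the actual ArchieML parser: `parseArrayItems` is a hypothetical helper and the key pattern is a simplification. It only assumes the behaviour described in this thread, where the first key in an array becomes the item delimiter.

```javascript
// Simplified sketch (not the real parser): the first key seen inside an array
// becomes the delimiter, and a new object starts each time that key reappears.
function parseArrayItems(lines) {
  const items = [];
  let delimiter = null;
  let current = null;
  for (const line of lines) {
    // Simplified "key: value" pattern, assumed for illustration only.
    const match = line.match(/^\s*([^\s:]+)\s*:\s*(.*)$/);
    if (!match) continue;
    const [, key, value] = match;
    if (delimiter === null) delimiter = key;
    if (key === delimiter) { // case-sensitive comparison
      current = {};
      items.push(current);
    }
    current[key] = value;
  }
  return items;
}

const items = parseArrayItems([
  "headline: First",
  "subhed: Anything",
  "Headline: Second",
  "subhed: Something else",
]);

console.log(items.length); // 1
console.log(items[0]);
// { headline: 'First', subhed: 'Something else', Headline: 'Second' }
```

Because "Headline" never matches the delimiter "headline", both entries collapse into one object and the second subhed silently overwrites the first.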
Certainly making keys case insensitive would be better for editors and copy editors, but we also have some people who use camel-case to make keys more readable, which would get lost if we had case-insensitivity. @abstrctn made a good suggestion that this could be an option in the parser that you can toggle as you wish.
Been reading up on case sensitivity and unicode, and have some thoughts (this is new to me though, so forgive me if I'm wrong about any of this).
There are two types of situations in which parsers would need to look at case: comparing two strings for equality, and possibly mapping the characters of a key to a consistent case for storage (e.g., storing all keys in lowercase).
Comparison looks relatively straightforward, and seems like the intended use case for case folding. There are well-defined, locale-specific rules for checking whether any two characters are the same.
There are some scripts that complicate matters with words whose characters differ between uppercase and lowercase but are nevertheless linguistically equivalent, and for which (I think) case folding would not be enough to test equality.
Other letters do not have a single uppercase equivalent. For example, the German language uses the "sharp-s" character in words like "groß". This letter's uppercase equivalent is a two letter sequence 'SS' ("GROSS").
However, that seems outside the scope for a parser, even one that tries to be pretty forgiving.
How to map characters to a normalized format for the output seems trickier. In ASCII, it's straightforward to mandate that users access values using lowercase keys, but this would cause ambiguity in some languages (should GROSS become groß or gross?). The output of case folding seems to be intended only for internal use for this reason.
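A small sketch of the difference between comparing and mapping, using only built-in JavaScript string methods. JavaScript has no built-in Unicode case-folding function, so `foldKey` below approximates one by uppercasing and then lowercasing; both `foldKey` and `sameKey` are hypothetical names, and this approximation is an assumption rather than a full implementation of Unicode case folding.

```javascript
// Approximate case folding: uppercase, then lowercase.
// This is an approximation, not the full Unicode case-folding algorithm.
function foldKey(key) {
  return key.toUpperCase().toLowerCase();
}

// Compare two keys using their folded forms.
function sameKey(a, b) {
  return foldKey(a) === foldKey(b);
}

console.log("groß".toUpperCase());            // "GROSS"
console.log("GROSS".toLowerCase());           // "gross"
console.log(sameKey("Headline", "headline")); // true
console.log(sameKey("groß", "GROSS"));        // true
console.log(foldKey("groß"));                 // "gross"
```

With this approximation the comparison works, but the folded form "gross" is not what the author typed, which is why the folded string seems better kept internal than exposed as the output key.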
Proposal
So that parsers have to do as little work as possible (and have the fewest opportunities for mistakes), I would propose that we allow for comparison by way of case-folding for the purposes of:
- identifying whether a key should overwrite an existing key, even if they have different cases
- identifying item delimiter keys within arrays
- navigating to a nested key (keys within {Scope} should match those same keys prefixed with scope.)
However, keys should never be mapped or translated. Where there is a conflict, the output key should match the casing of the line that supplied the resulting value:
key: lowercase
KEY: UPPERCASE
=> { "KEY": "UPPERCASE" }
KEY: UPPERCASE
key: lowercase
=> { "key": "lowercase" }
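One way the overwrite rule above could be sketched is shown here. `foldKey` and `setKey` are hypothetical helpers, and the uppercase-then-lowercase fold is an assumed approximation of case folding, not something the spec defines.

```javascript
// Approximate case folding (assumption: uppercase, then lowercase).
function foldKey(key) {
  return key.toUpperCase().toLowerCase();
}

// Set a key on a plain object. Any existing key that matches under folding
// is removed first, so the surviving key keeps the casing of the latest line.
function setKey(target, key, value) {
  const folded = foldKey(key);
  for (const existing of Object.keys(target)) {
    if (foldKey(existing) === folded) delete target[existing];
  }
  target[key] = value;
  return target;
}

const first = setKey(setKey({}, "key", "lowercase"), "KEY", "UPPERCASE");
console.log(JSON.stringify(first));  // {"KEY":"UPPERCASE"}

const second = setKey(setKey({}, "KEY", "UPPERCASE"), "key", "lowercase");
console.log(JSON.stringify(second)); // {"key":"lowercase"}
```

The output is still an ordinary object; folding is used only to decide which existing entry a new line replaces.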
Or in the case of item delimiter keys:
[array]
Headline: UPPERCASE
headline: lowercase
[]
=> { "array": [{"Headline": "UPPERCASE"}, {"headline": "lowercase"}] }
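The delimiter example above could be sketched along these lines. As before, this is a simplified illustration rather than the real parser: `parseArrayItemsFolded` and `foldKey` are hypothetical names, and the "key: value" pattern is an assumption.

```javascript
// Approximate case folding (assumption: uppercase, then lowercase).
function foldKey(key) {
  return key.toUpperCase().toLowerCase();
}

// Simplified array parsing where the delimiter is compared by folded form,
// while each item keeps the key casing exactly as written in the document.
function parseArrayItemsFolded(lines) {
  const items = [];
  let delimiter = null;
  let current = null;
  for (const line of lines) {
    const match = line.match(/^\s*([^\s:]+)\s*:\s*(.*)$/);
    if (!match) continue;
    const [, key, value] = match;
    if (delimiter === null) delimiter = foldKey(key);
    if (foldKey(key) === delimiter) { // case-insensitive comparison
      current = {};
      items.push(current);
    }
    current[key] = value;
  }
  return items;
}

const result = {
  array: parseArrayItemsFolded(["Headline: UPPERCASE", "headline: lowercase"]),
};
console.log(JSON.stringify(result));
// {"array":[{"Headline":"UPPERCASE"},{"headline":"lowercase"}]}
```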
This would be enough to produce a consistent document structure. But we would leave it up to the consumer to allow access to the output in a case-insensitive way if they so desire. Which might mean just calling .toLowerCase() on every key, but we shouldn't assume so.
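For consumers who do want case-insensitive access, a post-processing step might look like the sketch below. `lowercaseKeys` is a hypothetical helper that runs on the parser's output; it is not part of any ArchieML parser, and it assumes simple lowercasing is acceptable for the keys in question.

```javascript
// Recursively copy parsed output, lowercasing every object key.
// If two keys differ only by case, the later one wins.
function lowercaseKeys(value) {
  if (Array.isArray(value)) return value.map(lowercaseKeys);
  if (value !== null && typeof value === "object") {
    const out = {};
    for (const [key, inner] of Object.entries(value)) {
      out[key.toLowerCase()] = lowercaseKeys(inner);
    }
    return out;
  }
  return value;
}

const parsed = { array: [{ Headline: "UPPERCASE" }, { headline: "lowercase" }] };
const normalized = lowercaseKeys(parsed);
console.log(normalized.array[0].headline); // "UPPERCASE"
console.log(normalized.array[1].headline); // "lowercase"
```

Keeping this outside the parser means the parsed output still reflects what the author typed, and each consumer can pick its own normalization.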