Add &nnbsp; entity for U+202F
There's &nbsp; for U+00A0. It's a full-width no-break space. It can be used between numbers and their short unit names, or in other places.
Typography and regional norms require (or at least recommend) using a thin no-break space (or narrow no-break space) in several places:
- As thousands separator, Source or DIN 5008 (to avoid ambiguous presentation of point or comma)
- Between abbreviated words like “z. B.” (German: zum Beispiel), Source
- As fine space before certain punctuation in French, Source
(These are the first and best sources I could find now. There may be better or more authoritative sources available, but they're usually hard to find.)
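As a rough illustration of the thousands-separator case, here is a minimal Python sketch that groups digits with U+202F. The name `group_thousands` is made up for this example and does not come from any of the sources above.

```python
NNBSP = "\u202f"  # NARROW NO-BREAK SPACE

def group_thousands(n: int) -> str:
    """Format an integer with U+202F between groups of three digits."""
    # Python's "," format option groups by thousands; swap the comma for NNBSP.
    return f"{n:,}".replace(",", NNBSP)

# repr() is used so the otherwise invisible separator shows up as \u202f.
print(repr(group_thousands(1234567)))
```

Printing the `repr` is deliberate: in most editors and terminals the separator cannot be told apart from an ordinary space.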
While it is technically possible to create a keyboard layout that produces this character, not many users have this installed and even then it's hard to distinguish it from other space characters when reading and revising text. Most editors don't even show a replacement symbol for this space character.
AFAIK Wikipedia suggests writing &nbsp; in these places. And that's probably a good idea in team projects as well. But this is actually the wrong character in these places.
To use the correct narrow no-break space, one has to use a different HTML entity representation, like &#8239; or &#x202F; which are frankly hard to remember or recognise.
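The two numeric forms of U+202F (decimal 8239 and hexadecimal 202F) can be checked with Python's standard `html` module, which decodes character references using the HTML5 entity table. This is only a quick check, not a statement about how any particular browser is implemented.

```python
import html
from html.entities import html5

NNBSP = "\u202f"  # NARROW NO-BREAK SPACE

# Decimal and hexadecimal numeric character references for the same code point.
decimal_form = html.unescape("10&#8239;000")
hex_form = html.unescape("10&#x202F;000")

# The named reference that exists today yields U+00A0, a different character.
nbsp_form = html.unescape("10&nbsp;000")

print(decimal_form == hex_form == f"10{NNBSP}000")  # both numeric forms decode to U+202F
print(nbsp_form == "10\u00a0000")                   # &nbsp; decodes to U+00A0
print("nnbsp;" in html5)                            # is there an "nnbsp" name in the table?
```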
As a solution, the new entity &nnbsp; should be added to HTML to make it easy to write readable text following the correct typographic rules and recommendations.
If a new entity is added, the effort should be coordinated with MathML to keep the entity definitions synchronized: https://w3c.github.io/xml-entities/
Mozilla is not interested in this. I guess that's a bad starting point already? I don't have the best experiences with the Chrome developers, maybe I'll try it there anyway.
Unfortunately, entities are not extensible in HTML, so I can't even run my own little happy solution.
If the HTML standard evolves, Mozilla and the others must follow the new specification; that much is obvious.
I'm currently interested in having an &nnbsp; entity, or an equivalent, for a French wiki project, as the narrow non-breaking space is recommended in some cases, as explained by ygoe.
Furthermore, HTML entities exist for numerous characters that, in my opinion, are almost never used, like ≺ and such.
In my opinion this would be extremely useful for French authors, but also for other languages. The NNBSP character was initially added to Unicode for Mongolian suffix handling, where it is important to visually distinguish between spaces separating suffixes and those separating words. It is also being proposed as an ideal fit for a morphological separator in the numerous languages written in the Canadian Aboriginal script (see https://github.com/w3c/amlreq/issues/4). An entity would significantly help authors produce correct (and better machine-readable) text in all these languages.
[@annevk could you add i18n-mlreq and i18n-amlreq labels to the repo, so i can alert those folks to the discussion? Thanks.]
Here is an extension of this issue, which i can raise in a new issue if preferred.
There are other invisible characters for which a named character reference would be very useful for producing correctly authored Unicode text, for the same reasons as mentioned in the first comment. Here, for example, is a list of formatting characters used for Arabic, but most are essential characters for all RTL script-based languages.
Characters with entities: &zwj; (U+200D), &zwnj; (U+200C), &rlm; (U+200F), &lrm; (U+200E)
Characters without entities: RLI LRI FSI PDI RLE LRE PDF RLM LRM CGJ ALM
Keyboards generally don't address the problem of inputting the characters, but it's also a problem that the characters themselves are invisible. It would really help to have Named character references. As someone who works with people who use these languages, and works with them myself, it seems to me that from a user's perspective it would be well worth the effort to add them. I don't remember why that hasn't happened before now.
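To make concrete what authors currently have to memorise, here is a small Python sketch that maps the mnemonics listed above to their code points and prints the numeric character reference for each. The table and the helper name `numeric_reference` are illustrative only; the mnemonics are the customary Unicode abbreviations, not existing HTML names.

```python
import unicodedata

# Mnemonic -> code point for the formatting characters listed above.
FORMATTING_CHARS = {
    "RLI": 0x2067, "LRI": 0x2066, "FSI": 0x2068, "PDI": 0x2069,
    "RLE": 0x202B, "LRE": 0x202A, "PDF": 0x202C,
    "RLM": 0x200F, "LRM": 0x200E, "CGJ": 0x034F, "ALM": 0x061C,
}

def numeric_reference(mnemonic: str) -> str:
    """Return the hexadecimal numeric character reference for a mnemonic."""
    return f"&#x{FORMATTING_CHARS[mnemonic]:04X};"

# One row per character: mnemonic, code point, reference, Unicode name.
for mnemonic, code_point in FORMATTING_CHARS.items():
    name = unicodedata.name(chr(code_point))
    print(f"{mnemonic}  U+{code_point:04X}  {numeric_reference(mnemonic)}  {name}")
```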
(New labels are to be introduced through https://github.com/whatwg/meta.)
I just filed https://github.com/whatwg/meta/issues/182
I believe I've commented previously along the following lines when this has come up:
- For wiki projects, it's irrelevant whether this is in HTML. The wiki software processes the wiki syntax before generating HTML output, so wiki software can introduce whatever macro expansions its developers see fit and users find useful.
- In the case of HTML itself, I think the backward-compatibility characteristics of this feature request are bad. The requested feature doesn't expand the expressiveness of HTML in any way: You can already express U+202F unescaped in UTF-8 or escaped as a numeric reference. However, if a named entity were added, it would break in the currently-existing HTML parsers (not only in the currently-existing browsers). This could lead either to unwanted breakage or to non-usage of the feature (i.e. using the numeric form or unescaped UTF-8 anyway for better compat).
- Making this change would set a precedent for others to request named entities for characters they find important causing a repeat of the previous point over and over again.
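The wiki-side approach from the first point could look roughly like the following Python sketch: a preprocessing step that expands a wiki-level macro into a numeric character reference, which existing HTML parsers already understand. The macro table, the `&nnbsp;` spelling of the macro, and the function name are assumptions made for this illustration, not features of any particular wiki engine.

```python
import re

# Hypothetical wiki-level macros: macro name -> numeric character reference.
WIKI_MACROS = {"nnbsp": "&#x202F;"}

# Anything shaped like a named reference: "&", a name, ";".
_MACRO = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")

def expand_wiki_macros(source: str) -> str:
    """Replace known wiki macros; leave everything else (e.g. &nbsp;) untouched."""
    return _MACRO.sub(lambda m: WIKI_MACROS.get(m.group(1), m.group(0)), source)

print(expand_wiki_macros("10&nnbsp;000&nbsp;km"))  # prints 10&#x202F;000&nbsp;km
```

The output contains only references that are valid HTML today, so no parser change is needed for this path.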
Curious to hear what others think, but I tend to agree. Perhaps the best course of action here would be to update https://github.com/whatwg/html/blob/master/FAQ.md and close this type of feature request.
@hsivonen I think what makes this request a bit different from others is that it's for invisible characters. As @r12a points out, it's hard to work with invisible characters. And letting wiki markup handle it isn't helpful at all: this is something that needs to work across all input modes into HTML, because it has to be reliable and consistent to be useful to the people who need them.
So while I understand your general premise about the update cycle being, potentially, 5 years or so, I think it's worth it in this case. If we want to take the time to batch up all the invisible characters we need to care about so we can do it at once, let's do that and make a coordinated update to the parser that makes languages that need invisible characters easier to typeset in HTML.
What wikis or any other applications do is entirely irrelevant here. And following @hsivonen 's argumentation, any progress is bad. So why care at all? Just leave it forever as it was defined some 30 years ago. Never change a running system (which is generally bad advice).
I'm fully aware that not all existing HTML parsers and renderers will properly handle this overnight when it's added. It'll take time. But we're in the fortunate (and also unfortunate) situation that the number of relevant HTML parsers in use is very limited, and these are actively maintained and automatically updated most of the time. So changes like this will eventually trickle through to all users and in a few years we can benefit from it without worrying too much. If you're not willing to wait such a long time, you shouldn't work in such projects. Web projects already have a large number of dependencies on browsers and this could be just one of them. As soon as you discover that all browsers that support everything else you already need also support this entity, you can safely use it.
Also, of course I can use any Unicode character directly. But this one hasn't made it onto physical or software-defined keyboards. Neither has the NBSP. Or the SHY. Or the MINUS. So this argumentation is moot. Also, of course I can escape any Unicode character by its codepoint value. But nobody will remember those numbers, which means that 1. nobody will be able to fluently write these characters and 2. nobody will be able to fluently read and understand them. This is about as big a usability fail as it can get. Then, we already have similar entities, like NBSP. Why do they exist? I imagine they exist because they cannot be written with keyboards, their codepoint cannot be remembered, this one is even visually indistinguishable from a more common character (SP) and its use is required sometimes.
While not being strictly "required" and not used as often, NNBSP falls exactly in the same category. So I definitely see reason for its existence as an entity. On the other hand, it doesn't hurt anybody. Any undefined HTML entity is invalid markup, and the "nnbsp" entity is undefined, so it can safely be assigned. As could other invisible Unicode whitespace, like some zero-width characters that affect wrapping and/or hyphenation.
> But this one hasn't made it onto physical or software-defined keyboards.
Why is that?
In addition to what @fantasai said: for some characters, when no named character reference is available, the choice in practice is not between direct UTF-8 encoding and a numeric character reference, but between the proper character and some inferior replacement character. For invisible characters in particular, that's either a space or nothing.
I was actioned by I18N to reopen this issue.
We are well aware of #7071 which notes that HTML will not add new named character references. The argument in favor of that policy is that newly added named entities would be broken in all parsers (not just browsers) until such time as the parsers adopted the change and that this would be a barrier to use (users would not adopt the new entities because they do not work).
The sense of I18N is that we want to reopen the discussion anyway. We have a particular interest in the new isolating bidi controls, although other invisible characters are also in this request. Invisible characters are hard to use and harder to manage when authoring a page. When using NCRs, the user must memorize the code point number, which is more prone to error. Most of these characters have memorable short names that lend themselves to entities, such as RLI for U+2067 RIGHT TO LEFT ISOLATE.
Adding the invisible characters to the named entity list would not help users right away, but the new entities could become commonly supported within just a few years.
Please advise how best to pursue this issue and whether you would like to discuss it in our teleconference or some other venue.
I hope it's okay for an outsider to post to this thread. It seems to me that one of the bigger barriers to adding entities is not merely that existing parsers will not recognise them, but more specifically the manner in which they fail. §13.1.2 of the current HTML 5 spec says ambiguous ampersands are invalid in most contexts. That means all bets are off, but in the various browsers I've tested the entity is displayed literally in the text, which is pretty bad in this particular case. The argument is probably to cope with HTML like <p>I ordered fish&chips; John had a pie.</p>, though I wonder how common this really is. (Are there languages where ampersands are commonly used without surrounding space?) If HTML5 starts adding new entities, this is probably no longer the best behaviour. Would it be better to display U+FFFD in place of the full entity-like-thing when an ambiguous ampersand is encountered? At least that makes it clear to a reader that something is off, which the raw entity name may not. If so, might it be sensible to change the spec to mandate this behaviour in advance of actually adding new entities?
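The difference between the two failure modes can be sketched in Python. `html.unescape` from the standard library leaves an unknown named reference as literal text, which matches the behaviour described above; `decode_with_replacement` is a hypothetical variant written for this illustration that substitutes U+FFFD instead. It is not part of any specification or parser.

```python
import html
import re

REPLACEMENT = "\ufffd"  # REPLACEMENT CHARACTER

# A complete character reference: "&", then a name or a numeric form, then ";".
_REFERENCE = re.compile(r"&(?:#[0-9]+|#[xX][0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);")

def decode_literal(text: str) -> str:
    """Decode references; unknown names are left in the text as written."""
    return html.unescape(text)

def decode_with_replacement(text: str) -> str:
    """Hypothetical variant: unknown named references become U+FFFD."""
    def substitute(match: re.Match) -> str:
        reference = match.group(0)
        decoded = html.unescape(reference)
        # If nothing changed, the name was not recognised.
        return REPLACEMENT if decoded == reference else decoded
    return _REFERENCE.sub(substitute, text)

sample = "10&nnbsp;000 fish&chips; R&D"
print(decode_literal(sample))
print(repr(decode_with_replacement(sample)))
```

Note that a bare ampersand with no terminating semicolon, as in R&D, is untouched by both functions; only complete entity-like sequences are affected.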
> Are there languages where ampersands are commonly used without surrounding space?
I'm not sure, but note that in English, there are words like P&G, R&D, and AT&T that don't have the surrounding space.
Apologies for the lack of reply here. I just noticed @aphillips's request to discuss this in person. I'll mark it agenda+ and suggest we discuss it sometime in January at a time suitable for the US and Europe given the locations of the relevant experts. January 11 looks to be the first available such slot at 9AM PST.
What exactly is the I18N proposal to be discussed?
Introduce named character references for …
- … some specific non-spacing (control) characters?
- … all existing non-spacing characters?
- … all existing and future non-spacing characters?
- … some specific non-spacing and whitespace characters?
- … all existing non-spacing and whitespace characters?
- … all existing and future non-spacing and whitespace characters?
@annevk
Thanks! Let's look for a suitable time slot. I'm not familiar with HTML's call schedule. Would it be possible to do a week later (assuming you have calls weekly??) such as the 18th? That way we could include @r12a, who has previously contributed on this thread. We can also host you in our regular call (Thursdays at 7 AM Pacific)
@Crissov
We would like to discuss the possibility of additions of this type in general. We have specific existing non-spacing characters and, it appears, perhaps a few specific whitespace characters in mind. Obviously, if we "broke the dam" on additions, there is also the question of establishing criteria for any future additions. We do not propose to add named entities in a broad or general sense.
Maybe what sets this one apart from others is that it's invisible. You could potentially use smart input methods to generate just about any visible character and anybody else reading the document would see it. Of course you can also use smart input methods to generate special white-space characters (like I do with my modified keyboard layout), but the problem is that other people editing the document likely won't see it if they're not familiar with the various spaces and don't have the tools to see them. So to be safe, it could be a good solution to use a presentation that makes it visible. &nbsp; is already widely used in Wikipedia content, for example.
So if you're looking for criteria, this might be one. 🙂
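One way to get such a visible presentation with what HTML offers today is to rewrite the raw characters into references before the source is handed to other editors. The following Python sketch does that for the two no-break spaces discussed in this thread; the mapping and the name `reveal_spaces` are assumptions for illustration, and U+202F has to fall back to a numeric reference because no named one exists.

```python
# Raw character -> visible spelling in HTML source.
VISIBLE_FORMS = {
    "\u00a0": "&nbsp;",    # NO-BREAK SPACE has a named reference
    "\u202f": "&#x202F;",  # NARROW NO-BREAK SPACE only has numeric forms
}

_TABLE = str.maketrans(VISIBLE_FORMS)

def reveal_spaces(html_source: str) -> str:
    """Replace invisible no-break spaces with their character references."""
    return html_source.translate(_TABLE)

print(reveal_spaces("10\u202f000\u00a0km"))  # prints 10&#x202F;000&nbsp;km
```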
@aphillips for that time slot the next one is Feb 22. There are two other meetings, but one is not useful for Europe and one is not useful for the US. Getting WHATNOT participants to join another meeting could maybe work, but it probably requires explicitly pinging some people and making sure they can all make it which is not work I can sign up for right now. Maybe next year.
@annevk Thanks. This isn't urgent, so let's go for February? Thinking aloud, perhaps we (meaning me) should make a list of I18N issues that could use attention ahead of time and we can have a section of the call for I18N?
Sounds good to me!