Best-Practices-for-TEI-in-Libraries
discussion of hyphenation: mismatch between code sample and note
In the third row of our table, we have:
Colloquial name | Appearance in source document | Encoding | Note |
---|---|---|---|
Soft hyphen | UTF-8 is a char- acter encoding for Unicode. | `UTF-8 is a char<pc force="strong">-</pc><lb break="yes"/>acter encoding for Unicode.` | As in the first example, the use of `weak` as the value of `force` indicates that the encoder considers "character" to be a single orthographic token where the hyphen is only indicating that the word is broken across a line. The use of `no` as the value of `break` also indicates that the line break occurs inside an orthographic token (single word) which is broken across a line. |
The code sample involves `force="strong"` and `break="yes"`, but the note implies that it has `force="weak"` and `break="no"`. It's been too long since I thought about any of this, so I'm not even sure what is correct here. I vaguely recall that @sydb wrote this section?
The Note is correct and the encoding is incorrect. It should be `<lb break="no"/>` when the line break is inside a word.
I just checked the Guidelines on `<pc>`, and again, the encoded example is backwards. The `@force` attribute is "strong" when the punctuation mark is a word separator, and "weak" when it is not. In this case, the hyphen appears inside the word "character", so it doesn't serve as a word break character.
I think it should be:
`char<pc force="weak">-</pc><lb break="no"/>acter`
Also, we might try to force a line break at the hyphen in the rendition of the source document, so the hyphen doesn't look odd.
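For anyone wondering why the `weak`/`no` pairing matters in practice, here is a minimal sketch of how a downstream processor might flatten the two encodings into plain text. It is only an illustration, not anything the Best Practices prescribe: the function name `normalized_text` and the rule "drop a weak hyphen that directly precedes `<lb break="no"/>`" are my own assumptions, and the fragments are parsed without the TEI namespace to keep things short.

```python
import xml.etree.ElementTree as ET


def normalized_text(fragment: str) -> str:
    """Flatten a small TEI fragment, rejoining words split across a line.

    Hypothetical helper: assumes un-namespaced <pc> and <lb> elements
    that are direct children of the fragment.
    """
    root = ET.fromstring(f"<seg>{fragment}</seg>")
    children = list(root)
    parts = [root.text or ""]
    for i, el in enumerate(children):
        nxt = children[i + 1] if i + 1 < len(children) else None
        if el.tag == "pc":
            # Assumed rule: a weak hyphen immediately followed by
            # <lb break="no"/> only marks a line-end split, so drop it.
            soft = (
                el.get("force") == "weak"
                and not (el.tail or "").strip()
                and nxt is not None
                and nxt.tag == "lb"
                and nxt.get("break") == "no"
            )
            if soft:
                continue
            parts.append(el.text or "")
        elif el.tag == "lb":
            # Any value other than break="no" is treated as a word boundary.
            if el.get("break") != "no":
                parts.append(" ")
        else:
            parts.append("".join(el.itertext()))
        parts.append(el.tail or "")
    return "".join(parts)


as_posted = 'UTF-8 is a char<pc force="strong">-</pc><lb break="yes"/>acter encoding for Unicode.'
proposed = 'UTF-8 is a char<pc force="weak">-</pc><lb break="no"/>acter encoding for Unicode.'

print(normalized_text(as_posted))  # UTF-8 is a char- acter encoding for Unicode.
print(normalized_text(proposed))   # UTF-8 is a character encoding for Unicode.
```

Under these assumptions, only the proposed encoding lets a processor recover "character" as a single token, which is what the Note describes.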
Thank you for the quick detective work!