hyperref
hyperref applies the en-dash and em-dash TeX ligatures in bookmarks but not the double-quote ones, with an extra twist under xelatex
With pdflatex
\documentclass{article}
\usepackage[T1]{fontenc}
\usepackage[unicode]{hyperref}
\begin{document}
\section{A-section--with---hyphens}
\section{A `section' with ``quotes''}
\end{document}
produces this in the .out file:
\BOOKMARK [1][-]{section.1}{\376\377\000A\000-\000s\000e\000c\000t\000i\000o\000n\040\023\000w\000i\000t\000h\040\024\000h\000y\000p\000h\000e\000n\000s}{}% 1
\BOOKMARK [1][-]{section.2}{\376\377\000A\000\040\000`\000s\000e\000c\000t\000i\000o\000n\000'\000\040\000w\000i\000t\000h\000\040\000`\000`\000q\000u\000o\000t\000e\000s\000'\000'}{}% 2
and one sees that the bookmark will contain an en dash (\040\023, i.e. U+2013) and an em dash (\040\024, i.e. U+2014).
I don't have pdftk and did not try to export the bookmark. In the screenshot the two closing quotes look like a single " but maybe this is just '' as the .out file seems to indicate.
Thus, the issue here is one of coherence between the handling of hyphens and the handling of quote characters. Usually people do not disable TeX ligatures. The hyperref handling of hyphens looks OK so far, because if the input contained -{}- then the bookmark would contain separate hyphens, matching the typeset text.
edit: also << and >> end up as is in bookmarks, thus not matching the typeset text (T1 encoding) and behaving in contradiction with -- and ---.
But with xelatex a new twist arises. It is useful to disable TeX ligatures, especially for documents that are not written by hand but converted from other sources. In that context, if the author writes ---, the intent was to see it as is in the output; otherwise a Unicode code point would have been used in the input. Consider this:
\documentclass{article}
\usepackage{fontspec}
\defaultfontfeatures[\rmfamily,\sffamily]{}% turn off TeX ligatures
\setmainfont{FreeSerif}[
Extension = .otf,
UprightFont = *,
ItalicFont = *Italic,
BoldFont = *Bold,
BoldItalicFont = *BoldItalic
]
\setsansfont{FreeSans}[
Extension = .otf,
UprightFont = *,
ItalicFont = *Oblique,
BoldFont = *Bold,
BoldItalicFont = *BoldOblique,
]
\setmonofont{FreeMono}[
Extension = .otf,
UprightFont = *,
ItalicFont = *Oblique,
BoldFont = *Bold,
BoldItalicFont = *BoldOblique,
]
\usepackage[unicode]{hyperref}
\begin{document}
\section{A-section--with---hyphens}
\section{A `section' with ``quotes''}
\end{document}
% Local variables:
% TeX-engine: xetex
% End:
then we get again
\BOOKMARK [1][-]{section.1}{\376\377\000A\000-\000s\000e\000c\000t\000i\000o\000n\040\023\000w\000i\000t\000h\040\024\000h\000y\000p\000h\000e\000n\000s}{}% 1
\BOOKMARK [1][-]{section.2}{\376\377\000A\000\040\000`\000s\000e\000c\000t\000i\000o\000n\000'\000\040\000w\000i\000t\000h\000\040\000`\000`\000q\000u\000o\000t\000e\000s\000'\000'}{}% 2
This is disturbing regarding dashes but better regarding quotes, as now the typeset text matches the bookmarks.
edit: as << and >> end up as is in bookmarks, they do match the typeset text in this case, like quotes but unlike hyphens.
There are no ligatures involved in the bookmarks. hyperref does a string substitution in \pdfstringdef:
\HyPsd@Subst{---}\textemdash#1
This means that it is not possible to disable this. If you want single hyphens you will have to insert something that prevents the substitution, e.g. braces.
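To illustrate the brace workaround, here is a minimal sketch based on the -{}- form mentioned earlier in this thread. It assumes pdflatex with T1 encoding; the exact bytes written to the .out file are not shown because they were not checked here.

```latex
\documentclass{article}
\usepackage[T1]{fontenc}
\usepackage[unicode]{hyperref}
\begin{document}
% Empty groups split the runs of hyphens, so neither the font ligature
% nor the -- / --- string substitution in \pdfstringdef should match.
\section{A-section-{}-with-{}-{}-hyphens}
\end{document}
```

Both the typeset heading and the bookmark should then show separate hyphens rather than an en dash or em dash.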
While one could add more such substitutions for the quotes, I tend to say that the user should rather use proper Unicode input “quotes”, or use csquotes. For some code to configure csquotes to print proper quotes see https://github.com/josephwright/csquotes/issues/26#issuecomment-554423422.
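As an illustration of the Unicode-input route, here is a minimal sketch. It assumes a UTF-8 source file and a LaTeX kernel recent enough to read UTF-8 by default (with an older pdflatex, add \usepackage[utf8]{inputenc}).

```latex
\documentclass{article}
\usepackage[T1]{fontenc}
\usepackage[unicode]{hyperref}
\begin{document}
% The curly quotes are typed directly as U+2018, U+2019, U+201C, U+201D,
% so no ligature or string substitution is needed to obtain them.
\section{A ‘section’ with “quotes”}
\end{document}
```

With this input the typeset heading and the bookmark should carry the same quote characters, independent of any ligature settings.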
Thanks for the reply. It was clear from context what I meant: by "ligature" I meant a transform of some input into some output.
I agree proper Unicode input is the best. My context is one where latex is one output format, not the core input format. Thus it is better to feed it with already correct quotes.
Thanks for the link regarding csquotes. It looks complicated to integrate this into a multilingual automated production system.
What is complicated here? These are two lines of code (which could be imho in csquotes.sty directly) and then you only have to use \enquote everywhere.
I understand your query, and I sent you off track with the word "multilingual". I did not intend to say that each document is multilingual, but that production caters for various languages. The complication is that your solution demands adding LaTeX markup, which is presumably easy for a human but possibly more complex and prone to fail for automation. Modifying the source prior to feeding it to LaTeX means that more powerful tools, and faster ones at that, are available for transforming the source. If the automated task were to produce LaTeX markup, a single error in the output could break the PDF build much more easily than simply manipulating Unicode by automated parsing of the source ever could.
Let me also point out footnote 2 and Table 1 in the csquotes manual. The former indicates that a recent polyglossia is needed to cope with language variants; the latter lists possible values of the language option keys. The automator will have to integrate this upstream. This means effort, and it may require adaptation in the future. I am well aware that such complications can never be avoided entirely, but they can be lifted upstream, prior to feeding LaTeX. Then we solve problems for multiple targets, not only PDF via LaTeX. Does that clarify my admittedly ambiguous remark?
This all being said, there is an issue here, especially for an xelatex user who does not want transforms applied, anywhere, to -- or --- input. Yes, extra LaTeX markup solves this, and this is what I do. Note that because of that I have to escape all hyphens to a LaTeX macro and I have to use \pdfstringdefDisableCommands, and it is all possible only because mechanisms are already in place to avoid doing this in URLs passed to \href or \url.
I understand the legacy here. With pdflatex it was pretty cool that -- and --- produce an en dash or em dash in the bookmarks, and that hyperref automatically removes the {} when it encounters -{}- while remembering not to output an en dash in the bookmarks (be it in PDFDocEncoding or UTF-16BE, iirc).
But it was at odds with the absence of handling of the << and >> ligatures, and the absence of any transformation of the quotes ` and ' into curly ones, and likewise for the doubled ones. I have not checked whether or not PDFDocEncoding actually supports the target glyphs, but UTF-16BE definitely does. Thus, to my mind there is a long-standing hyperref issue here.
Well, as I said: ligatures, and so also any settings trying to influence ligatures, are not relevant here. hyperref has a rather fixed and hard-coded set of replacements/definitions that it applies to the input to get something sensible in the outlines. One could discuss whether converting -- into an en dash is sensible, and whether it would make sense to convert << into something else, but the fact that both are ligatures for some fonts should not matter. Bookmarks/outlines have quite different requirements than typesetting text.
I don't think bookmarks/outlines have a will and soul of their own. In the end it is the user whose taste matters. As a user I see the screenshots as in my OP and I consider something is wrong.
This doesn't change the fact that typeset text uses fonts while bookmarks are PDF strings. Not everything that works with fonts also works in such strings. This means that you need to use different tools to fine-tune them. hyperref provides \texorpdfstring exactly for this reason.
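As a sketch of how \texorpdfstring could be applied to the quotes example from this thread: the first argument is used for the typeset heading and the second for the bookmark. It assumes pdflatex with T1 encoding, and the explicit \textquote... commands in the second argument are one possible choice, not the only one.

```latex
\documentclass{article}
\usepackage[T1]{fontenc}
\usepackage[unicode]{hyperref}
\begin{document}
% First argument: typeset text, where the font ligatures apply.
% Second argument: PDF string for the bookmark, with explicit quote commands.
\section{\texorpdfstring{A `section' with ``quotes''}%
  {A \textquoteleft section\textquoteright{} with
   \textquotedblleft quotes\textquotedblright}}
\end{document}
```

The cost is that every affected heading needs this extra markup, which is the trade-off discussed above for automated production.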