Text extraction issue with Inter v4.1 in XeLaTeX-generated PDFs
Text copied from a XeLaTeX-produced PDF using Inter v4.1 contains unexpected characters, while version 3.19 works flawlessly.
To Reproduce
- Install XeTeX, Poppler, curl, unzip
- Run the script below in a dedicated directory. Both produced PDFs are attached for reference:

```sh
mkdir -p fonts
curl -s -L -O --output-dir fonts "https://github.com/rsms/inter/releases/download/v3.19/Inter-3.19.zip"
curl -s -L -O --output-dir fonts "https://github.com/rsms/inter/releases/download/v4.1/Inter-4.1.zip"
unzip -q -o -d fonts/Inter-3.19 fonts/Inter-3.19.zip
unzip -q -o -d fonts/Inter-4.1 fonts/Inter-4.1.zip

cat <<EOF > inter-3.19.tex
\documentclass{article}
\pagestyle{empty}
\usepackage{fontspec}
\setmainfont{Inter}[
  Path = ./fonts/Inter-3.19/Inter Desktop/,
  Extension = .otf,
  UprightFont = *-Regular,
  BoldFont = *-Bold,
  ItalicFont = *-Italic,
  BoldItalicFont = *-BoldItalic
]
\begin{document}
(C++) (100\%)
\end{document}
EOF

cat <<EOF > inter-4.1.tex
\documentclass{article}
\pagestyle{empty}
\usepackage{fontspec}
\setmainfont{Inter}[
  Path = ./fonts/Inter-4.1/extras/otf/,
  Extension = .otf,
  UprightFont = *-Regular,
  BoldFont = *-Bold,
  ItalicFont = *-Italic,
  BoldItalicFont = *-BoldItalic
]
\begin{document}
(C++) (100\%)
\end{document}
EOF

mkdir -p pdfs
xelatex -interaction=batchmode -output-directory pdfs inter-3.19.tex > /dev/null
xelatex -interaction=batchmode -output-directory pdfs inter-4.1.tex > /dev/null

pdftotext pdfs/inter-3.19.pdf - | grep -v $'\f' | grep -v '^$'
pdftotext pdfs/inter-4.1.pdf - | grep -v $'\f' | grep -v '^$'
```

- It outputs the following, even though both PDFs appear fine visually:

```
(C++) (100%)
?C?????100%? <redacted due to smileys that cannot be pasted>
```
Expected behavior
I expect it to output:

```
(C++) (100%)
(C++) (100%)
```
Environment
- OS: macOS 15.1.1, M2
- XeTeX 3.141592653-2.6-0.999996 (TeX Live 2024)
- Inter Regular 4.1
Additional notes
You can reproduce the issue by copying text from the provided PDFs. The problem is evident at least in macOS Preview.
Your PDFs appear to be corrupt, or infected, or both. Regardless, I cannot download and open them. Please put the PDFs inside a ZIP, and attach the ZIP file here.
@kenmcd I highly doubt they are either corrupt or infected. It's more likely that some of your protection tools are giving false positives. Anyway, a zip archive is attached.
Appears the encoding is wrong in the v4.1 PDF.
For some reason the (, ), and + are being substituted with the tabular-figures alternate glyphs, which have code points assigned up in the Unicode PUA (Private Use Area):

- ( → U+EE4E parenleft.case.tf
- ) → U+EE4F parenright.case.tf
- + → U+EE6A plus.case.tf
If the display font does not have those code-points (like here) - then the .notdef glyph appears.
So the problem is in how the PDF is being created.
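A rough way to double-check these mappings is to dump the cmap table with fontTools' ttx and grep for the code points in question. This is only a sketch: it assumes fontTools is installed and reuses the directory layout from the reproduction script above, and the output name inter-cmap.ttx is arbitrary.

```sh
# Dump only the cmap table of Inter v4.1 Regular to XML
ttx -t cmap -o inter-cmap.ttx "fonts/Inter-4.1/extras/otf/Inter-Regular.otf"
# The regular code points should map to the default glyphs (parenleft, parenright, plus)...
grep -iE 'code="0x28"|code="0x29"|code="0x2b"' inter-cmap.ttx
# ...while the PUA code points listed above should map to the .case.tf alternates
grep -iE 'code="0xee4e"|code="0xee4f"|code="0xee6a"' inter-cmap.ttx
```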
@kenmcd I discovered that, for example, c++ produces a readable mapping, while C++ shifts the + higher and uses an alternate glyph. This is great, as it looks much better. If Inter v3.19 didn't have these alternate glyphs, that would explain why it works without issues (UPD: it turns out Inter v3.19 does have these alternate glyphs after all).
I can confirm that the same input produces correct mappings in a LibreOffice-generated PDF, even though it uses both the default and the alternate glyphs. Unfortunately, I lack detailed knowledge of how mappings in PDFs work. However, it is clear that all the information needed to map the alternate glyphs correctly exists in the font, since LibreOffice handles it successfully.
This is likely an issue with XeTeX itself or the LaTeX packages it relies on. I will report this to their team and am closing this issue, as it is no longer relevant.
Thank you for such a great font!
I just realized what is going on.
The Contextual Alternates (calt) feature is substituting the case alternate glyphs (which are a little higher).
So parenleft becomes parenleft.case, etc.
But I do not know why Tabular Figures (tnum) is then also applied - so parenleft.case becomes parenleft.case.tf.
tnum is not On by default - so something is enabling it.
calt is On by default, so you would need to disable it if desired.
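For what it's worth, this feature chain can also be inspected outside of TeX with HarfBuzz's hb-shape utility; the sketch below assumes hb-shape is installed and uses the font path from the reproduction script above.

```sh
# Default features (calt on, tnum off): ( ) + next to capitals should come out as the .case alternates
hb-shape "fonts/Inter-4.1/extras/otf/Inter-Regular.otf" "(C++)"
# Force tnum on as well: the .case glyphs should be swapped for their .case.tf variants
hb-shape --features="+tnum" "fonts/Inter-4.1/extras/otf/Inter-Regular.otf" "(C++)"
# Disable calt: the plain parenleft/plus/parenright glyphs should remain
hb-shape --features="-calt" "fonts/Inter-4.1/extras/otf/Inter-Regular.otf" "(C++)"
```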
@kenmcd Sorry, did I rush to close the issue? Feel free to reopen it if you believe it is related to the font.
@kenmcd This completely blew my mind, as the following produces correct mappings by enabling the tnum feature, not disabling it:
```latex
\documentclass{article}
\pagestyle{empty}
\usepackage{fontspec}
\setmainfont{Inter}[
  Path = ./fonts/Inter-4.1/extras/otf/,
  Extension = .otf,
  UprightFont = *-Regular,
  BoldFont = *-Bold,
  ItalicFont = *-Italic,
  BoldItalicFont = *-BoldItalic,
  RawFeature = +tnum
]
\begin{document}
(C++) (c++) (100\%)
\end{document}
```
No, I do not think this is an issue (error) with the font.
The automatic calt replacements often confuse users.
According to the OpenType specs:
- calt default should be On
- tnum default should be Off
Just to be sure, I checked Inter v4.1 Regular OTF - and calt and tnum appear to be working as expected.
And as you mention, it works correctly in LibreOffice.
So the tnum being On by default appears to be a problem with XeLaTeX.
Something appears to be broken there.
You should probably file a bug with XeLaTeX about the odd tnum behavior.
The font’s cmap table maps PUA code points to alternate glyphs. This is an outdated, and IMO wrong, practice.
Some PDF producers, like XeTeX here, will use the cmap mappings for PDF text extraction; others, like LibreOffice here, will use the respective code point(s) from the input text regardless of the cmap mapping.
Try using \XeTeXgenerateactualtext=1; it might fix the text extraction issue with XeTeX.
Seeing https://github.com/rsms/inter/issues/541, it seems unlikely that the PUA mappings are going away.
@khaledhosny I can confirm that \XeTeXgenerateactualtext=1 works. I will write a PR for README.md, as this is an important peculiarity of the font that can take hours of debugging for those using XeTeX.
The solution with \XeTeXgenerateactualtext=1 fixes only part of the problem. When set to 1, the /ActualText entry is added to the output PDF, improving copy/paste and search functionality in PDF viewers. However, some tools like pdftotext respect this entry, while others, like macOS Preview, do not. Since my goal is to maximise accessibility for my document, my current solution is to fall back to Inter v3.19 for now.
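For reference, here is a minimal sketch of how the /ActualText behaviour can be checked with pdftotext, reusing the fonts/ and pdfs/ layout from the reproduction script above. The file name inter-4.1-actualtext.tex is arbitrary, and viewers such as macOS Preview still have to be checked by hand.

```sh
cat <<EOF > inter-4.1-actualtext.tex
\documentclass{article}
\pagestyle{empty}
\usepackage{fontspec}
% Add /ActualText entries to the output PDF
\XeTeXgenerateactualtext=1
\setmainfont{Inter}[
  Path = ./fonts/Inter-4.1/extras/otf/,
  Extension = .otf,
  UprightFont = *-Regular,
  BoldFont = *-Bold,
  ItalicFont = *-Italic,
  BoldItalicFont = *-BoldItalic
]
\begin{document}
(C++) (100\%)
\end{document}
EOF
xelatex -interaction=batchmode -output-directory pdfs inter-4.1-actualtext.tex > /dev/null
# pdftotext honours /ActualText, so this should now print "(C++) (100%)"
pdftotext pdfs/inter-4.1-actualtext.pdf - | grep -v $'\f' | grep -v '^$'
```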
Just a wild guess: could it be worth mapping these glyphs both to their actual text and to the Private Use Area (PUA)? I mean having the calt feature produce glyphs that map back to the actual text, while still keeping them reachable through PUA code points for those who need them there. I'm not a font expert and have no idea whether this is feasible. However, if it is, XeTeX is a significant tool, and there are only 32 glyphs involved, as noted in https://github.com/rsms/inter/issues/541.
The proper code points are already mapped to the default glyphs, and it is not possible to map the same code point to different glyphs in cmap.