Text extraction issue with Inter v4.1 in XeLaTeX-generated PDFs
Text copied from a XeLaTeX-produced PDF using Inter v4.1 contains unexpected characters, while version 3.19 works flawlessly.
To Reproduce
- Install XeTeX, Poppler, curl, unzip
- Run the script below in a dedicated directory. Both produced PDFs are attached for reference:

```sh
mkdir -p fonts
curl -s -L -O --output-dir fonts "https://github.com/rsms/inter/releases/download/v3.19/Inter-3.19.zip"
curl -s -L -O --output-dir fonts "https://github.com/rsms/inter/releases/download/v4.1/Inter-4.1.zip"
unzip -q -o -d fonts/Inter-3.19 fonts/Inter-3.19.zip
unzip -q -o -d fonts/Inter-4.1 fonts/Inter-4.1.zip

cat <<EOF > inter-3.19.tex
\documentclass{article}
\pagestyle{empty}
\usepackage{fontspec}
\setmainfont{Inter}[
  Path = ./fonts/Inter-3.19/Inter Desktop/,
  Extension = .otf,
  UprightFont = *-Regular,
  BoldFont = *-Bold,
  ItalicFont = *-Italic,
  BoldItalicFont = *-BoldItalic
]
\begin{document}
(C++) (100\%)
\end{document}
EOF

cat <<EOF > inter-4.1.tex
\documentclass{article}
\pagestyle{empty}
\usepackage{fontspec}
\setmainfont{Inter}[
  Path = ./fonts/Inter-4.1/extras/otf/,
  Extension = .otf,
  UprightFont = *-Regular,
  BoldFont = *-Bold,
  ItalicFont = *-Italic,
  BoldItalicFont = *-BoldItalic
]
\begin{document}
(C++) (100\%)
\end{document}
EOF

mkdir -p pdfs
xelatex -interaction=batchmode -output-directory pdfs inter-3.19.tex > /dev/null
xelatex -interaction=batchmode -output-directory pdfs inter-4.1.tex > /dev/null

pdftotext pdfs/inter-3.19.pdf - | grep -v $'\f' | grep -v '^$'
pdftotext pdfs/inter-4.1.pdf - | grep -v $'\f' | grep -v '^$'
```

- It outputs the following, even though both PDFs appear fine visually:

```
(C++) (100%)
?C?????100%? <redacted due to smileys that cannot be pasted>
```
Expected behavior
I expect it to output:

```
(C++) (100%)
(C++) (100%)
```
Environment
- OS: macOS 15.1.1, M2
- XeTeX 3.141592653-2.6-0.999996 (TeX Live 2024)
- Inter Regular 4.1
Additional notes
You can reproduce the issue by copying text from the provided PDFs. The problem is evident at least in macOS Preview.
Your PDFs appear to be corrupt, or infected, or both. Regardless, I cannot download and open them. Please put the PDFs inside a ZIP, and attach the ZIP file here.
@kenmcd I highly doubt they are either corrupt or infected. It's more likely that some of your protection tools are giving false positives. Anyway, a zip archive is attached.
Appears the encoding is wrong in the v4.1 PDF.
For some reason the (, ), and + are being substituted with the tabular-figures alternate glyphs, which have code points assigned up in the Unicode PUA (Private Use Area):

- ( → U+EE4E parenleft.case.tf
- ) → U+EE4F parenright.case.tf
- + → U+EE6A plus.case.tf
If the display font does not have those code-points (like here) - then the .notdef glyph appears.
So the problem is in how the PDF is being created.
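A rough way to double-check these mappings is to dump the cmap table with fontTools' ttx and grep for the code points in question. This is only a sketch: it assumes fontTools is installed and reuses the directory layout from the reproduction script above, and the output name inter-cmap.ttx is arbitrary.

```sh
# Dump only the cmap table of Inter v4.1 Regular to XML
ttx -t cmap -o inter-cmap.ttx "fonts/Inter-4.1/extras/otf/Inter-Regular.otf"
# The regular code points should map to the default glyphs (parenleft, parenright, plus)...
grep -iE 'code="0x28"|code="0x29"|code="0x2b"' inter-cmap.ttx
# ...while the PUA code points listed above should map to the .case.tf alternates
grep -iE 'code="0xee4e"|code="0xee4f"|code="0xee6a"' inter-cmap.ttx
```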
@kenmcd I discovered that, for example, c++ produces a readable mapping, while C++ shifts the + higher and uses an alternate glyph. This is great, as it looks much better. If Inter v3.19 didn't have these alternate glyphs, that would explain why it works without issues (UPD: it turns out Inter v3.19 does have these alternate glyphs after all).
I can confirm that the same input produces correct mappings in a LibreOffice-generated PDF, even though it uses both the default and the alternate glyphs. Unfortunately, I lack detailed knowledge of how mappings in PDFs work. However, it is clear that all the information needed to map the alternate glyphs correctly exists in the font, since LibreOffice handles it successfully.
This is likely an issue with XeTeX itself or the LaTeX packages it relies on. I will report this to their team and am closing this issue, as it is no longer relevant.
Thank you for such a great font!
I just realized what is going on.
The Contextual Alternates (calt) feature is substituting the case alternate glyphs (which are a little higher).
So parenleft becomes parenleft.case, etc.
But I do not know why Tabular Figures (tnum) is then also applied - so parenleft.case becomes parenleft.case.tf.
tnum is not On by default - so something is enabling it.
calt is On by default, so you would need to disable it if desired.
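For what it's worth, this feature chain can also be inspected outside of TeX with HarfBuzz's hb-shape utility; the sketch below assumes hb-shape is installed and uses the font path from the reproduction script above.

```sh
# Default features (calt on, tnum off): ( ) + next to capitals should come out as the .case alternates
hb-shape "fonts/Inter-4.1/extras/otf/Inter-Regular.otf" "(C++)"
# Force tnum on as well: the .case glyphs should be swapped for their .case.tf variants
hb-shape --features="+tnum" "fonts/Inter-4.1/extras/otf/Inter-Regular.otf" "(C++)"
# Disable calt: the plain parenleft/plus/parenright glyphs should remain
hb-shape --features="-calt" "fonts/Inter-4.1/extras/otf/Inter-Regular.otf" "(C++)"
```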
@kenmcd Sorry, did I rush to close the issue? Feel free to reopen it if you believe it is related to the font.
@kenmcd This completely blew my mind, as the following produces correct mappings by enabling the tnum feature, not disabling it:
```latex
\documentclass{article}
\pagestyle{empty}
\usepackage{fontspec}
\setmainfont{Inter}[
  Path = ./fonts/Inter-4.1/extras/otf/,
  Extension = .otf,
  UprightFont = *-Regular,
  BoldFont = *-Bold,
  ItalicFont = *-Italic,
  BoldItalicFont = *-BoldItalic,
  RawFeature = +tnum
]
\begin{document}
(C++) (c++) (100\%)
\end{document}
```
No, I do not think this is an issue (error) with the font.
The automatic calt replacements often confuse users.
According to the OpenType specs:
- calt default should be On
- tnum default should be Off
Just to be sure, I checked Inter v4.1 Regular OTF - and calt and tnum appear to be working as expected.
And as you mention, it works correctly in LibreOffice.
So the tnum being On by default appears to be a problem with XeLaTeX.
Something appears to be broken there.
You should probably file a bug with XeLaTeX about the odd tnum behavior.
The font’s cmap table maps PUA code points to alternate glyphs. This is an outdated, and IMO wrong, practice.
Some PDF producers, like XeTeX here, will use the cmap mappings for PDF text extraction; others, like LibreOffice here, will use the respective code point(s) from the input text regardless of the cmap mapping.
Try using \XeTeXgenerateactualtext=1; it might fix the text extraction issue with XeTeX.
Seeing https://github.com/rsms/inter/issues/541, it seems unlikely that the PUA mappings are going away.
@khaledhosny I can confirm that \XeTeXgenerateactualtext=1 works. I will write a PR for README.md, as this is an important peculiarity of the font that can take hours of debugging for those using XeTeX.
The solution with \XeTeXgenerateactualtext=1 fixes only part of the problem. When set to 1, the /ActualText entry is added to the output PDF, improving copy/paste and search functionality in PDF viewers. However, some tools like pdftotext respect this entry, while others, like macOS Preview, do not. Since my goal is to maximise accessibility for my document, my current solution is to fall back to Inter v3.19 for now.
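For reference, here is a minimal sketch of how the /ActualText behaviour can be checked with pdftotext, reusing the fonts/ and pdfs/ layout from the reproduction script above. The file name inter-4.1-actualtext.tex is arbitrary, and viewers such as macOS Preview still have to be checked by hand.

```sh
cat <<EOF > inter-4.1-actualtext.tex
\documentclass{article}
\pagestyle{empty}
\usepackage{fontspec}
% Add /ActualText entries to the output PDF
\XeTeXgenerateactualtext=1
\setmainfont{Inter}[
  Path = ./fonts/Inter-4.1/extras/otf/,
  Extension = .otf,
  UprightFont = *-Regular,
  BoldFont = *-Bold,
  ItalicFont = *-Italic,
  BoldItalicFont = *-BoldItalic
]
\begin{document}
(C++) (100\%)
\end{document}
EOF
xelatex -interaction=batchmode -output-directory pdfs inter-4.1-actualtext.tex > /dev/null
# pdftotext honours /ActualText, so this should now print "(C++) (100%)"
pdftotext pdfs/inter-4.1-actualtext.pdf - | grep -v $'\f' | grep -v '^$'
```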
Just a wild guess: could it be worth mapping these glyphs both to their actual text and to the Private Use Area (PUA)? I mean having the calt feature produce glyphs that map back to the actual text, while still keeping them reachable through PUA code points for those who need them there. I'm not a font expert and have no idea whether this is feasible. However, if it is, XeTeX is a significant tool, and there are only 32 glyphs involved, as noted in https://github.com/rsms/inter/issues/541.
The proper code points are already mapped to the default glyphs, and it is not possible to map the same code point to different glyphs in cmap.