neatroff_make
Text cannot be copied/extracted from neatpdf-generated files with OpenType fonts
PDF files generated by neatpdf look fine in a PDF viewer, but the text in them cannot be copied to the clipboard nor extracted with pdftotext when (some?) OpenType fonts are used. For example the following input file:
.fp - M Minion-Pro-Regular
.fp - G Garamond-Premier-Pro
.ft R
R: “Vimperator Vjol” fi
.sp
.ft M
Minion Pro: “Vimperator Vjol” fi
.sp
.ft G
Garamond Premier Pro: “Vimperator Vjol” fi
converted to PDF which is then run through pdftotext results in the following output:
R: “Vimperator Vjol” fi
.JOJPO 1SP7JNQFSBUPS 7KPMw ĕ
(BSBNPOE 1SFNJFS 1SP7JNQFSBUPS 7KPMu ઔ
So text which uses a Postscript font (the default R) comes through fine, but text that uses the two Adobe OpenType fonts is garbled. I haven't tested with a traditional TrueType font yet.
It is possible to work around this by using neatpost and ps2pdf, but do you think this could be fixed in neatpdf?
Use case: I'm updating my CV and it looks great, but unfortunately some companies process applications with automated systems and a non-machine-readable PDF may be a problem.
vuori wrote:
> PDF files generated by metapdf look fine in a PDF viewer, but the
Do you mean neatpdf (Neatroff's PDF post-processor)?
> text in them cannot be copied to the clipboard nor extracted with pdftotext when (some?) OpenType fonts are used. For example the following input file:
> .fp - M Minion-Pro-Regular
> .fp - G Garamond-Premier-Pro
> .ft R
> R: “Vimperator Vjol” fi
> .sp
> .ft M
> Minion Pro: “Vimperator Vjol” fi
> .sp
> .ft G
> Garamond Premier Pro: “Vimperator Vjol” fi
> converted to PDF which is then run through pdftotext results in the following output:
> R: “Vimperator Vjol” fi
> .JOJPO 1SP7JNQFSBUPS 7KPMw ĕ
> (BSBNPOE 1SFNJFS 1SP7JNQFSBUPS 7KPMu ઔ
> So text which uses a Postscript font (the default R) comes through fine, text that uses the two Adobe OpenType fonts is garbled. I didn't test with a traditional TrueType font yet.
Does the output of both neatpost and neatpdf have this problem?
> It is possible to work around this by using metapost and ps2pdf, but do you think this could be fixed in metapdf?
Adobe's PDF Reference has a section on extracting text from PDF (§5.9). I have to examine how much work that requires.
Ali
Yes, sorry, it was getting pretty late and I kept writing "meta" when I meant "neat". The problem only occurs with neatpdf. Postscript output from neatpost has no problems.
The page object in the PDF output by neatpdf starts like this:
/Times-Roman.0 10 Tf
1 0 0 1 72.00 780.00 Tm
[<321a> -250 (h6) 20 (IM) 10 (PERA) 10 (T) 10 (OR) -250 (6) 30 (JOLv) -250 (l)] TJ
/MinionPro-Regular 10 Tf
1 0 0 1 72.00 756.00 Tm
[<002e> 10 <004a004f004a0050> 10 <004f> -230 <0031> 10 <0053> 10 <0050001b> -230 <0069> 10 <0037> 50 <004a004e> 20 <0051> -10 <0046005300420055> 10 <0050> 10 <0053> -230 <0037> 50 <004b> -10 <0050> 10 <004d> 30 <0077> -230 <0115>] TJ
Since Identity-H mapping is being used for the OpenType fonts, I guess the arguments to the second TJ command are CIDs? Maybe the problem is the lack of a "ToUnicode CMap" (PDF spec 9.10.3) as described here: https://tex.stackexchange.com/questions/526157/what-is-identity-h-encoding-should-it-be-avoided-and-if-so-how ?
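To check that guess, here is a rough Python sketch (not anything from neatpdf itself) that reads the hex strings in the second TJ array above as 16-bit codes. Treating each code as a Unicode code point reproduces the letters pdftotext printed, which is what I would expect from an extractor that has no ToUnicode CMap to consult. For this sample, adding 31 to codes in the range 1 to 95 recovers the ASCII text; that offset is only something I read off this one font's output, not a general rule, and the names cids, naive_text, guess_ascii and bfchar_block are made up for illustration. The last function prints the kind of bfchar entries a ToUnicode CMap would carry; a real CMap also needs the surrounding header and codespace range, which I left out.

```python
import re

# TJ operand copied from the MinionPro-Regular line of the page object above.
TJ = ("[<002e> 10 <004a004f004a0050> 10 <004f> -230 <0031> 10 <0053> 10 "
      "<0050001b> -230 <0069> 10 <0037> 50 <004a004e> 20 <0051> -10 "
      "<0046005300420055> 10 <0050> 10 <0053> -230 <0037> 50 <004b> -10 "
      "<0050> 10 <004d> 30 <0077> -230 <0115>]")


def cids(tj):
    """Return the 16-bit big-endian codes from every <hex> string in a TJ array."""
    out = []
    for hexstr in re.findall(r"<([0-9a-fA-F]+)>", tj):
        data = bytes.fromhex(hexstr)
        out += [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]
    return out


def naive_text(codes):
    """Treat each code as a Unicode code point, as an extractor with no
    ToUnicode CMap might do."""
    return "".join(chr(c) for c in codes)


def guess_ascii(codes, offset=31):
    """ASSUMPTION: the offset of 31 is only observed in this sample.
    Codes outside 1..95 (quotes, ligatures) are shown as '?'."""
    return "".join(chr(c + offset) if 1 <= c <= 95 else "?" for c in codes)


def bfchar_block(mapping):
    """Format {code: text} as the bfchar section of a ToUnicode CMap.
    One glyph may map to several characters, e.g. a ligature."""
    lines = [f"{len(mapping)} beginbfchar"]
    for cid, text in sorted(mapping.items()):
        lines.append(f"<{cid:04x}> <{text.encode('utf-16-be').hex()}>")
    lines.append("endbfchar")
    return "\n".join(lines)


codes = cids(TJ)
print(naive_text(codes)[:6])   # the garbled form
print(guess_ascii(codes))      # ASCII recovered with the guessed offset
# 0x0115 being the "fi" ligature is inferred from the input text, not verified.
print(bfchar_block({0x002e: "M", 0x0115: "fi"}))
```

If this reading is right, the page content is fine as it is and only the per-font mapping from these codes back to Unicode is missing from the PDF.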