pdfalto
characters not recognised
I'm uploading this file 1903.07791.pdf
where some characters are not recognised:
the result is this:
or, this, in the text (please ignore the tags...):
Room temperature electrical resistivity was decreased down from 300 mcm for x = 0 to 8 mcm for x = 0.4. However, the temperature dependence of electrical resistivity was still insulating for x 0.4. In the present study, we show that Bi-rich composition up to ca. x = 0.8 can be obtained by optimizing synthesis temperature.
@lfoppiano I need details about the command options you are using. Is this output from the actual GROBID configuration?
@Aazhar this output was the result from a module using the latest Grobid version. I did not process it into pdfalto directly.
After re-checking, this is the output with pdf2xml: 1903.07791.pdfxml.txt
and this is the output with pdfalto: 1903.07791.txt
If I'm not wrong, some text is missing in the pdfalto version, but it may be better if you double-check.
To reproduce the issue you can use grobid-quantities, generate the training and pick up the txt version of the training data. If you use grobid 0.5.4 (the default for grobid-quantities) you will have the pdf2xml output, while if you update to 0.5.5-SNAPSHOT you will have the pdfalto output.
This document is an example of problematic glyphs (not correctly mapped to the corresponding Unicode code point). Since the OCR feature will soon be implemented (we are still tuning model parameters and training data for better generalisation), this output is correct as things stand.
Regarding the pdfalto/pdf2xml diff: unless you are using the OCR or reading-order features, the output should not differ, except for notations and Greek glyphs, for which some common font rules are used.
@Aazhar
I am very interested in this (OCR of unknown glyphs), let me know if I can help with anything (deep learning, dataset preparation etc.).
You rock ;)
Another test document here: hal-00720564.pdf
This snippet contains the unit m s⁻¹, which is extracted as (2.57–4.63 mÆs)1):

pdf2xml and Preview give the same output, so I would say it is not urgent to fix. I'm just providing an additional test document 😄
I found a similar problem with ligatures, example file https://arxiv.org/pdf/1906.08479.pdf
The character in question is an "fi" ligature, used in "mean-field", for example. I compiled pdfalto from the master branch and ran ./pdfalto -f 48 -l 48 1906.08479.pdf 1906.08479.alto.xml, and the fi ligature gets dropped (I removed some zero attributes for easier reading):
<String sid="p48_s151" ID="p48_w151" CONTENT="mean-" HPOS="492.065" VPOS="327.587" WIDTH="27.5184" HEIGHT="9.7091" STYLEREFS="font1"/>
<String sid="p48_s152" ID="p48_w152" CONTENT="eld" HPOS="525.57" VPOS="327.587" WIDTH="13.0108" HEIGHT="9.7091" STYLEREFS="font1"/>
This is from the [CDL03] reference line. The word "field" is very common in this PDF, so examples are abundant.
Is this the same problem?
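One way to inspect such output is to compare the right edge of one String element with the left edge of the next. The sketch below is only an illustration: the class and method names are hypothetical, the fragment is the two elements quoted above reduced to the attributes that matter and wrapped in a TextLine element, and reading the leftover gap as the place where the dropped glyph sat is an assumption, not something pdfalto reports.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of pdfalto or GROBID.
class AltoGapCheck {

    // Distance between the right edge of `left` (HPOS + WIDTH) and the left edge of `right`.
    static double edgeGap(Element left, Element right) {
        double leftEnd = Double.parseDouble(left.getAttribute("HPOS"))
                + Double.parseDouble(left.getAttribute("WIDTH"));
        return Double.parseDouble(right.getAttribute("HPOS")) - leftEnd;
    }

    // Parses an ALTO fragment and returns the gap between its first two String elements.
    static double gapBetweenFirstTwoStrings(String altoFragment) {
        try {
            NodeList strings = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(altoFragment)))
                    .getElementsByTagName("String");
            return edgeGap((Element) strings.item(0), (Element) strings.item(1));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse ALTO fragment", e);
        }
    }

    public static void main(String[] args) {
        String fragment = "<TextLine>"
                + "<String CONTENT=\"mean-\" HPOS=\"492.065\" WIDTH=\"27.5184\"/>"
                + "<String CONTENT=\"eld\" HPOS=\"525.57\" WIDTH=\"13.0108\"/>"
                + "</TextLine>";
        // Prints the horizontal space left between "mean-" and "eld".
        System.out.println("gap = " + gapBetweenFirstTwoStrings(fragment));
    }
}
```

For the two elements above the gap comes out at roughly 6 units, which is where the missing glyph would be expected in the rendered line.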
Just curious whether any work has been done on the ligature issue. I'm also seeing it occur with "ff" and "ffi", as in the words "effect" and "efficacy".
@bmorton1, so far ligatures are left as they are: if we have a \uFB00, we keep it and do not rewrite it as the two characters ff. We considered this out of scope for pdfalto (some users could be interested in keeping the ligatures). Of course we could also include it in pdfalto, it's open to debate :)
On the other hand, pdfalto does handle character composition: decomposed sequences are introduced simply to save glyphs, so there is no reason to keep the sequence 'e for é (and this is very standardised too).
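For readers who want to see what composition means in practice, here is a minimal Java sketch using the standard java.text.Normalizer. It is not pdfalto's code (pdfalto does this internally and its logic is not shown in this thread), and it assumes the decomposed input is a base letter followed by a Unicode combining mark; the class name is hypothetical.

```java
import java.text.Normalizer;

// Hypothetical illustration of character composition, not pdfalto's implementation.
class ComposeAccents {

    // NFC merges a base letter and a following combining mark into one precomposed code point.
    static String compose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // 'e' followed by U+0301 COMBINING ACUTE ACCENT, twice
        String decomposed = "re\u0301sistivite\u0301";
        String composed = compose(decomposed);
        // prints "13 -> 11": each e + accent pair becomes a single é
        System.out.println(decomposed.length() + " -> " + composed.length());
    }
}
```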
Tools using pdfalto can then handle ligatures on their own; it's not sophisticated. For instance, GROBID, which uses pdfalto, applies several character normalization steps (ligatures, Unicode character family normalization, etc.). For ligatures, we use this mapping:
// ligature
case '\uFB00': {
res += "ff";
break;
}
case '\uFB01': {
res += "fi";
break;
}
case '\uFB02': {
res += "fl";
break;
}
case '\uFB03': {
res += "ffi";
break;
}
case '\uFB04': {
res += "ffl";
break;
}
case '\uFB06': {
res += "st";
break;
}
case '\uFB05': {
res += "ft";
break;
}
case '\u00E6': {
res += "ae";
break;
}
case '\u00C6': {
res += "AE";
break;
}
case '\u0153': {
res += "oe";
break;
}
case '\u0152': {
res += "OE";
break;
}
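The cases above are a fragment of a larger switch. As a self-contained sketch of the same idea, a consumer of pdfalto output could expand ligatures with a lookup table. The class and method names below are hypothetical and this is not GROBID's actual method; the table copies the mapping quoted above as is, including "ft" for U+FB05 (which Unicode names LATIN SMALL LIGATURE LONG S T).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper mirroring the mapping quoted in this thread; not GROBID's code.
class LigatureExpander {

    private static final Map<Character, String> LIGATURES = new HashMap<>();
    static {
        LIGATURES.put('\uFB00', "ff");
        LIGATURES.put('\uFB01', "fi");
        LIGATURES.put('\uFB02', "fl");
        LIGATURES.put('\uFB03', "ffi");
        LIGATURES.put('\uFB04', "ffl");
        LIGATURES.put('\uFB05', "ft"); // as in the quoted mapping; the code point is "long s t"
        LIGATURES.put('\uFB06', "st");
        LIGATURES.put('\u00E6', "ae");
        LIGATURES.put('\u00C6', "AE");
        LIGATURES.put('\u0153', "oe");
        LIGATURES.put('\u0152', "OE");
    }

    // Replaces every ligature character found in the table; all other characters are kept.
    static String expand(String text) {
        StringBuilder res = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            String replacement = LIGATURES.get(c);
            if (replacement != null) {
                res.append(replacement);
            } else {
                res.append(c);
            }
        }
        return res.toString();
    }

    public static void main(String[] args) {
        System.out.println(expand("mean-\uFB01eld")); // prints "mean-field"
    }
}
```

This only helps when the ligature reaches the consumer as its proper code point; it cannot recover a glyph that was never resolved.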
Note that this is a different question from the one raised in this issue, which is that some glyphs cannot be correctly mapped to a Unicode code point. Such an incorrectly mapped glyph could happen to be a ligature (for that issue, we plan to use a bit of "local" OCR, because there is no other solution).
@kermitt2 is this code for ligatures in a released Grobid version, or is it only in master branch?
Also, in the example that I gave, it looks like these ligatures do not get passed to the consumer by pdfalto. They are simply dropped, so there is no chance for Grobid to handle them. Maybe this PDF is writing ligatures in some weird way?
When I open my example PDF in Firefox and copy a span, I get
Role of the interaction matrix in mean-eldspin glass models
This is a character \u001B, doesn't look right.
MacOS Preview.app correctly copies it as:
Role of the interaction matrix in mean-field spin glass models
Chrome copies almost the same string as Firefox, except it enters \n between "\u001Beld" and "spin". I think they both use pdf.js, but I'm not 100% sure. Safari gives the same correct output as Preview.
I could investigate what happens in pdfalto, but I don't know where to begin. It would be helpful if you can point me in the general direction of where to look at it in pdfalto source.
@vsolovyov the mapping is in both GROBID master and released versions.
However, if the character is "dropped", it means that the Unicode code point is not resolved for this glyph. This relates in particular to the way PDF embeds fonts. When a glyph belongs to a locally embedded font, the code point given for it is often not the expected one (for your example it should be \uFB01, but you get \u001B) but a code in the free Unicode range (thus the placeholder � in your case), so there is no way to know which character the glyph actually represents.
This is mentioned here and in some issues for grobid in the last years.
This usually happens with special characters, in particular mathematical symbols; the fact that the glyph is a ligature is not specific to the problem.
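A tool consuming pdfalto output cannot recover such glyphs, but it can at least flag tokens that contain them. The sketch below is one possible heuristic, not something pdfalto or GROBID is known to do: it treats private-use, unassigned and control code points (other than tab and line breaks) and the replacement character U+FFFD as suspicious. The class name and the choice of categories are assumptions.

```java
// Hypothetical heuristic for spotting unresolved glyphs in extracted text.
class SuspiciousGlyphs {

    // True for code points that are unlikely to be legitimate text content.
    static boolean isSuspicious(int codePoint) {
        if (codePoint == '\t' || codePoint == '\n' || codePoint == '\r') {
            return false; // ordinary whitespace controls are fine
        }
        int type = Character.getType(codePoint);
        return type == Character.PRIVATE_USE
                || type == Character.UNASSIGNED
                || type == Character.CONTROL
                || codePoint == 0xFFFD; // REPLACEMENT CHARACTER
    }

    // Counts suspicious code points in a token or line.
    static int countSuspicious(String text) {
        return (int) text.codePoints().filter(SuspiciousGlyphs::isSuspicious).count();
    }

    public static void main(String[] args) {
        // The escape character copied from the PDF in place of the fi ligature
        System.out.println(countSuspicious("mean-\u001Beld")); // prints 1
        System.out.println(countSuspicious("mean-field"));     // prints 0
    }
}
```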
I think this was already mentioned in other issues. MacOS in general includes some advanced PDF processing, in particular more proprietary fonts (notably from Adobe), and afaik it does on-the-fly OCR for unresolved glyphs. So when testing with Preview, all these locally unresolved glyphs are resolved. It's actually great stuff, and I hope to add something similar to pdfalto so that the tool can reach the level of PDF support offered by MacOS. Linux libraries (derived from xpdf) and other open source PDF parsing libraries are "closer" to the actual PDF encoding, so you will see the problem there.
It also means in practice that it is not possible to test the encoding of a PDF with MacOS Preview, because it does some excellent reconstruction behind the scenes (also for columns).
I hope it clarifies this problem!
@kermitt2 thank you very much for the write-up, it does clarify the problem. I had not expected MacOS to do on-the-fly OCR for unresolved glyphs, and I thought the problem with ligatures was much easier than it turns out to be.