Wrong figure recognition
example paper: https://journals.aps.org/prc/abstract/10.1103/PhysRevC.100.014306 (same as #781)
Fig 1 (missed)
Fig 2 (wrong head and figDesc)
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0" coords="5,513.20,380.04,36.03,7.88;5,304.15,391.00,4.48,7.88;5,308.63,388.87,4.66,5.98;5,308.63,395.50,2.99,5.25;5,313.79,391.00,32.61,7.88;5,348.49,389.35,5.98,5.25;5,355.00,390.21,193.95,8.97;5,304.15,401.96,245.08,8.57;5,304.15,412.13,245.08,9.29;5,304.15,423.09,245.09,8.97;5,304.15,434.05,173.68,8.97">
<head>B and the 1 + 1 0</head>
<label>11</label>
<figDesc>state of 10 B. The panels (a) are calculated with the THSR + pair wave function pair with optimized parameters. The panels (b) are obtained by using only the pairing term p with parameter c = 0. For all these calculations, β parameters are set to optimized values in the corresponding THSR + pair wave functions.</figDesc>
</figure>
Fig 3 (correct)
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1" coords="7,53.09,230.09,101.28,7.88;7,156.69,228.44,5.98,5.25;7,163.20,228.28,123.01,9.69;7,41.14,240.26,245.09,8.97;7,41.14,252.00,104.99,7.88">
<head>FIG. 3 .</head>
<label>3</label>
<figDesc>FIG. 3. Energy curve of the 10 B(3 + 0) with respect to the parameter d. The parameter c is set to be d = 1 − c. Other parameters are fixed at the optimized values.</figDesc>
</figure>
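As an aside, the `coords` attributes in the TEI output above encode one rectangle per token as semicolon-separated `page,x,y,width,height` groups. A minimal sketch of parsing such a string and computing the union bounding box for one page (illustrative helper names, not GROBID's actual code):

```java
import java.util.Arrays;

// Illustrative parser (not GROBID's actual code) for a TEI coords attribute
// of the form "page,x,y,w,h;page,x,y,w,h;...".
public class CoordsUnion {
    // Returns {minX, minY, maxX, maxY} of all rectangles on `page`,
    // or null if the page has none.
    static double[] unionBox(String coords, int page) {
        double[] box = null;
        for (String group : coords.split(";")) {
            String[] f = group.split(",");
            if (Integer.parseInt(f[0]) != page) continue;
            double x = Double.parseDouble(f[1]);
            double y = Double.parseDouble(f[2]);
            double w = Double.parseDouble(f[3]);
            double h = Double.parseDouble(f[4]);
            if (box == null) {
                box = new double[]{x, y, x + w, y + h};
            } else {
                box[0] = Math.min(box[0], x);
                box[1] = Math.min(box[1], y);
                box[2] = Math.max(box[2], x + w);
                box[3] = Math.max(box[3], y + h);
            }
        }
        return box;
    }

    public static void main(String[] args) {
        // First two token boxes of fig_1 above, page 7.
        String coords = "7,53.09,230.09,101.28,7.88;7,156.69,228.44,5.98,5.25";
        System.out.println(Arrays.toString(unionBox(coords, 7)));
    }
}
```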
v6.2.xml -> v7.0.xml
https://gist.github.com/elonzh/f4e59232ddaded31ee23735f994ea4b6/revisions
both versions have the same issue.
- V6.2
- V7
I think this is due to the vector graphics; there's still something I need to fix to take them into account when aggregating the figure zones. This comes from pdfalto, which now provides the coordinates of vector graphics differently than before, and I think they are mostly not used at the moment (likely one of the reasons why figure recognition currently works so badly overall).
Thanks for your work! I will try to follow up on it and make some contributions to this project.
So indeed vector graphics are currently not considered when recognizing figures, leading to such errors (all these reported figures are vector graphics).
Some notes before I forget:
- To be taken into account currently, svg files must be generated, i.e. `processVectorGraphics` set to `true`. For more flexibility, we should introduce in pdfalto an option to generate the svg files but not the bitmap files, and use the svg option by default.
- We have one svg file per page when some vector graphics is present; this svg contains all the vector graphics of the page (so possibly several different graphic figures). In Grobid, `VectorGraphicBoxCalculator` is used to segment/aggregate non-trivial vector graphics elements. The issue is how the coordinates of these elements are currently found, using `XQueryProcessor`: it assumes some `@x`/`@y` coordinates are present to get a bounding box for an svg `<g>`. This does not always work when we simply have paths.
- To better parse svg, we should probably use Apache Batik instead of `XQueryProcessor`. Batik will parse the svg file and provide bounding boxes via `element.getBBox()` in `VectorGraphicBoxCalculator`.
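The `@x`/`@y` limitation in the second note can be illustrated with plain JDK XPath (a simplified stand-in for the XQuery-based logic, not GROBID's actual code):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Why an attribute-based bounding-box heuristic fails on path-only SVG:
// it counts descendant elements carrying @x/@y attributes, but a <g> built
// purely from <path d="..."> has none, so no box can be derived.
public class XYProbe {
    static int countXYElements(String svg) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            Document doc = dbf.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(svg.getBytes(StandardCharsets.UTF_8)));
            XPath xp = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xp.evaluate("//*[@x and @y]",
                    doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // <rect> carries explicit coordinates; <path> hides them inside @d.
        String withRect = "<svg><g><rect x='10' y='20' width='5' height='5'/></g></svg>";
        String pathOnly = "<svg><g><path d='M10 20 L15 25 Z'/></g></svg>";
        System.out.println(countXYElements(withRect)); // prints 1 -> a box can be derived
        System.out.println(countXYElements(pathOnly)); // prints 0 -> heuristic fails
    }
}
```

A GVT-based renderer like Batik computes geometry from the path data itself, which is why it handles this case.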
> for more flexibility we should introduce in pdfalto an option to generate the svg files but not the bitmap files and use the svg option by default
See https://github.com/kermitt2/pdfalto/issues/128
> to better parse svg, we probably should use Apache Batik instead of `XQueryProcessor`. Batik will parse the svg file and provide bounding boxes via `element.getBBox()` in `VectorGraphicBoxCalculator`.
Done in Grobid branch `fix-vector-graphics`: `XQueryProcessor` was replaced by parsing the SVG with Apache Batik. Bounding boxes of SVG elements are generated by Apache Batik too, leading to better support of SVG. We then reuse the existing vector box aggregation method, which leads to good figure content recognition.
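For reference, bounding boxes can be obtained with Batik roughly as follows (a sketch assuming the `batik-anim` and `batik-bridge` artifacts on the classpath, Batik 1.9+ package layout; the standard `GVTBuilder`/`getBBox()` calls are the Batik API, the surrounding wiring is illustrative, not the exact code in `VectorGraphicBoxCalculator`):

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.awt.geom.Rectangle2D;
import org.apache.batik.anim.dom.SAXSVGDocumentFactory;
import org.apache.batik.bridge.BridgeContext;
import org.apache.batik.bridge.DocumentLoader;
import org.apache.batik.bridge.GVTBuilder;
import org.apache.batik.bridge.UserAgentAdapter;
import org.apache.batik.util.XMLResourceDescriptor;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.w3c.dom.svg.SVGLocatable;
import org.w3c.dom.svg.SVGRect;

public class SvgBounds {
    static List<Rectangle2D> groupBounds(File svgFile) throws Exception {
        String parser = XMLResourceDescriptor.getXMLParserClassName();
        Document doc = new SAXSVGDocumentFactory(parser)
                .createDocument(svgFile.toURI().toString());
        // Build the GVT tree: this is what gives geometry to every element,
        // including pure <path> data with no @x/@y attributes.
        UserAgentAdapter agent = new UserAgentAdapter();
        BridgeContext ctx = new BridgeContext(agent, new DocumentLoader(agent));
        ctx.setDynamicState(BridgeContext.DYNAMIC);
        new GVTBuilder().build(ctx, doc);

        List<Rectangle2D> boxes = new ArrayList<>();
        NodeList groups = doc.getElementsByTagName("g");
        for (int i = 0; i < groups.getLength(); i++) {
            Element g = (Element) groups.item(i);
            // getBBox() is available once the bridge/GVT tree is built.
            SVGRect r = ((SVGLocatable) g).getBBox();
            if (r != null) {
                boxes.add(new Rectangle2D.Float(r.getX(), r.getY(),
                        r.getWidth(), r.getHeight()));
            }
        }
        return boxes;
    }
}
```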
But this then leads to two problems:
- Some SVG files are huge: for instance, Fig. 2 of the example above is a 60MB SVG file with 378,268 independent `<g>` elements, introducing the same number of bounding boxes. They are then well aggregated into one vector box, but this takes several seconds. There is maybe no real solution for this, because even rasterizing the svg into a bitmap takes several seconds too.
- The way figures and tables are currently recognized via the fulltext model and the layout tokens is not reliable enough, and probably won't be even with more training data. Many clean figure content boxes are found but lost because no figure sequence is found; conversely, incorrect figure sequences are found while there is no figure content box around.
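The aggregation step behind the first point can be sketched as an iterative pairwise merge of boxes that intersect after a small inflation, which also illustrates why hundreds of thousands of `<g>` boxes take seconds with a naive quadratic scan (simplified illustration, not GROBID's actual implementation):

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

// Simplified vector-box aggregation sketch: repeatedly merge rectangles
// that intersect after being inflated by a small margin, until no further
// merge is possible. Worst case is quadratic per pass.
public class BoxAggregator {
    static List<Rectangle2D> aggregate(List<Rectangle2D> boxes, double margin) {
        List<Rectangle2D> current = new ArrayList<>(boxes);
        boolean merged = true;
        while (merged) {
            merged = false;
            outer:
            for (int i = 0; i < current.size(); i++) {
                for (int j = i + 1; j < current.size(); j++) {
                    Rectangle2D a = current.get(i);
                    Rectangle2D b = current.get(j);
                    Rectangle2D grown = new Rectangle2D.Double(
                            a.getX() - margin, a.getY() - margin,
                            a.getWidth() + 2 * margin, a.getHeight() + 2 * margin);
                    if (grown.intersects(b)) {
                        current.set(i, a.createUnion(b));
                        current.remove(j);
                        merged = true;
                        break outer; // restart the scan after each merge
                    }
                }
            }
        }
        return current;
    }

    public static void main(String[] args) {
        List<Rectangle2D> boxes = new ArrayList<>();
        boxes.add(new Rectangle2D.Double(0, 0, 10, 10));
        boxes.add(new Rectangle2D.Double(11, 0, 10, 10)); // within margin 2
        boxes.add(new Rectangle2D.Double(100, 100, 5, 5)); // isolated
        System.out.println(aggregate(boxes, 2).size()); // prints 2
    }
}
```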
Proposal for the second point: redesign how figures and tables are recognized by removing them from the fulltext model and introducing dedicated segmentation models. These models would start from every aggregated graphics box in the document and try to extend these boxes with a dedicated sequence labeling model, capturing the blocks around them based on layout clues + text as usual.
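As a toy illustration of that direction, the growth step could look like the following, where a fixed distance rule stands in for the dedicated sequence labeling model (hypothetical `FigureZoneGrower`, not an actual GROBID class):

```java
import java.awt.geom.Rectangle2D;
import java.util.List;

// Toy illustration only: start from an aggregated graphics box and greedily
// absorb horizontally overlapping layout blocks within a vertical gap
// (e.g. a caption line below the graphic). A real implementation would rank
// candidate blocks with a sequence labeling model over layout + text
// features instead of this fixed threshold.
public class FigureZoneGrower {
    static Rectangle2D grow(Rectangle2D graphics, List<Rectangle2D> blocks,
                            double maxGap) {
        Rectangle2D zone = (Rectangle2D) graphics.clone();
        boolean extended = true;
        while (extended) {
            extended = false;
            for (Rectangle2D block : blocks) {
                if (zone.contains(block)) continue;
                // Vertical gap between the zone and the candidate block
                // (negative when they already overlap vertically).
                double gap = Math.max(block.getMinY() - zone.getMaxY(),
                                      zone.getMinY() - block.getMaxY());
                boolean hOverlap = block.getMaxX() > zone.getMinX()
                        && block.getMinX() < zone.getMaxX();
                if (hOverlap && gap <= maxGap) {
                    zone = zone.createUnion(block);
                    extended = true;
                }
            }
        }
        return zone;
    }
}
```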
Same here. +1