Wrong figure recognition

Open elonzh opened this issue 3 years ago • 5 comments

Example paper: https://journals.aps.org/prc/abstract/10.1103/PhysRevC.100.014306 (same as #781)

Fig 1 (missed)

image

Fig 2 (wrong head and figDesc)

<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0" coords="5,513.20,380.04,36.03,7.88;5,304.15,391.00,4.48,7.88;5,308.63,388.87,4.66,5.98;5,308.63,395.50,2.99,5.25;5,313.79,391.00,32.61,7.88;5,348.49,389.35,5.98,5.25;5,355.00,390.21,193.95,8.97;5,304.15,401.96,245.08,8.57;5,304.15,412.13,245.08,9.29;5,304.15,423.09,245.09,8.97;5,304.15,434.05,173.68,8.97">
	<head>B and the 1 + 1 0</head>
	<label>11</label>
	<figDesc>state of 10 B. The panels (a) are calculated with the THSR + pair wave function pair with optimized parameters. The panels (b) are obtained by using only the pairing term p with parameter c = 0. For all these calculations, β parameters are set to optimized values in the corresponding THSR + pair wave functions.</figDesc>
</figure>

image

Fig 3 (correct)

<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1" coords="7,53.09,230.09,101.28,7.88;7,156.69,228.44,5.98,5.25;7,163.20,228.28,123.01,9.69;7,41.14,240.26,245.09,8.97;7,41.14,252.00,104.99,7.88">
	<head>FIG. 3 .</head>
	<label>3</label>
	<figDesc>FIG. 3. Energy curve of the 10 B(3 + 0) with respect to the parameter d. The parameter c is set to be d = 1 − c. Other parameters are fixed at the optimized values.</figDesc>
</figure>

image

v6.2.xml -> v7.0.xml

https://gist.github.com/elonzh/f4e59232ddaded31ee23735f994ea4b6/revisions

Both versions have the same issue.

  • #FF005F V6.2
  • #ffd400 V7

image

elonzh avatar Jun 30 '21 09:06 elonzh

I think this is due to the vector graphics; there is still something I need to fix so that they are taken into account when aggregating the figure zones. The cause is pdfalto, which now provides the coordinates of vector graphics differently than before, and I think they are mostly unused now (likely one of the reasons why figures currently work so badly overall).

kermitt2 avatar Jul 01 '21 09:07 kermitt2

Thanks for your work, I will try to follow up on your work and make some contributions to this project.

elonzh avatar Jul 01 '21 09:07 elonzh

So indeed vector graphics are currently not considered when recognizing figures, leading to such errors (all these reported figures are vector graphics).

Some notes before I forget:

  • to be taken into account currently, svg files must be generated and processVectorGraphics must be set to true; for more flexibility we should introduce in pdfalto an option to generate the svg files but not the bitmap files and use the svg option by default

  • we have one svg file per page whenever the page contains vector graphics, and this svg contains all the vector graphics of the page (so possibly several different graphic figures). In Grobid, VectorGraphicBoxCalculator is used to segment/aggregate non-trivial vector graphics elements. The issue is how the coordinates of these elements are currently found, using XQueryProcessor. It assumes that @x and @y coordinates are present to get a bounding box for each svg <g>. This does not always work when we simply have paths.

  • to better parse svg, we probably should use Apache Batik instead of XQueryProcessor. Batik will parse the svg file and provide bounding boxes via element.getBBox() in VectorGraphicBoxCalculator.
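
To illustrate the second note, here is a minimal sketch of why a bounding box has to be derived from path geometry when a <g> carries no @x/@y attributes. It is not Grobid or Batik code: the class PathBoundsSketch and its bounds method are hypothetical, and it assumes the path data uses only absolute M and L commands. Relative commands, H/V, curves, arcs and transforms are not handled, which is the kind of work a real SVG library such as Batik would take over.

```java
import java.awt.geom.Rectangle2D;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration only; not part of Grobid or Batik. */
final class PathBoundsSketch {
    // Matches one SVG number: optional sign, digits with optional fraction, optional exponent.
    private static final Pattern NUMBER =
            Pattern.compile("[-+]?(?:\\d+\\.?\\d*|\\.\\d+)(?:[eE][-+]?\\d+)?");

    /**
     * Bounding box of a path "d" attribute made only of absolute M/L commands.
     * ASSUMPTION: every number pair is an absolute (x, y) point.
     * Returns null when no complete coordinate pair is found.
     */
    static Rectangle2D.Double bounds(String pathData) {
        Matcher m = NUMBER.matcher(pathData);
        double minX = Double.POSITIVE_INFINITY, minY = Double.POSITIVE_INFINITY;
        double maxX = Double.NEGATIVE_INFINITY, maxY = Double.NEGATIVE_INFINITY;
        boolean any = false;
        while (m.find()) {
            double x = Double.parseDouble(m.group());
            if (!m.find()) {
                break; // dangling x without a matching y: ignore it
            }
            double y = Double.parseDouble(m.group());
            minX = Math.min(minX, x);
            maxX = Math.max(maxX, x);
            minY = Math.min(minY, y);
            maxY = Math.max(maxY, y);
            any = true;
        }
        return any ? new Rectangle2D.Double(minX, minY, maxX - minX, maxY - minY) : null;
    }
}
```

For example, PathBoundsSketch.bounds("M10 20 L30 5 L25 40") yields a box at (10, 5) with width 20 and height 35, even though the element has no x or y attribute at all.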

kermitt2 avatar Jul 03 '21 18:07 kermitt2

for more flexibility we should introduce in pdfalto an option to generate the svg files but not the bitmap files and use the svg option by default

See https://github.com/kermitt2/pdfalto/issues/128

to better parse svg, we probably should use Apache Batik instead of XQueryProcessor. Batik will parse the svg file and provide bounding boxes via element.getBBox() in VectorGraphicBoxCalculator.

Done in the Grobid branch fix-vector-graphics: XQueryProcessor was replaced by SVG parsing with Apache Batik. The bounding boxes of SVG elements are now generated by Apache Batik too, leading to better SVG support. We then reuse the existing vector box aggregation method, which leads to good figure content recognition.

But this then leads to two problems:

  • Some SVG files are huge: for instance, Fig. 2 of the example above is a 60MB SVG file with 378,268 independent <g> elements, which introduces the same number of bounding boxes. They are then aggregated correctly into one vector box, but it takes several seconds. There may be no real solution for this, because even rasterizing the svg into a bitmap takes several seconds too.

  • The way figures and tables are currently recognized via the fulltext model and the layout tokens is not reliable enough and probably won't be even with more training data. Many clean figure content boxes are found but lost because no figure sequence is found for them. Similarly, incorrect figure sequences are found where there is no figure content box around.
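
As a rough illustration of the first problem, the sketch below merges bounding boxes that lie within a margin of each other into larger vector boxes. It is not the aggregation method used in Grobid: the class BoxAggregationSketch, the merge method and the margin parameter are all assumptions made for the example. A naive pairwise approach like this one can compare each new box against every box kept so far, which hints at why hundreds of thousands of boxes are costly to aggregate.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical sketch; not Grobid's VectorGraphicBoxCalculator. */
final class BoxAggregationSketch {

    /** True when the two boxes overlap or are at most `margin` apart on both axes. */
    static boolean near(Rectangle2D a, Rectangle2D b, double margin) {
        return a.getMinX() - margin <= b.getMaxX() && b.getMinX() <= a.getMaxX() + margin
            && a.getMinY() - margin <= b.getMaxY() && b.getMinY() <= a.getMaxY() + margin;
    }

    /**
     * Greedily merges boxes that are near each other into their union.
     * Each incoming box absorbs every kept box it is near, and the scan is
     * repeated because the grown union may now reach further boxes.
     */
    static List<Rectangle2D> merge(List<Rectangle2D> boxes, double margin) {
        List<Rectangle2D> kept = new ArrayList<>();
        for (Rectangle2D box : boxes) {
            Rectangle2D current = box;
            boolean merged;
            do {
                merged = false;
                Iterator<Rectangle2D> it = kept.iterator();
                while (it.hasNext()) {
                    Rectangle2D other = it.next();
                    if (near(current, other, margin)) {
                        current = current.createUnion(other); // new rectangle, inputs untouched
                        it.remove();
                        merged = true;
                    }
                }
            } while (merged);
            kept.add(current);
        }
        return kept;
    }
}
```

With a margin of 3, the boxes (0, 0, 10, 10) and (12, 0, 10, 10) collapse into a single box (0, 0, 22, 10), while a distant box such as (100, 100, 5, 5) stays separate.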

Proposal for the second point: redesign how figures and tables are recognized by removing them from the full text model and introducing their own segmentation models. These models would start from every aggregated graphics box in the document and try to extend these boxes with a dedicated sequence labeling model that captures the surrounding blocks, based on layout clues + text as usual.
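
One way to picture the starting point of this proposal is the candidate-gathering step below: given an aggregated graphics box, it collects the text blocks that sit directly above or below it, ordered by distance. This is only a sketch under assumptions. The class FigureCandidateSketch, the Block type and the maxGap threshold are hypothetical and do not exist in Grobid, and the actual decision about which blocks belong to the figure would be left to the dedicated sequence labeling model, which is not shown.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch of candidate gathering around a graphics box. */
final class FigureCandidateSketch {

    /** A text block with page coordinates; a stand-in for a real layout block. */
    static final class Block {
        final String text;
        final Rectangle2D.Double box;

        Block(String text, double x, double y, double w, double h) {
            this.text = text;
            this.box = new Rectangle2D.Double(x, y, w, h);
        }
    }

    /** Vertical distance between two boxes, 0 when they overlap vertically. */
    static double verticalGap(Rectangle2D a, Rectangle2D b) {
        if (b.getMinY() > a.getMaxY()) return b.getMinY() - a.getMaxY();
        if (a.getMinY() > b.getMaxY()) return a.getMinY() - b.getMaxY();
        return 0.0;
    }

    /** True when the horizontal extents of the two boxes overlap. */
    static boolean overlapsHorizontally(Rectangle2D a, Rectangle2D b) {
        return a.getMinX() < b.getMaxX() && b.getMinX() < a.getMaxX();
    }

    /**
     * Blocks in the same column as the graphic and at most maxGap away
     * vertically, nearest first. A labeling model would then decide which
     * of them (caption, label, ...) actually belong to the figure.
     */
    static List<Block> candidates(Rectangle2D graphic, List<Block> blocks, double maxGap) {
        List<Block> out = new ArrayList<>();
        for (Block b : blocks) {
            if (overlapsHorizontally(graphic, b.box) && verticalGap(graphic, b.box) <= maxGap) {
                out.add(b);
            }
        }
        out.sort(Comparator.comparingDouble((Block b) -> verticalGap(graphic, b.box)));
        return out;
    }
}
```

Starting from the graphics box rather than from the text sequence means a clean figure content box is never dropped just because no figure sequence was labeled nearby.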

kermitt2 avatar Jul 08 '21 13:07 kermitt2

Same here. +1

officialsuyogdixit avatar Feb 02 '22 08:02 officialsuyogdixit