Coordinates of caption elements

Open keto33 opened this issue 1 year ago • 5 comments

This may seem unnecessary, but it should be a feasible feature suggestion.

GROBID outputs all coordinates of structures except for text blocks. I am mostly interested in the coordinates of figure captions. When figures are embedded as EPS in vector format rather than raster/bitmap, GROBID does not correctly detect the bounding box of the figure, as drawings and texts are somehow blended into the PDF structure rather than being a distinguishable stream. In such cases, the bounding box of the figure caption can be helpful in estimating the actual bounding box of the EPS figure.

keto33, May 01 '23 10:05

Hi @keto33 !

Thanks for the issue.

> GROBID outputs all coordinates of structures except for text blocks.

Yes, text blocks are not part of the TEI XML output because they are presentation/layout elements, not something related to the logical structure of the document (like paragraphs, titles, etc.).

> I am mostly interested in the coordinates of figure captions. When figures are embedded as EPS in vector format rather than raster/bitmap, GROBID does not correctly detect the bounding box of the figure, as drawings and texts are somehow blended into the PDF structure rather than being a distinguishable stream.

Yes, the coordinates of the caption elements are indeed not output currently, and there is no reason not to do it.

Regarding the "graphic part" of a figure, this is more or less implemented in PR #963 (the whole PR is not usable at this stage; it is very much a work in progress). There, the vector graphics are further analyzed to detect their boundaries, deal with overlapping text, etc., so that we have reliable "figure graphic" aggregated elements similar to the embedded bitmaps. There are many other things in this PR, and it will take a lot of time to complete!

kermitt2, May 01 '23 10:05

Hello!

Is there an ongoing effort or a specific branch where coordinates of text blocks can be extracted as part of the TEI/XML output?

I checked the documentation and I saw p elements are under teiCoordinates, and I am running this command:

curl --form input=@./Papers/test.pdf --form teiCoordinates='head' --form teiCoordinates='p' host:8070/api/processFulltextDocument

However, there are no coordinates for the p elements, which are the ones I'm interested in.
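To confirm which elements actually carry coordinates in a response, a minimal Python sketch along these lines could help. It assumes the TEI uses the standard TEI namespace and that each `coords` value is a semicolon-separated list of boxes, each written as a page number followed by four numbers (the exact order of those numbers should be checked against the GROBID documentation). The function names `coords_summary` and `parse_coords` and the sample XML are made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

TEI_NS = "http://www.tei-c.org/ns/1.0"  # standard TEI namespace


def coords_summary(tei_xml, tags=("head", "p")):
    """Return {tag: (elements_with_coords, total_elements)} for each tag."""
    root = ET.fromstring(tei_xml)
    with_coords, total = Counter(), Counter()
    for tag in tags:
        for el in root.iter(f"{{{TEI_NS}}}{tag}"):
            total[tag] += 1
            if el.get("coords"):
                with_coords[tag] += 1
    return {tag: (with_coords[tag], total[tag]) for tag in tags}


def parse_coords(value):
    """Split a coords value into (page, n1, n2, n3, n4) tuples.

    Assumption: boxes are separated by ';' and fields by ','.
    """
    boxes = []
    for box in value.split(";"):
        if not box.strip():
            continue
        page, *rest = box.split(",")
        boxes.append((int(page), *map(float, rest)))
    return boxes


# Hypothetical response fragment: the head has coordinates, the p does not.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body><div>
<head coords="1,53.80,194.57,450.46,136.12">Introduction</head>
<p>First paragraph without coordinates.</p>
</div></body></text></TEI>"""

print(coords_summary(SAMPLE))
```

Running the same function on the saved output of the curl call above would show at a glance whether `p` elements come back with zero coordinates while `head` elements have them.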

Please let me know if there is a solution or anything I can do to assist!

ClementFrvl, Aug 11 '24 10:08

Hi @ClementFrvl, which version are you using? This seems to be a problem with GROBID version 0.8.0 that disappears on GROBID's master branch. 🤔

lfoppiano, Aug 11 '24 14:08

Hey, I am using 0.8.0, that may be the reason why.

My server is ARM-based, though. I just tried version 0.7.3, but I'm having the same issue.

Is there a newer ARM version available?

ClementFrvl, Aug 11 '24 15:08

We've been working on a new version for a few weeks; hopefully we will be able to release it soon.

lfoppiano, Aug 11 '24 19:08