Is it possible to generate graphic elements with coords without saving the images into storage

Open ehapmgs opened this issue 3 years ago • 20 comments

I am trying to crop the images from the PDF using the coords attributes of the graphic elements, but it looks like the graphic elements are not generated unless pdfAssetPath in org/grobid/core/engines/config/GrobidAnalysisConfig.java is set. Is it possible to get their coords in the PDF without extracting those images into storage?

ehapmgs avatar Feb 10 '21 18:02 ehapmgs

It looks like the images are copied from the temp dir to pdfAssetPath, so I don't think pdfAssetPath needs to be set in order to generate graphic elements with coords. Can this be considered a bug, since the code does not generate graphic elements unless pdfAssetPath is set?

ehapmgs avatar Feb 11 '21 20:02 ehapmgs

Hello @ehapmgs !

You are right, if the asset path is not defined, the embedded graphics are not extracted and not referenced in the TEI with their coordinates. It's not very clear why we would like to crop these embedded graphics from the PDF while they can be extracted in a more "original" format by Grobid upstream.

Now to be honest I don't know anymore what to do with these graphics embedded in the PDF... the feature is still there, but I am really not sure if it's of any real use.

A few years ago I thought it would be nice to extract them and reference them in the figures where they appear. However, they are not always usable: we can have hundreds or thousands of embedded graphics in one PDF. It's hard to manage in general (in particular the SVG) and can lead to errors. We might rather want to crop the figure or figure content, which would ensure that we have a global graphic element, and not hundreds of pieces as could be the case. So my question would be: would you not prefer to use the bounding box coordinates associated with the figure element for the crop rather than the ones of the embedded graphics?

kermitt2 avatar Feb 11 '21 21:02 kermitt2

Hey @kermitt2

The coords associated with the figure element seem a bit different when pdfAssetPath is set vs. not set.

When pdfAssetPath is set, they point to the graphic element, which would be useful for my use case:
[Screen Shot 2021-02-11 at 11 40 23 PM]

When pdfAssetPath is not set, it looks like they point to the caption of the figure or the text around the image itself:
[Screen Shot 2021-02-11 at 11 37 30 PM]

Am I missing something?

ehapmgs avatar Feb 11 '21 21:02 ehapmgs

You're not missing anything, this is exactly as expected.

It's not very clear why we would like to crop these embedded graphics from the PDF while they can be extracted in a more "original" format by Grobid upstream.

So I don't really understand your use case... if you want the graphic elements and you can get them as extracted from the PDF without bothering with a crop, why not setting pdfAssetPath?

kermitt2 avatar Feb 11 '21 23:02 kermitt2

We probably want to have the coordinates at figure level for the global figure zone, and the coordinates for the figure content (the graphics) in an element under <figure>? So independently of pdfAssetPath and of graphic elements as external files, correct?

kermitt2 avatar Feb 11 '21 23:02 kermitt2

Sorry, I got a bit confused. What are the coords associated with the ‘figure’ element really pointing at? In the first screenshot there is only one set of coords, and it points to the image inside the figure.

But in the second screenshot, where ‘pdfAssetPath’ is not set, there is a set of coordinates pointing to different lines of the figure caption but not to the image. How can I crop the image itself in this case, and why do the coordinates change when ‘pdfAssetPath’ is set versus when it is null?

ehapmgs avatar Feb 12 '21 00:02 ehapmgs

I would say in principle the coordinates of <figure> are the bounding boxes of all contained elements.

In the second case, the graphic element is not available so the area misses that part. Without pdfAssetPath graphic elements are not available apparently, so it may be a problem or a limitation of the call to pdfalto.

In the first case, indeed, selecting only the graphic element for the coordinates is not consistent, and we should have a composition of the bounding boxes of the graphic element and the captions/figure title rather than selecting just the graphic element. I don't remember why only the graphic element coordinates are used at figure level, agree that's a bug!

kermitt2 avatar Feb 12 '21 00:02 kermitt2
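
A minimal Python sketch of the "composition of bounding boxes" idea discussed above, assuming each coords value is a `;`-separated list of `page,x,y,w,h` boxes. The helper names `parse_coords` and `union_by_page` are hypothetical and not part of Grobid, and the sample values are only illustrative.

```python
from collections import defaultdict


def parse_coords(coords):
    """Parse 'page,x,y,w,h;page,x,y,w,h;...' into a list of tuples."""
    boxes = []
    for chunk in coords.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue
        page, x, y, w, h = chunk.split(",")
        boxes.append((int(page), float(x), float(y), float(w), float(h)))
    return boxes


def union_by_page(boxes):
    """Return {page: (x, y, w, h)}: one box per page enclosing all boxes on it."""
    grouped = defaultdict(list)
    for page, x, y, w, h in boxes:
        # Store each box as (left, top, right, bottom) to make min/max easy.
        grouped[page].append((x, y, x + w, y + h))
    merged = {}
    for page, rects in grouped.items():
        left = min(r[0] for r in rects)
        top = min(r[1] for r in rects)
        right = max(r[2] for r in rects)
        bottom = max(r[3] for r in rects)
        merged[page] = (left, top, right - left, bottom - top)
    return merged


# Illustrative input: one caption line and one graphic on page 8.
sample = "8,159.48,396.48,368.20,6.80;8,159.48,78.01,416.47,310.28"
boxes = parse_coords(sample)
merged = union_by_page(boxes)
print(merged)
```

Grouping by page matters because a figure's boxes are not guaranteed to sit on a single page in this sketch; a crop would then be done once per page.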

I pushed a quick fix for the first case in branch fix-#710, but overall it looks like it is working very badly (I mean the recognition of figures in general), and the url attributes for the <graphic> are apparently not working any more (they should point to the actual file under the local asset directory where the resulting XML is produced). All the figure and table recognition would need some more work and attention, and probably all the "assets" part should be deprecated to just rely on a crop directly on the PDF.

kermitt2 avatar Feb 12 '21 17:02 kermitt2

I've worked a bit on this again - this is mainly a regression problem in pdfalto as compared to pdf2xml, and we should have something closer to what is expected now:

  • the graphics elements attached to figures are present with coordinates (coordinates are always added currently for the different <graphics>)
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4">
    <head>Figure 3 :</head>
    <label>3</label>
    <figDesc>Figure 3: Augmented PDF using the Softcite text mining tool: mentions of software and their attributes are display on top of the PDF as HTML dynamic layout, via the standard PDF.js library (left part). The user can interact directly in situ with the annotations, opening info boxes with Wikidata disambiguated information and local consolidated bibliographical reference relevant to the software.</figDesc>
    <graphic coords="9,317.96,86.60,246.75,221.90" type="bitmap" />
</figure>
  • path to external svg and png files when extracted have been corrected:
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1">
    <head>Figure 1 :</head>
    <label>1</label>
    <figDesc>Figure 1: Overview of the Softcite software extraction pipeline. Blue boxes represent the main processing components, and ovals the different data results.</figDesc>
    <graphic url="qUQH5GGp99.lxml_data/image-1.png" coords="4,53.80,86.60,247.50,288.90" type="bitmap" />
</figure>
  • when coordinates are generated for figure/table, we have the global figure area coordinates (including graphics, caption, figure title) at the <figure> element and the coordinates of the graphic element
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1" coords="8,159.48,396.48,368.20,6.80;8,159.48,404.22,403.81,8.53;8,159.48,415.47,398.85,7.62;8,558.31,413.69,4.96,4.53;8,159.48,423.66,313.98,8.08;8,159.48,440.47,152.70,6.80;8,159.48,78.01,416.47,310.28">
    <head>Fig 2 .</head>
    <label>2</label>
    <figDesc>Fig 2. Inhibition on LAMP assays by chitosan. Observed LAMP threshold times for EF samples containing varying concentrations of chitosan (0, 0.01, 0.1, 1 g L -1 ). Chitosan completely inhibited LAMP at concentrations above 1 g L -1 . Each data point represents the mean threshold time (t T ) from (n = 3) LAMP assays. Treatments significantly different than control (0 g L -1 chitosan, green dot) are indicated by asterisk (* p&lt;0.05) Error bars are standard errors of the mean. https://doi.org/10.1371/journal.pone.0244956.g002</figDesc>
    <graphic coords="8,159.48,78.01,416.47,310.28" type="bitmap" />
</figure>
  • I've added the figure reference/crop in the PDF annotation demo in the console, to better visualize the extracted figures. There's much more work needed on this, even just to get it working again at the level it was working in the past (the version used at ResearchGate).

Screen Shot 2021-06-07 at 16 01 22

It works with table too:

Screen Shot 2021-06-07 at 16 00 23

kermitt2 avatar Jun 07 '21 14:06 kermitt2
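
For readers who want to pull these attributes out of the TEI, here is a rough Python sketch using only the standard library. It assumes the output looks like the `<figure>` snippets above (TEI namespace, `coords` on `<figure>` and on `<graphic>`). The function name `figure_coords` and the shortened sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET

TEI = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def figure_coords(tei_xml):
    """List the coords found on each <figure> and on its <graphic> children."""
    root = ET.fromstring(tei_xml)
    results = []
    # iter() walks the whole tree, including the root element itself.
    for fig in root.iter(f"{{{TEI}}}figure"):
        graphics = [
            g.get("coords")
            for g in fig.findall(f"{{{TEI}}}graphic")
            if g.get("coords")
        ]
        results.append(
            {
                "id": fig.get(XML_ID),
                "figure_coords": fig.get("coords"),  # None when absent
                "graphic_coords": graphics,
            }
        )
    return results


# Shortened, hand-written sample in the shape of the snippets above.
SAMPLE = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>'
    '<figure xml:id="fig_1" coords="8,159.48,396.48,368.20,6.80;8,159.48,78.01,416.47,310.28">'
    "<head>Fig 2 .</head>"
    '<graphic coords="8,159.48,78.01,416.47,310.28" type="bitmap" />'
    "</figure>"
    '<figure xml:id="fig_4"><head>Figure 3 :</head></figure>'
    "</body></text></TEI>"
)

found = figure_coords(SAMPLE)
for entry in found:
    print(entry)
```

A figure without any `<graphic>` child simply yields an empty `graphic_coords` list, which makes it easy to spot figures whose graphics were not recognized.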

Sorry the shitty auto-close strikes again ;)

kermitt2 avatar Jun 07 '21 21:06 kermitt2

Nicely done @kermitt2. Do we still need to set pdfAssetPath when we only need the coordinates of the graphic, as in cases 1 and 3?

ehapmgs avatar Jun 08 '21 19:06 ehapmgs

@ehapmgs no you don't need to set pdfAssetPath. The graphics coordinates are available also when images are not extracted. Normally it works with SVG too (with actual coordinates, not page size).

kermitt2 avatar Jun 08 '21 23:06 kermitt2

I tested the latest master version, seems figure coords are only returned in /api/referenceAnnotations

elonzh avatar Jul 01 '21 07:07 elonzh

It looks good to me with the TEI results too:

Screenshot from 2021-07-01 11-04-19

kermitt2 avatar Jul 01 '21 09:07 kermitt2

Seems an error case related to https://github.com/kermitt2/grobid/issues/787

I was testing the graphic output using that paper.

elonzh avatar Jul 01 '21 09:07 elonzh

Hello @kermitt2, What do I see here? Can I convert a PDF to TEI and show the results like this? Converting text and images with Grobid works great but for tables not so much, if I can show them like that (image?) it would be really helpful.

Jacob

It works with table too:

Screen Shot 2021-06-07 at 16 00 23

Jacob-Jan avatar Feb 18 '22 11:02 Jacob-Jan

Hi @Jacob-Jan

What do I see here? Can I convert a PDF to TEI and show the results like this?

Yes, you can get the coordinates of the figure and table "areas" with @coords on <figure>, as indicated above, in the resulting TEI. You can also get these coordinates in JSON with the service /api/referenceAnnotations https://grobid.readthedocs.io/en/latest/Grobid-service/#apireferenceannotations (you get the coordinates of all the reference markers and of the "objects" they reference, including tables). The demo you see (the Grobid console) uses the JSON answer with coordinates and simply makes a crop on the displayed PDF (several PDF libraries can do that).

Converting text and images with Grobid works great but for tables not so much, if I can show them like that (image?) it would be really helpful.

Identifying the table areas is not working very well either for the moment, I must say, but it works better than the lower-level parsing of the table in the TEI. You can do a crop in the PDF of the area via the coordinates. I am working on a new approach to identify and structure figures and tables (after reading a few papers on this), and I hope it's going to work much better in a few months.

kermitt2 avatar Feb 22 '22 17:02 kermitt2
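
As a hedged illustration of asking the service for a TEI with figure coordinates, the sketch below builds (but does not send) a multipart POST with the Python standard library. It assumes a Grobid server on the default local address `http://localhost:8070`, the `/api/processFulltextDocument` endpoint with the PDF in the `input` field, and the `teiCoordinates` form parameter from the service documentation. The function name `build_fulltext_request` is hypothetical, and the exact parameter values should be checked against the documentation of the installed version.

```python
import uuid
import urllib.request


def build_fulltext_request(pdf_bytes, base_url="http://localhost:8070",
                           elements=("figure",)):
    """Build a multipart/form-data POST asking for coords on `elements`."""
    boundary = uuid.uuid4().hex
    parts = []
    # One form field per element name for which coordinates are requested.
    for name in elements:
        parts.append(
            (
                f"--{boundary}\r\n"
                'Content-Disposition: form-data; name="teiCoordinates"\r\n'
                "\r\n"
                f"{name}\r\n"
            ).encode("utf-8")
        )
    # The PDF itself goes in the "input" file field.
    parts.append(
        (
            f"--{boundary}\r\n"
            'Content-Disposition: form-data; name="input"; filename="input.pdf"\r\n'
            "Content-Type: application/pdf\r\n"
            "\r\n"
        ).encode("utf-8")
        + pdf_bytes
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return urllib.request.Request(
        f"{base_url}/api/processFulltextDocument",
        data=b"".join(parts),
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


# Placeholder bytes only; a real call would read an actual PDF file.
req = build_fulltext_request(b"%PDF-1.4 placeholder")
print(req.full_url)
```

With a running server, passing the request to `urllib.request.urlopen` and reading the response should return the TEI XML as bytes. The request is kept separate from the network call here so that it can be inspected without a server.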

Yes you can get the coordinates of the figure and table "areas" with @coords on <figure> as indicated above in the resulting TEI. You can also get these coordinates in JSON with the service /api/referenceAnnotations https://grobid.readthedocs.io/en/latest/Grobid-service/#apireferenceannotations (you get all the reference markers coordinates and the "objects" they reference, including tables). The demo you see (the Grobid console) uses the JSON answer with coordinates and simply make a crop on the displayed PDF (several PDF libraries can do that).

@kermitt2 Hello, thank you for your work. I have a question. I submitted this article, in which Figure 4 and Figure 5 are both figures like this one: [image]

However, the coords of the figure in the result returned by Grobid contain 3 coordinates. What does this mean? [image]

By the way, what is the difference between the coords of graphic and the coords of figures?

thanks a lot for your reply.

pureblacker avatar Jun 01 '23 07:06 pureblacker

Hi @pureblacker

Thanks for testing Grobid !

Normally (when it works), the coordinates of the full figure (including figure title, captions, graphics) are given by the @coords attribute on the <figure> element. The graphics are the embedded bitmap or the vector graphic (SVG) part(s) of the figure. Users can then do a crop on the full figure to also get the caption and title, for example, or maybe they are just interested in the graphic part, not the text.

Sometimes the caption of the figure has a bibliographical reference marker (<ref>) which can also have its own coordinates.

The next version of Grobid should improve the figure recognition very significantly with new dedicated models and much more extensive analysis of the vector graphics.

kermitt2 avatar Jun 01 '23 08:06 kermitt2
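
To make the crop step concrete, here is a small Python sketch that converts one `x,y,w,h` box into a pixel rectangle for a page rendered at a given resolution. It assumes the coordinates are expressed in PDF points (72 per inch) with the origin at the top-left corner of the page, which should be checked against the Grobid coordinates documentation. The function name `to_pixel_box` is hypothetical.

```python
import math


def to_pixel_box(box, dpi=150):
    """Convert (x, y, w, h) in PDF points to (left, top, right, bottom) pixels.

    Assumes a top-left origin and a page image rendered at `dpi`.
    """
    x, y, w, h = box
    scale = dpi / 72.0
    # Round outwards so the crop never cuts into the figure.
    left = math.floor(x * scale)
    top = math.floor(y * scale)
    right = math.ceil((x + w) * scale)
    bottom = math.ceil((y + h) * scale)
    return left, top, right, bottom


pixel_box = to_pixel_box((72, 36, 144, 72), dpi=144)
print(pixel_box)
```

The returned tuple uses the (left, upper, right, lower) order that image libraries such as Pillow expect for a crop box, so it can be applied to a page image produced by whichever PDF renderer is in use.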

Just a quick question, how would one go about extracting all the figures' graphics coordinates assuming you're using the Grobid docker image and python client? I'm able to get ~4-5 figures to attach graphic coordinates out of 30 in a document. A lot of the time the XML generated doesn't even include graphics.

SulRash avatar Mar 11 '24 13:03 SulRash