grobid general paragraph text wrongly recognized as "figDesc/div/p"

What is your OS and architecture? Windows is not supported and Mac OS arm64 is not yet supported. For non-supported OS, you can use Docker (https://grobid.readthedocs.io/en/latest/Grobid-docker/)

I am running a Docker container from the image lfoppiano/grobid:0.8.0 (docker pull lfoppiano/grobid:0.8.0).

I also tested v0.7.3.

What is your Java version (java --version)?

I only used the official Docker image: lfoppiano/grobid

In case of build or run errors, please submit the error while running gradlew with --stacktrace and --info for better log traces (e.g. ./gradlew run --stacktrace --info) or attach the log file logs/grobid-service.log.

There is no such log file, since I am using Docker.

Problem

General paragraph text that does not belong to a figure is wrongly recognized as a figDesc.
Part of the text wrongly recognized as figDesc also appears in the general paragraph text "body/div/p".
- This means it is repeated in two parts of the TEI XML: "body/figure/figDesc/div/p" and "body/div/p"
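The duplication described above can be checked mechanically. The following is only a minimal sketch, assuming the TEI output uses the standard TEI namespace and that the duplicated text matches exactly after whitespace normalization; the function name `find_duplicated_figdesc_text`, the `min_chars` threshold, and the sample document are hypothetical and are not part of GROBID.

```python
import xml.etree.ElementTree as ET

# Standard TEI namespace (assumed to be the one used in the result file).
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def _text(elem):
    """Return the whitespace-normalized text content of an element."""
    return " ".join("".join(elem.itertext()).split())


def find_duplicated_figdesc_text(tei_xml, min_chars=40):
    """Return "body/div/p" paragraphs whose text also occurs in a figDesc.

    Hypothetical helper: uses plain substring containment in either
    direction, so partially overlapping text is not detected.
    """
    root = ET.fromstring(tei_xml)
    body = root.find(".//tei:body", TEI_NS)
    if body is None:
        return []
    fig_descs = [_text(fd) for fd in body.iterfind(".//tei:figDesc", TEI_NS)]
    duplicates = []
    # "tei:div/tei:p" only matches paragraphs directly under body/div,
    # so paragraphs nested inside figure/figDesc are not counted here.
    for p in body.iterfind("tei:div/tei:p", TEI_NS):
        text = _text(p)
        if len(text) < min_chars:
            continue
        for fd in fig_descs:
            if len(fd) >= min_chars and (text in fd or fd in text):
                duplicates.append(text)
                break
    return duplicates


# Synthetic sample, not taken from the reported PDF.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<div>
<p>This synthetic paragraph is repeated inside the figure description below.</p>
<p>This other synthetic paragraph appears only once in the document body.</p>
</div>
<figure><figDesc><div>
<p>Fig. 1. Synthetic caption.</p>
<p>This synthetic paragraph is repeated inside the figure description below.</p>
</div></figDesc></figure>
</body></text></TEI>"""

for paragraph in find_duplicated_figdesc_text(SAMPLE):
    print("duplicated:", paragraph)
```

To check a real result, read the TEI file into a string and pass it to the function; a fuzzy comparison would be needed to catch partial overlaps.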

original pdf area

extracted xml

Reference materials

Used pdf

176_liu2010.pdf

Result tei xml

Note: GitHub does not accept .xml files, so I changed the file extension to .txt.

176_liu2010.pdf.tei.xml.txt

Jan 25 '24 01:01 sawyerzheng

Thanks @sawyerzheng for reporting the issue.

Indeed, there are two problems:

- The paragraph is wrongly labeled as a figure. This is a common problem that we are (slowly) working on in PR https://github.com/kermitt2/grobid/pull/963. For the time being, we could add your example as training data. Unfortunately, because this is Elsevier and the article is copyrighted, it's not possible to redistribute it as training data. Nevertheless, should you find the same problem in other papers with a Creative Commons licence, we could use them as test cases.
- The paragraph starting from "In inert gas" is duplicated and out of order. This is probably related to the figure processing. Please give me a couple of weeks, I should be able to fix it.

Jan 25 '24 07:01 lfoppiano

Thank you very much for your time.

So far, I have only been able to find one example PDF. If I come across a non-copyrighted PDF with a similar problem in the future, I will upload it here.

Jan 25 '24 09:01 sawyerzheng

I found an open-access PDF that has a similar problem.

This parse result is from the GROBID GPU Docker image: `grobid/grobid:0.7.2`

pdf: https://www.nature.com/articles/s41597-024-03160-z s41597-024-03160-z.pdf

Apr 01 '24 08:04 sawyerzheng

Indeed. Thanks for finding an example, we will surely add it to the training data.

Apr 01 '24 11:04 lfoppiano
