grobid
general paragraph text wrongly recognized as "figDesc/div/p"
- What is your OS and architecture? Windows is not supported and Mac OS arm64 is not yet supported. For non-supported OS, you can use Docker (https://grobid.readthedocs.io/en/latest/Grobid-docker/)
I am using the official Docker image, pulled with docker pull lfoppiano/grobid:0.8.0
I also tested v0.7.3.
- What is your Java version (java --version)?
I only used the official Docker image: lfoppiano/grobid
- In case of build or run errors, please submit the error while running gradlew with --stacktrace and --info for better log traces (e.g. ./gradlew run --stacktrace --info) or attach the log file logs/grobid-service.log.
There is no such file, since I am using Docker.
Problem
- General paragraph text that does not belong to a figure is wrongly recognized as a figDesc.
- Part of the text wrongly recognized as figDesc also appears in the general paragraph text, "body/div/p".
- This means it is repeated in two parts of the TEI XML: "body/figure/figDesc/div/p" and "body/div/p".
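For anyone who wants to check their own TEI output for the same duplication, here is a minimal Python sketch. It is only an illustration: the function name find_duplicated_paragraphs and the min_chars threshold are made up for this example, and it assumes the duplicated text appears verbatim (after whitespace normalization) inside a figDesc, which may not hold for every document.

```python
import xml.etree.ElementTree as ET

# TEI namespace used by GROBID's TEI XML output.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def normalized_text(element):
    """Return all text inside an element with whitespace collapsed."""
    return " ".join("".join(element.itertext()).split())


def find_duplicated_paragraphs(tei_xml, min_chars=40):
    """List body/div/p paragraphs whose text also occurs inside a figDesc.

    min_chars is a hypothetical threshold that skips very short
    paragraphs, which would otherwise match by accident.
    """
    root = ET.fromstring(tei_xml)
    body = root.find(".//tei:body", TEI_NS)
    if body is None:
        return []

    # Text of every figure description found under the body.
    fig_descs = [
        normalized_text(fig_desc)
        for fig_desc in body.iterfind(".//tei:figure/tei:figDesc", TEI_NS)
    ]

    duplicates = []
    # Only direct body/div/p paragraphs, i.e. the general paragraph text.
    for paragraph in body.iterfind("tei:div/tei:p", TEI_NS):
        text = normalized_text(paragraph)
        if len(text) >= min_chars and any(text in desc for desc in fig_descs):
            duplicates.append(text)
    return duplicates
```

Calling find_duplicated_paragraphs on the contents of a result TEI file returns the body paragraphs that are also embedded in a figure description. Exact substring matching is a deliberately simple choice; partially overlapping text would need fuzzier comparison.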
original pdf area
extracted xml
Reference materials
Used pdf
Result tei xml
Note: GitHub does not accept .xml files, so I changed the extension to .txt.
Thanks @sawyerzheng for reporting the issue.
Indeed, there are two problems:
- The paragraph is wrongly labeled as a figure. This is a common problem that we are (slowly) working on in PR https://github.com/kermitt2/grobid/pull/963. For the time being, we could add your example as training data; unfortunately, because this is Elsevier and the article is copyrighted, it is not possible to redistribute it as training data. Nevertheless, should you find the same problem in other papers with a Creative Commons licence, we could use them as test cases.
- The paragraph starting from "In inert gas" is duplicated and out of order. This is most likely related to the figure processing. Please give me a couple of weeks; I should be able to fix it.
Thank you very much for your time.
So far, I have only been able to find one example PDF. If I come across a non-copyrighted PDF with a similar problem in the future, I will upload it here.
I found an open-access PDF that has a similar problem.
This parse result is from the GROBID GPU Docker image: grobid/grobid:0.7.2
pdf: https://www.nature.com/articles/s41597-024-03160-z s41597-024-03160-z.pdf
Indeed. Thanks for finding an example; we will surely add it to the training data.