general paragraph text wrongly recognized as "figDesc/div/p"

Open sawyerzheng opened this issue 1 year ago • 4 comments

  • What is your OS and architecture? Windows is not supported and Mac OS arm64 is not yet supported. For non-supported OS, you can use Docker (https://grobid.readthedocs.io/en/latest/Grobid-docker/)

I am using the Docker image lfoppiano/grobid:0.8.0 (obtained with docker pull lfoppiano/grobid:0.8.0).

I also tested v0.7.3.

  • What is your Java version (java --version)?

I just used the official Docker image: lfoppiano/grobid

  • In case of build or run errors, please submit the error while running gradlew with --stacktrace and --info for better log traces (e.g. ./gradlew run --stacktrace --info) or attach the log file logs/grobid-service.log.

There is no such file, since I am using Docker.


Problem

  1. General paragraph text that does not belong to a figure is wrongly recognized as a figDesc.
  2. Part of the text wrongly recognized as figDesc also appears in the general paragraph text, "body/div/p".
    • This means it is repeated in two parts of the TEI XML: "body/figure/figDesc/div/p" and "body/div/p".
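
The duplication described above can be checked mechanically. The following is a minimal sketch, not part of GROBID: it assumes the TEI output uses the standard TEI namespace and the element paths reported in this issue ("body/figure/figDesc/.../p" and "body/div/p"). The function name `find_duplicated_text` and the `min_words` threshold are hypothetical choices made for illustration.

```python
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher

# Standard TEI namespace, which GROBID output is assumed to use.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def paragraph_words(elem):
    """Return the words of an element's full text (whitespace-normalized)."""
    return "".join(elem.itertext()).split()


def find_duplicated_text(tei_xml, min_words=8):
    """Yield word runs that occur both in a figDesc paragraph and in a body paragraph.

    min_words is an arbitrary threshold (an assumption) to ignore short, accidental overlaps.
    """
    root = ET.fromstring(tei_xml)
    body = root.find(".//tei:body", TEI_NS)
    if body is None:
        return
    # Paths follow the issue report; adjust them if your TEI is nested differently.
    fig_paragraphs = body.findall("tei:figure/tei:figDesc//tei:p", TEI_NS)
    body_paragraphs = body.findall("tei:div/tei:p", TEI_NS)
    for fig_p in fig_paragraphs:
        fig_words = paragraph_words(fig_p)
        for body_p in body_paragraphs:
            body_words = paragraph_words(body_p)
            # autojunk=False keeps frequent words from being ignored in long paragraphs.
            matcher = SequenceMatcher(None, fig_words, body_words, autojunk=False)
            m = matcher.find_longest_match(0, len(fig_words), 0, len(body_words))
            if m.size >= min_words:
                yield " ".join(fig_words[m.a:m.a + m.size])
```

To use it, read the TEI file into a string (for example the attached 176_liu2010.pdf.tei.xml.txt), pass it to `find_duplicated_text`, and print each yielded run; every run is a passage that appears in both places.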

original pdf area

image

extracted xml

image

Reference materials

Used pdf

176_liu2010.pdf

Result tei xml

Note: GitHub does not accept .xml files, so I changed the file extension to .txt.

176_liu2010.pdf.tei.xml.txt

sawyerzheng avatar Jan 25 '24 01:01 sawyerzheng

Thanks @sawyerzheng for reporting the issue.

Indeed, there are two problems:

  1. the paragraph is wrongly labeled as a figure. This is a common problem that we are (slowly) working on in PR https://github.com/kermitt2/grobid/pull/963. For the time being, we could add your example as training data; unfortunately, because this is an Elsevier article and it is copyrighted, it's not possible to redistribute it as training data. Nevertheless, should you find the same problem in other papers with a Creative Commons licence, we could use it as a test case.

  2. the paragraph starting from "In inert gas" is duplicated and out of order. This is probably related to the figure processing. Please give me a couple of weeks; I should be able to fix it.

lfoppiano avatar Jan 25 '24 07:01 lfoppiano

Thank you very much for your time.

So far, I have only been able to find one example PDF. If I come across a non-copyrighted PDF with a similar problem in the future, I will upload it here.

sawyerzheng avatar Jan 25 '24 09:01 sawyerzheng

I found one open-access PDF that has a similar problem.

This parse result is from the GROBID GPU Docker image: grobid/grobid:0.7.2

Snipaste_2024-04-01_16-32-30

image


image

pdf: https://www.nature.com/articles/s41597-024-03160-z s41597-024-03160-z.pdf

sawyerzheng avatar Apr 01 '24 08:04 sawyerzheng

Indeed. Thanks for finding an example; we will surely add it to the training data.

lfoppiano avatar Apr 01 '24 11:04 lfoppiano