grobid issues

Lists implementation for fulltext model

14

Hi @kermitt2, I've noticed that list items are excluded from being labeled by the fulltext model. Are you interested in their implementation? Maybe you are considering putting them into...

Vitaliy-1

processFulltextDocument fails on 0.23% arXiv PDFs

6

I ran processFulltextDocument on 22103 arXiv PDFs: 22053 PDFs succeeded and 50 failed. Environment: macOS on an M2 chip, Java version 17.0.10, server started with Gradle (`./gradlew run`). An example error...

MarksonChen

bug

implemented

Errors: [BAD_INPUT_DATA] PDF to XML conversion failed with error code: 134 and 139

3

Hi mighty developers, I am using GROBID for research in which I need to extract text (processFulltextDocument) from some company annual report PDF files. I know GROBID is designed for academic...

RANN9

Sentence segmentation error case

1

This is an error case worth recording, as it causes some trouble with the sentence segmentation. The document is not CC-BY; it is referenced here: https://dx.doi.org/10.1063/1.1874292 Here is the `delinquent` paragraph: With version...

lfoppiano

bug

implemented

Repo is not available from Maven

2

I have used a Java client in my Maven project. The problem is that the URL where grobid-core is located no longer exists: https://grobid.s3.eu-west-1.amazonaws.com/repo Please, upload or...

mladenbabic

Internal Server Error

9

Hey, so I'm running GROBID on my Mac as a REST service, and on a batch of about 400 documents, a couple of them have this error (file attached). [error.txt](https://github.com/kermitt2/grobid/files/407876/error.txt)...

Singularity9971

macOS-specific

GROBID not able to extract the header numbers when they're in Roman or in alphabets

1

Hi, I am new to GROBID and really need help. I am trying to extract the section headers, and while they do appear normally in the tag, it does not...

alwaysaditi

Avoid replacing DOIs with shorter ones

1

When we search for a DOI in the page, the regex may truncate DOIs that are split by a line break, so this PR proposes a simple fix, which is to...

lfoppiano

bug

Move the period outside the <idno> tag

1

As in the title; more info in #1126.

lfoppiano

DOI extraction

2

I've noticed that there are some cases where the DOI is correctly extracted from the article header but is then mangled in the output. Example: [origin9833693929434438741.pdf](https://github.com/user-attachments/files/15757039/origin9833693929434438741.pdf) In this article...

lfoppiano
