grobid
A machine learning software for extracting information from scholarly documents
Hi @kermitt2, I've noticed that list items are excluded from being labeled by the fulltext model. Are you interested in their implementation? Maybe you are considering putting them into...
I ran processFulltextDocument on 22103 arXiv PDFs. 22053 PDFs succeeded and 50 failed. Environment: macOS on an M2 chip; Java version 17.0.10; server started with Gradle (`./gradlew run`). An example error...
Hi mighty developers, I am using GROBID for research in which I need to extract text (processFulltextDocument) from some company annual report PDF files. I know GROBID is designed for academic...
This is an error case worth recording, as it causes some trouble with sentence segmentation. The document is not CC-BY; it is referenced here: https://dx.doi.org/10.1063/1.1874292 Here is the `delinquent` paragraph: With version...
I have used a Java client in my Maven project. The problem is that the URL where grobid-core was located no longer exists: https://grobid.s3.eu-west-1.amazonaws.com/repo Please, upload or...
Hey, so I'm running GROBID on my Mac as a REST service, and in a batch of about 400 documents, a couple of them produce this error (file attached). [error.txt](https://github.com/kermitt2/grobid/files/407876/error.txt)...
Hi, I am new to GROBID and really need help. I am trying to extract the section headers, and while they do appear normally in the tag, it does not...
When we search for a DOI in the page, the regex may truncate DOIs that are split by a line break, so this PR proposes a simple fix, which is to...
As in the title; more info in #1126.
I've noticed that there are some cases where the DOI is correctly extracted from the article header but is then mangled in the output. Example: [origin9833693929434438741.pdf](https://github.com/user-attachments/files/15757039/origin9833693929434438741.pdf) In this article...