grobid Some PDFs crash the server

Some PDFs with unusual structures (I share this one for example) cause the entire server to crash with the error:

> Task :grobid-service:run FAILED

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':grobid-service:run'.
> Process 'command '/usr/lib/jvm/java-11-openjdk-amd64/bin/java'' finished with non-zero exit value 139

* Try:
Run with --stacktrace option to get the stack trace. Run with --info or --debug option to get more log output. Run with --scan to get full insights.

* Get more help at https://help.gradle.org

Deprecated Gradle features were used in this build, making it incompatible with Gradle 8.0.

You can use '--warning-mode all' to show the individual deprecation warnings and determine if they come from your own scripts or plugins.

See https://docs.gradle.org/7.2/userguide/command_line_interface.html#sec:command_line_warnings

I suspect the problem is the long list of authors in the references.

Sep 09 '23 14:09 keto33

Hello @keto33 !

Thank you for the issue.

I could not reproduce the problem. On my side, with current master, the PDF you provided works fine, with no crash or apparent failure (I attach the XML result). It also works on the demo (https://kermitt2-grobid.hf.space/).

Can you give more information so that we can try to reproduce the problem: version, environment, model used (CRF or DL), concurrency...

The list of authors in the references is indeed a bit unusual, but not so rare.

test-pdf.tei.xml.zip

Sep 09 '23 15:09 kermitt2

Thanks for the prompt check, @kermitt2.

I have a clean installation of Grobid 0.7.3 (everything as defaults, except for blocksMax and tokensMax, which I increased to handle large files) on Ubuntu 22.04. I checked it several times; this PDF and a few other ones caused the server to crash. I checked on another machine (again, Ubuntu 22.04).

I uploaded two more PDFs for your reference. I was able to parse test5 via the Demo Website, but not test6.

Screenshot from 2023-09-10 11-07-35

Sep 10 '23 10:09 keto33

Thank you again @keto33 for these error cases. This is super useful !

The two cases should be fixed with c1bf0a2ec1fe678e568df62f43ba5ca210179e13. They are very weird cases, leading both to some combinatorial problems, so I simply added circuit breakers to avoid server failures.

test6.pdf has, for some reason, a block with more than 170,000 lines, which was an issue when recomposing the labeling with the segmentation model. It is likely a problem from pdfalto, and I will investigate it in that other repository.
test5.pdf is a very long PDF (almost 400 pages) with around 1000 abstracts. The authors and the affiliations of each abstract were well recognized, but all concatenated into 2 huge lists of authors and affiliations. When trying to attach the right authors to their affiliations (all with the same marker numbers), the current algorithm was going nuts.
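
As a rough illustration of what such circuit breakers can look like (this is not the code from the commit, and Grobid itself is written in Java), here is a minimal Python sketch. The function names, the data layout and both limits are hypothetical and chosen only for the example: one guard rejects a block with an implausible number of lines, the other estimates the number of author/affiliation pairings before doing the work and skips the step when the estimate is too large.

```python
from collections import defaultdict

# Hypothetical limits, chosen only for illustration.
MAX_LINES_PER_BLOCK = 10_000
MAX_PAIRINGS = 100_000


def block_is_processable(lines):
    """Circuit breaker 1: refuse blocks with an implausible number of lines."""
    return len(lines) <= MAX_LINES_PER_BLOCK


def attach_affiliations(authors, affiliations):
    """Pair authors with the affiliations that share their marker.

    Both arguments are lists of (name, marker) tuples. Returns a dict
    mapping each author name to a list of affiliation names, or None
    when the circuit breaker trips and the attachment step is skipped.
    """
    by_marker = defaultdict(list)
    for aff_name, marker in affiliations:
        by_marker[marker].append(aff_name)

    # Circuit breaker 2: estimate the number of pairings before pairing.
    pairings = sum(len(by_marker.get(marker, ())) for _, marker in authors)
    if pairings > MAX_PAIRINGS:
        return None

    attached = defaultdict(list)
    for name, marker in authors:
        attached[name].extend(by_marker.get(marker, ()))
    return dict(attached)
```

With a thousand authors and a thousand affiliations that all carry the same marker, the estimate reaches one million pairings, so the sketch returns None instead of building the lists; the caller can then leave the authors unattached rather than bring the server down.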

Nov 14 '23 13:11 kermitt2

Sorry for the stupid auto-close

Nov 18 '23 19:11 kermitt2

We have the same problem. We run the lfoppiano/grobid:0.8.0 Docker image in a Docker Swarm environment on Azure and Netcup infrastructure. Unfortunately, at irregular intervals the Grobid service takes down not only the container but also the entire VM/host system. Do you have this problem, and do you know what to do about it?

Mar 22 '24 08:03 zeitderforschung

@zeitderforschung could you share more information? What resources are allocated to the machine? Some logs would be also helpful.

Mar 24 '24 09:03 lfoppiano

Two Grobid services with default config and a 4G memory resource limit on a 3 x 16GB machine cluster. This should not crash the entire Docker environment. I read about the --init parameter in the GitHub issues. Is it possible that, without it, the resource limits have no effect at all because zombie processes consume a lot of memory and trigger the OOM killer?

Mar 24 '24 18:03 zeitderforschung

@zeitderforschung we added tini back after 0.8.0 was released, so there might be some problems related to zombie pdfalto processes. Could you try to run it using the --init parameter?
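
For reference, a minimal sketch of what running with --init can look like, written as a small Python helper that only builds the `docker run` command line. The helper name and its defaults are illustrative assumptions: the 4g limit mirrors the setup described above, and 8070 is assumed to be the port the Grobid service listens on.

```python
import shlex


def grobid_run_command(image="lfoppiano/grobid:0.8.0", memory="4g", port=8070):
    """Build a `docker run` command that starts Grobid under an init process."""
    return [
        "docker", "run", "--rm",
        "--init",               # run an init as PID 1 that reaps zombie children
        "--memory", memory,     # container memory limit (assumed value)
        "-p", f"{port}:8070",   # host port mapped to the assumed service port
        image,
    ]


if __name__ == "__main__":
    # Print the command instead of executing it, so it can be reviewed first.
    print(shlex.join(grobid_run_command()))
```

In a Compose file, the counterpart of the flag should be `init: true` on the service definition.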

Mar 25 '24 00:03 lfoppiano

Yes, I have added the init parameter to the docker compose file and will see if that fixes the problem. Thanks for your help!

Mar 25 '24 07:03 zeitderforschung

@lfoppiano I switched to the lfoppiano/grobid:latest-develop image, which uses tini and seems more robust. Setting --init alone with the 0.8.0 release caused VM crashes. However, this information is not reliable, as I have no idea what is really happening and whether it is an infrastructure or a service issue.

I have also noticed that when I process the same PDF several times in a row on Hugging Face, the response times vary greatly, from 3s to 20s, and up to a minute.

Mar 27 '24 09:03 zeitderforschung

@zeitderforschung that's good. Just bear in mind that lfoppiano/grobid:latest-develop might contain hidden gems 😄 because it's an automatic build that might be coming from either master or a PR. 😅 it's a bit of a surprise box.

I pushed a new 0.8.0 version that is the same as before but with an updated Dockerfile, which includes tini: https://hub.docker.com/layers/lfoppiano/grobid/0.8.0/images/sha256-72ea75c660304ee005027f417beb7c695abf0a62f5c64d5d793027984e7bbd2c?context=explore

The Hugging Face service that we have deployed is only for demos or small-scale processing, so I would not read too much into its performance.

Mar 27 '24 21:03 lfoppiano

@lfoppiano You are absolutely right, using dev builds is like playing Russian roulette 😁 We will definitely switch to the new 0.8.0 release, thanks a lot!

Mar 27 '24 22:03 zeitderforschung

How can I deploy on Kubernetes with --init?

May 09 '24 08:05 haochun

Hi @haochun, we can't. With Kubernetes you need to use a Docker image built with tini, e.g. grobid/grobid:0.8.1-SNAPSHOT, which does the same as --init. (grobid/grobid:0.8.0 did not include tini, so don't use it on Kubernetes!)

May 09 '24 09:05 kermitt2

Hello @haochun, we can't. For Kubernetes, you need to use a Docker image built with tini, e.g. grobid/grobid:0.8.1-SNAPSHOT, which does the same as --init. (grobid/grobid:0.8.0 does not include tini, so don't use it on Kubernetes!)

So, can I resolve this problem by changing the image to grobid/grobid:0.8.1-SNAPSHOT?

May 09 '24 09:05 haochun

So, can I resolve this problem by changing the image to grobid/grobid:0.8.1-SNAPSHOT?

yes!

You can also use the lightweight version lfoppiano/grobid:0.8.0 if you don't need the deep learning models; it was rebuilt with tini too.

May 09 '24 09:05 kermitt2
