grobid
Some PDFs crash the server
Some PDFs with unusual structures (I share this one for example) cause the entire server to crash with the error:
> Task :grobid-service:run FAILED
FAILURE: Build failed with an exception.
* What went wrong:
Execution failed for task ':grobid-service:run'.
> Process 'command '/usr/lib/jvm/java-11-openjdk-amd64/bin/java'' finished with non-zero exit value 139
* Try:
Run with --stacktrace option to get the stack trace. Run with --info or --debug option to get more log output. Run with --scan to get full insights.
* Get more help at https://help.gradle.org
Deprecated Gradle features were used in this build, making it incompatible with Gradle 8.0.
You can use '--warning-mode all' to show the individual deprecation warnings and determine if they come from your own scripts or plugins.
See https://docs.gradle.org/7.2/userguide/command_line_interface.html#sec:command_line_warnings
I suspect the problem is the long list of authors in the references.
Hello @keto33 !
Thank you for the issue.
I could not reproduce the problem. On my side, with current master, the PDF you provided is processed without any crash or apparent failure (I attach the XML result). It works on the demo too (https://kermitt2-grobid.hf.space/).
Can you give more information so that we can try to reproduce the problem: version, environment, model used (CRF or DL), concurrency...
The list of authors in the references is indeed a bit unusual, but not that rare.
Thanks for checking so promptly, @kermitt2
I have a clean installation of Grobid 0.7.3 (everything at defaults, except for blocksMax and tokensMax, which I increased to handle large files) on Ubuntu 22.04. I checked it several times; this PDF and a few other ones caused the server to crash. I checked on another machine (again, Ubuntu 22.04).
I uploaded two more PDFs for your reference. I was able to parse test5 via the Demo Website, but not test6.
Thank you again @keto33 for these error cases. This is super useful !
The two cases should be fixed with c1bf0a2ec1fe678e568df62f43ba5ca210179e13. They are very weird cases, both leading to combinatorial problems, so I simply added circuit breakers to avoid server failures.
- test6.pdf has for some reason a block with more than 170,000 lines... which was an issue when recomposing the labeling with the segmentation model. It is likely a problem from pdfalto, and I will investigate it in this other repository
- test5.pdf is a very long PDF (almost 400 pages) with around 1000 abstracts. The authors and the affiliations of each abstract were well recognized, but all concatenated in 2 mega huge lists of authors and affiliations. When trying to attach the right authors and their affiliations (all with the same marker numbers), the current algorithm was going nuts.
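To illustrate the idea of a circuit breaker for the test5.pdf case, here is a minimal Python sketch. It is not the code from the commit above (Grobid itself is written in Java); the function name attach_affiliations, the max_comparisons parameter and its default value are all hypothetical. The guard simply refuses to run the marker-matching step when the two lists are so large that the pairing would blow up, and returns a degraded result instead of stalling the server.

```python
from typing import Dict, List, Tuple


def attach_affiliations(
    authors: List[Tuple[str, str]],       # (author name, marker)
    affiliations: List[Tuple[str, str]],  # (affiliation text, marker)
    max_comparisons: int = 1_000_000,     # hypothetical threshold, not Grobid's
) -> Dict[str, List[str]]:
    """Map each author name to the affiliations that share its marker.

    If the naive pairing would need more than max_comparisons comparisons,
    the breaker trips and every author is returned with no affiliations.
    Note: this sketch keys on the author name, so duplicate names collide.
    """
    result: Dict[str, List[str]] = {name: [] for name, _ in authors}

    # Circuit breaker: check the cost before doing the expensive work.
    if len(authors) * len(affiliations) > max_comparisons:
        return result

    for name, marker in authors:
        for text, aff_marker in affiliations:
            if marker == aff_marker:
                result[name].append(text)
    return result
```

With two authors and two affiliations the pairing runs normally; lowering max_comparisons below 4 makes the same call return authors with empty affiliation lists, which is the "fail soft" behaviour a breaker is meant to give.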
Sorry for the stupid auto-close
We have the same problem. We run lfoppiano/grobid:0.8.0 docker image in a docker swarm environment on Azure and Netcup infrastructure. Unfortunately, Grobid Service does not only shut down the container, but also the entire VM/host system at irregular intervals. Do you have this problem and do you know what to do about it?
@zeitderforschung could you share more information? What resources are allocated to the machine? Some logs would be also helpful.
Two grobid services with default config and 4G memory resource limit on a 3 x 16GB machine cluster. Should not crash the entire Docker environment. I read about the --init parameter in the GitHub issues, is it possible that without it the resource limits have no effect at all because of zombie processes that consume a lot of memory and trigger the oom killer?
@zeitderforschung we added tini back after 0.8.0 was released, so there might be some problems related to zombie pdfalto processes. Could you try running it with the --init parameter?
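For context on why an init process matters here: a child process that has exited stays in the process table as a zombie until its parent collects its exit status. The Python sketch below shows that reaping step on a POSIX system. It is only an illustration of the concept, not how tini or Docker's --init is implemented, and the function name reap_zombies is made up for this example.

```python
import os
from typing import List


def reap_zombies() -> List[int]:
    """Collect the exit status of every child that has already terminated.

    Returns the PIDs that were reaped. Never blocks: children that are
    still running are left alone.
    """
    reaped: List[int] = []
    while True:
        try:
            # -1 means "any child"; WNOHANG makes the call non-blocking.
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process has no children at all
        if pid == 0:
            break  # children exist, but none has exited yet
        reaped.append(pid)
    return reaped
```

An init process typically runs a loop like this whenever it receives SIGCHLD, so orphaned helper processes get cleaned up instead of accumulating inside the container.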
Yes, I have added the init parameter to the docker compose file and will see if that fixes the problem. Thanks for your help!
@lfoppiano I switched to the lfoppiano/grobid:latest-develop image, which uses tini and seems more robust. Setting --init alone with the 0.8.0 release caused VM crashes. However, this information is not reliable as I have no idea what is really happening and whether it is an infrastructure or service issue.
I have noticed that when I process the same PDF several times in a row on Hugging Face, the response times vary greatly, from 3s to 20s, up to a minute?
@zeitderforschung that's good. Just bear in mind that lfoppiano/grobid:latest-develop might contain hidden gems 😄 because it's an automatic build that might be coming from either master or a PR. 😅 It's a bit of a surprise box.
I pushed a new 0.8.0 version that is the same as before but with an updated Dockerfile, which includes tini: https://hub.docker.com/layers/lfoppiano/grobid/0.8.0/images/sha256-72ea75c660304ee005027f417beb7c695abf0a62f5c64d5d793027984e7bbd2c?context=explore
The Hugging Face service that we have deployed is only for demos or small-scale processing; I would not worry too much about its performance.
@lfoppiano You are absolutely right, using dev builds is like playing russian roulette 😁 We will definitely switch to the new 0.8.0 release, thanks a lot!
How do I deploy on Kubernetes with --init?
Hi @haochun, we can't. With Kubernetes you need to use a Docker image built with tini, e.g. grobid/grobid:0.8.1-SNAPSHOT, which does the same as --init. (grobid/grobid:0.8.0 did not include tini, so don't use it on Kubernetes!)
So, can I resolve this problem by changing the image to grobid/grobid:0.8.1-SNAPSHOT?
yes!
You can also use the lightweight version lfoppiano/grobid:0.8.0 if you don't need the deep learning models; it was rebuilt with tini too.