GROBID disk I/O issues (temp directory config related)

Open bnewbold opened this issue 2 years ago • 3 comments

In our use of GROBID, we have machines with a reasonable number of cores and RAM (eg, 30 cores, 40GB RAM), but poor disk I/O. This makes it important to have GROBID not write to disk, or to use a ramdisk (aka, virtual RAM-backed partition) if it must (eg, for interaction with pdfalto).

In the past it was possible to configure grobid.temp to point to, eg, /run/grobid/tmp, which we configured on Linux to be a ramdisk. In newer versions of GROBID, it looks like this doesn't work any more, due to this change: https://github.com/kermitt2/grobid/commit/c8e11b8d8f6cf3fd7824091d0ee4b3b731520661#diff-65f7e37a114e9b9339efbb8ec03c4b19aec2f6998f127d539b6a07b01aa9b303L360-R362

Eg, if we use YAML to configure:

grobid:
    temp: "/run/grobid/tmp"

then I can see GROBID writing PDF files to: /srv/grobid/grobid-service-0.7.0-131-gdd0251d9f/grobid-home/run/grobid/tmp/origin2651762335153943539.pdf (the configured value is treated as a path relative to grobid-home, not as an absolute path).

I don't know the Java APIs well enough to recommend an alternative function to use, but it seems like it should be possible to use grobid-home as a prefix for relative paths, but fall back and allow absolute paths if the grobid.temp variable is an absolute path.
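The fallback behaviour described above can be sketched with the standard java.nio.file API. This is only an illustration, not GROBID's actual code: the class name TempPathResolver and the method resolveTemp are hypothetical. It relies on the documented behaviour of Path.resolve, which returns its argument unchanged when that argument is already absolute.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper for illustration; not part of GROBID. */
public final class TempPathResolver {

    private TempPathResolver() {
    }

    /**
     * Resolves the configured temp directory against grobid-home.
     * Relative values end up under grobid-home; absolute values are kept as they are.
     */
    public static Path resolveTemp(String grobidHome, String configuredTemp) {
        Path home = Paths.get(grobidHome);
        // Path.resolve returns the given path itself if it is absolute,
        // otherwise it joins it onto the base path.
        return home.resolve(configuredTemp).normalize();
    }
}
```

Under these assumptions, resolveTemp("/srv/grobid/grobid-home", "/run/grobid/tmp") would stay at /run/grobid/tmp on Linux, while resolveTemp("/srv/grobid/grobid-home", "tmp") would land under grobid-home as before.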

Separately, I can also see files like /tmp/MIME2368838021331894851.tmp getting written, and it seems like the GROBID java process is writing these. I think this is due to Jersey? I vaguely remember being able to control the location these get written using the TMPDIR UNIX environment variable in the past, but that doesn't seem to be working. It would be great to be able to control this location, or just have it be the same as grobid.temp.
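A possible explanation, offered as an assumption rather than a confirmed diagnosis: if those MIME*.tmp files are created through the JDK's File.createTempFile, their location comes from the java.io.tmpdir system property, which the JVM on Linux defaults to /tmp without consulting TMPDIR. The probe below is a minimal sketch for checking where such files land; the class TmpDirProbe is hypothetical and the "MIME" prefix only mimics the file names observed above.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical probe for illustration; not part of GROBID or Jersey. */
public final class TmpDirProbe {

    private TmpDirProbe() {
    }

    /** Returns the directory the JVM uses for default temp files. */
    public static String configuredTmpDir() {
        return System.getProperty("java.io.tmpdir");
    }

    /** Creates a temp file the same way many libraries do, in the default temp directory. */
    public static File createProbe() {
        try {
            // The two-argument form places the file in java.io.tmpdir.
            File probe = File.createTempFile("MIME", ".tmp");
            probe.deleteOnExit();
            return probe;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

If that assumption holds, starting the JVM with -Djava.io.tmpdir=/run/grobid/tmp (for example through the JAVA_TOOL_OPTIONS environment variable) should move these files. The property needs to be set at JVM startup, since changing it at runtime is not reliable.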

A workaround for the first issue (absolute paths not possible) is to create a symlink to the location I want. I can't think of a way to do that for the second problem without making the entire /tmp directory a ramdisk or symlink, which could have other unintended consequences.

bnewbold avatar Dec 07 '21 23:12 bnewbold

+1 as I am dealing with a similar problem. Setting the temporary directory using a config file would be helpful.

iiLaurens avatar Jul 04 '22 15:07 iiLaurens

I've quickly made a PR (#932) that uses the temporary directory as is if the path is absolute, and resolves it as before if the path is relative. Maybe you can test it. 😅

lfoppiano avatar Jul 05 '22 01:07 lfoppiano

PR tested and merged!

kermitt2 avatar Jul 05 '22 16:07 kermitt2