invenio
bst_create_icons does not finish with certain documents
We tried to create a bunch of icons for our FullTexts collection, as they were suppressed upon original ingestion. Thus we called
$ inv $ib/bibtasklet -N createicons -T bst_create_icons -a recid=False -a collection=FullTexts -a icon_sizes=180,640,1440 -u admin
(Note: the calling syntax for handling a whole collection is not clear, cf. issue #2192. The above worked with a slightly modified version of the tasklet that ignores the recid altogether and goes for the collection right away.)
This started the process, but it seems that icon creation fails for certain documents. Unfortunately, the call to the external tools does not return, so the bibtasklet hangs in the bibsched queue and, even worse, blocks other tasks from proceeding, as the tasklet stays in "About going to sleep" forever. There is no message indicating a hanging job in the tasklet's logs.
An example of a failing document can be found here: http://bib-pubdb1.desy.de/record/139620
To be protected against never-ending external processes we have invenio.shellutils.run_process_with_timeout
def run_process_with_timeout(args, filename_in=None, filename_out=None, filename_err=None, cwd=None, timeout=CFG_MISCUTIL_DEFAULT_PROCESS_TIMEOUT, sudo=None):
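To illustrate the idea behind such a helper, here is a minimal sketch using only the Python 3 standard library. It is not the Invenio implementation (Invenio 1.1 targets Python 2), and the name `run_with_timeout` is made up for this example. It assumes a POSIX system, since it relies on process groups to kill the external tool together with any helpers it spawned.

```python
import os
import signal
import subprocess
import sys


def run_with_timeout(args, timeout):
    """Run args; return (returncode, stdout, stderr), or None if killed on timeout."""
    # start_new_session=True puts the child into its own process group,
    # so anything it spawns can be killed together with it (POSIX only).
    proc = subprocess.Popen(
        args,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        start_new_session=True,
    )
    try:
        out, err = proc.communicate(timeout=timeout)
    except subprocess.TimeoutExpired:
        try:
            # The child is the group leader, so its pid is the group id.
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the group vanished between the timeout and the kill
        proc.communicate()  # reap the child so that no zombie is left behind
        return None
    return proc.returncode, out, err
```

Calling `run_with_timeout([sys.executable, "-c", "import time; time.sleep(60)"], timeout=1)` stands in for a hanging external tool: the child is killed after about one second and the function returns `None` instead of blocking the caller.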
@ludmilamarian have you had something similar on CDS?
Given what is mentioned in #2192 this is related to Invenio 1.1.3.
Tracking it further down, it seems pdftk is not finishing its job.
Actually, we found that pdftk cannot handle the file in question and fails. Sometimes it returns, sometimes it just hangs. In the latter case it behaves pretty badly, eating up 100% CPU, so you might end up with quite a load if bst_create_icons runs against a larger collection. (Luckily, however, it hangs at the point in question, so if you clean up by hand and get rid of all the zombies afterwards... ;)
Martin found some mentions on the web (unfortunately he didn't give me a pointer) that there is, was, or still is some issue with signal handling if pdftk is called via system()/exec()/fork() or friends from python, php or the like. Perhaps this is of help. Perhaps the pdftk requirement should end up at >x.yy?
Our pdftk is v1.44 from SL5.10.
@egabancho @ludmilamarian Any updates on this one?
Unfortunately we did not experience this issue, so we can't really provide a solution. What I can say is that we did large amounts of conversions, and everything was ok for us. We are currently using pdftk v2.02. I assume things are better now, the last message on this thread was in 2014.
> I assume things are better now, the last message on this thread was in 2014.
...assuming that pdftk and its toolchain were updated in the meantime, this might be the case, yes. Note that the quite common SL 6.x still ships v1.44.
Note also that if you have PDFs from a broad range of publishers/processing tools, it may well happen that some parts of the toolchain cannot handle them properly. (As usual, you're quite lucky in HEP here as it is, again as usual, quite homogeneous.)
Anyway, I'm not sure if it's possible to detect such a hanging tool and kill the job in these cases, say by some timeout detection.
Such functionality exists in Invenio: https://github.com/inveniosoftware/invenio/blob/maint-1.1/modules/miscutil/lib/shellutils.py#L158
but it is not used to create icons: https://github.com/inveniosoftware/invenio/blob/maint-1.1/modules/websubmit/lib/websubmit_icon_creator.py#L311
Is there anyone willing to contribute a fix?
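As a starting point for whoever picks this up, here is a rough sketch of what a fix could achieve at the batch level: each external call gets a timeout, a hanging or failing document is reported and skipped, and the run carries on with the next one. This is plain Python 3 standard library code, not a patch against websubmit_icon_creator.py. The function name `run_batch` and the `(label, args)` job format are hypothetical, and the actual command line used for icon creation is deliberately left to the caller.

```python
import subprocess
import sys


def run_batch(jobs, timeout):
    """Run each (label, args) job in turn; return the labels that failed or timed out."""
    failed = []
    for label, args in jobs:
        try:
            # On timeout, subprocess.run kills the child and waits for it
            # before raising TimeoutExpired, so the loop is never blocked.
            result = subprocess.run(args, capture_output=True, timeout=timeout)
        except subprocess.TimeoutExpired:
            print("TIMEOUT after %ss: %s" % (timeout, label), file=sys.stderr)
            failed.append(label)
            continue
        if result.returncode != 0:
            print("FAILED (exit %d): %s" % (result.returncode, label), file=sys.stderr)
            failed.append(label)
    return failed
```

With the record ID as the label, the returned list names the documents that need a manual look, and the messages on stderr provide the log trace that is missing today. Note that `subprocess.run` only kills the direct child. If the tool spawns helpers of its own, killing the whole process group would be the safer variant.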