Avoid placing temporary errors related to communication with Grobid as permanent faults in the cache
During extensive testing it turned out that all Grobid communication errors are stored as Faults in the cache, so the empty metadata extracted for a given PDF ends up permanently stored in the cache.
The issues were caused by insufficient processing capacity of the Grobid instance(s), which has already been addressed in the following ways:
- altering the Grobid k8s configuration (auto-scaling, improved memory config etc.)
- restricting the number of metadataextraction tasks run in parallel (by relying on a dedicated queue)
This does not guarantee that occasional hiccups won't occur during processing.
Therefore we should introduce the following improvements in the GrobidClient and MetadataExtractorMapper classes:
- understanding HTTP response codes other than 200, so that temporary failures can be told apart from permanent ones
- introducing a retry mechanism for whenever an error occurs while communicating with the Grobid server
** to be controlled with the grobid_server_throttle_sleep_time and grobid_server_max_retries_count input parameters
- all temporary issues should be logged and reflected in an appropriate metric (import.metadataExtraction.processed.transientError)
- every transient error, due to its temporary nature, is expected to be logged only and not written as a permanent Fault. This way any subsequent run of the cache_builder workflow will have a chance to pick up the given PDF document and retry the metadata extraction process
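
To make the proposal more concrete, the retry and classification logic could look roughly like the sketch below. It is only an illustration under stated assumptions, not the actual GrobidClient or MetadataExtractorMapper code: the class GrobidRetrySketch, the GrobidCall interface, GrobidHttpException and the Result type are all hypothetical names. The sketch also assumes that HTTP 408, 429 and 5xx responses, as well as I/O errors, count as transient, that grobid_server_throttle_sleep_time is expressed in milliseconds, and that the transientError metric is incremented once per document whose retries were exhausted.

```java
import java.io.IOException;
import java.util.logging.Logger;

/** Hypothetical sketch of retrying Grobid calls; all names here are illustrative. */
final class GrobidRetrySketch {
    private static final Logger LOG = Logger.getLogger(GrobidRetrySketch.class.getName());

    enum Outcome { SUCCESS, TRANSIENT_ERROR, PERMANENT_ERROR }

    /** Hypothetical exception carrying the HTTP status returned by the Grobid server. */
    static final class GrobidHttpException extends Exception {
        final int statusCode;
        GrobidHttpException(int statusCode) {
            super("Grobid responded with HTTP " + statusCode);
            this.statusCode = statusCode;
        }
    }

    /** Hypothetical abstraction of a single request to the Grobid server. */
    interface GrobidCall {
        String execute() throws GrobidHttpException, IOException;
    }

    static final class Result {
        final Outcome outcome;
        final String body; // non-null only for SUCCESS
        Result(Outcome outcome, String body) {
            this.outcome = outcome;
            this.body = body;
        }
    }

    /** Assumption: 408, 429 and every 5xx code are temporary; other non-200 codes are permanent. */
    static Outcome classify(int statusCode) {
        if (statusCode == 200) return Outcome.SUCCESS;
        if (statusCode == 408 || statusCode == 429 || statusCode >= 500) return Outcome.TRANSIENT_ERROR;
        return Outcome.PERMANENT_ERROR;
    }

    private final long throttleSleepTimeMillis; // grobid_server_throttle_sleep_time (unit assumed)
    private final int maxRetriesCount;          // grobid_server_max_retries_count
    private long transientErrorCount;           // would feed import.metadataExtraction.processed.transientError

    GrobidRetrySketch(long throttleSleepTimeMillis, int maxRetriesCount) {
        this.throttleSleepTimeMillis = throttleSleepTimeMillis;
        this.maxRetriesCount = maxRetriesCount;
    }

    /** Makes one initial attempt plus at most maxRetriesCount retries, sleeping between attempts. */
    Result extract(GrobidCall call) throws InterruptedException {
        for (int attempt = 0; ; attempt++) {
            try {
                return new Result(Outcome.SUCCESS, call.execute());
            } catch (GrobidHttpException e) {
                if (classify(e.statusCode) == Outcome.PERMANENT_ERROR) {
                    // The caller may store this one as a permanent Fault.
                    return new Result(Outcome.PERMANENT_ERROR, null);
                }
                LOG.warning("Transient Grobid error, attempt " + (attempt + 1) + ": " + e.getMessage());
            } catch (IOException e) {
                // Connection problems are assumed to be temporary as well.
                LOG.warning("Grobid I/O error, attempt " + (attempt + 1) + ": " + e.getMessage());
            }
            if (attempt >= maxRetriesCount) {
                // Retries exhausted: count it and report it, but never as a permanent Fault.
                transientErrorCount++;
                return new Result(Outcome.TRANSIENT_ERROR, null);
            }
            Thread.sleep(throttleSleepTimeMillis);
        }
    }

    long getTransientErrorCount() {
        return transientErrorCount;
    }
}
```

With something along these lines, the mapper would write a Fault to the cache only for PERMANENT_ERROR. A TRANSIENT_ERROR result would be logged and counted, leaving no cache entry, so the next cache_builder run picks the document up again.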