
Avoid placing temporary errors related to communication with Grobid as permanent faults in the cache

Open marekhorst opened this issue 2 months ago • 0 comments

Extensive testing revealed that all Grobid communication errors are stored as Faults in the cache. As a result, the empty metadata extracted for the affected PDF is stored in the cache permanently.

The issues were caused by insufficient processing capacity of the Grobid instance(s), which has already been addressed in the following ways:

  • altering the Grobid k8s configuration (auto-scaling, improved memory config, etc.)
  • restricting the number of metadataextraction tasks run in parallel (by relying on a dedicated queue)

This does not guarantee that occasional hiccups won't occur during processing.

Therefore, we should introduce the following improvements in the GrobidClient and MetadataExtractorMapper classes:

  • handling HTTP status codes other than 200
  • introducing a retry mechanism for errors that occur when communicating with the Grobid server
      ◦ to be controlled with the grobid_server_throttle_sleep_time and grobid_server_max_retries_count input parameters
  • all temporary issues should be logged and reflected in an appropriate metric (import.metadataExtraction.processed.transientError)
  • every transient error, due to its temporary nature, should only be logged and not written as a permanent Fault. This way, any subsequent run of the cache_builder workflow will have a chance to pick up the given PDF document and retry the metadata extraction process
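
To make the proposal more concrete, here is a minimal sketch of the retry and classification logic, not the actual IIS code. The class `GrobidRetrySketch`, the `Outcome` enum, and the set of status codes treated as transient are all assumptions made for illustration. The `maxRetries` and `sleepMillis` arguments stand in for grobid_server_max_retries_count and grobid_server_throttle_sleep_time (the millisecond unit is also an assumption).

```java
import java.util.function.IntSupplier;

// Hypothetical helper, not part of IIS: classifies the HTTP status of a
// Grobid call and retries the ones that look temporary.
final class GrobidRetrySketch {

    enum Outcome { SUCCESS, TRANSIENT_ERROR, PERMANENT_FAULT }

    // Assumption: these codes indicate a temporary condition worth retrying.
    static boolean isTransient(int status) {
        return status == 408 || status == 429
                || status == 502 || status == 503 || status == 504;
    }

    // 'request' performs one call to Grobid and returns the HTTP status code;
    // a RuntimeException is treated as a connection-level (transient) failure.
    static Outcome callWithRetries(IntSupplier request, int maxRetries, long sleepMillis) {
        for (int attempt = 0; ; attempt++) {
            boolean transientFailure;
            try {
                int status = request.getAsInt();
                if (status == 200) {
                    return Outcome.SUCCESS;
                }
                transientFailure = isTransient(status);
            } catch (RuntimeException e) {
                transientFailure = true;
            }
            if (!transientFailure) {
                // In this sketch every other status is written as a Fault.
                return Outcome.PERMANENT_FAULT;
            }
            if (attempt >= maxRetries) {
                // Caller would log the error and increment the transientError
                // metric, without writing a Fault to the cache.
                return Outcome.TRANSIENT_ERROR;
            }
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                return Outcome.TRANSIENT_ERROR;
            }
        }
    }
}
```

With this shape, the mapper would write a Fault only for `PERMANENT_FAULT`. For `TRANSIENT_ERROR` it would log the problem and increment import.metadataExtraction.processed.transientError, leaving the PDF to be picked up by the next cache_builder run.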

marekhorst, Oct 30 '25 16:10