
Prepare a metadata extraction cache remover workflow responsible for removing pre-selected cache entries, or update the cache_builder workflow to work in an "overwrite" mode

marekhorst opened this issue on Mar 24, 2025

This is a direct follow-up of Grobid-based metadata extraction integration (#1512).

Since rebuilding the whole metadata extraction cache from scratch is an extremely time-consuming task, we might need to focus on updating the cache rather than rebuilding it.

Apart from updating the cache for newly processed PDF files, which is already possible, we could also cherry-pick existing cache entries in order to replace them with new versions coming from Grobid. This could be useful for fixing all the identified cases of wrongly extracted PDF metadata.

In order to achieve that we need to create a dedicated workflow, similar to cache_builder to some extent, which will accept a list of OAIds of records to be dropped from the cache. It will not run Grobid at all; it will just remove whatever we want to be updated. The actual update will then be performed by a subsequent execution of the cache_builder workflow.

We can safely rely on the already existing building blocks (subworkflows) of the cache_builder workflow, with some minor tweaks and changes:

  • defining a new cache_remover uber-workflow based on the already existing cache_builder workflow
  • the new uber-workflow should accept an input_id datastore (similarly to cache_builder) which will provide the list of OAIds for which cached entries should be removed. This datastore will be created outside of the scope of this task and it could be built with an arbitrary Hive SQL query producing an Identifier avro datastore
  • replacing skip_extracted_without_meta with skip_extracted, defined as the transformers_metadataextraction_skip_extracted subworkflow, in order to get all the matched entries from the cache (which will be the subject of removal)
  • [potentially optional] extracting id fields (conveying checksums) from the ExtractedDocumentMetadata matched records retrieved from the cache
  • defining a new transformer script (it could be defined in PIG just to have all the scripts aligned and written in the same language; see the sketch after this list) responsible for preparing a new, updated version of the cache. It should have the following API:
    • input: existing cache datastore version to be updated
    • input: matched records coming from the cache (or just an Identifier datastore with the id field conveying the checksums to be removed from the existing version of the cache)
    • output: new version of the cache with all the matched entries removed
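
As an illustration, here is a minimal Pig sketch of such a removal transformer. The parameter names ($input_existing_cache, $input_id, $output_cache) and the assumption that both datastores are Avro-backed and expose an id field carrying the document checksum are hypothetical and would need to be aligned with the actual cache_builder conventions.

```pig
-- Minimal sketch only; parameter names and the 'id' field are assumptions.

-- existing version of the cache to be updated
cache = LOAD '$input_existing_cache' USING AvroStorage();

-- Identifier datastore with the checksums of the entries to be removed
ids_to_remove = LOAD '$input_id' USING AvroStorage();

-- anti-join: keep only the cache entries with no matching identifier
grouped = COGROUP cache BY id, ids_to_remove BY id;
retained = FILTER grouped BY IsEmpty(ids_to_remove);
new_cache = FOREACH retained GENERATE FLATTEN(cache);

-- new version of the cache with all the matched entries removed
STORE new_cache INTO '$output_cache' USING AvroStorage();
```

The same COGROUP/FILTER pattern works regardless of whether the second input carries the full matched ExtractedDocumentMetadata records or just an Identifier datastore, since only the id field takes part in the join.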

Alternatively, we could also consider running cache_builder in an "overwrite" mode. This would require significant changes to the already existing cache_builder workflow, so it would somewhat over-complicate it; on the other hand, it would simplify the cache update process to running a single updated workflow instead of two (the to-be-introduced cache_remover and the already existing cache_builder).
