Prepare a metadata extraction cache remover workflow responsible for removing pre-selected cache entries, or update the cache_builder workflow to work in an "overwrite" mode
This is a direct follow-up of Grobid-based metadata extraction integration (#1512).
Since rebuilding the whole metadata extraction cache from scratch is an extremely time-consuming task, we might need to focus on updating the cache rather than rebuilding it.
Apart from updating the cache for newly processed PDF files, which is already possible, we could also cherry-pick already existing cache entries in order to replace them with new versions coming from Grobid. This could be useful for fixing all the identified cases of metadata wrongly extracted from a PDF file.
In order to achieve that we need to create a dedicated workflow, similar to cache_builder to some extent, which will accept the list of OAIds of records to be dropped from the cache. It will not run Grobid at all; it will just remove whatever we want to be updated. The update itself will then be performed by a subsequent execution of the cache_builder workflow.
We can safely rely on the already existing building blocks (subworkflows) of the cache_builder workflow, with some minor tweaks and changes:
- defining a new `cache_remover` uber-workflow based on the already existing `cache_builder` workflow
- a new uber-workflow should accept an `input_id` datastore (similarly to `cache_builder`) which will provide the list of OAIds for which cached entries should be removed. This datastore will be created outside of the scope of this task and it could be built with an arbitrary Hive SQL query producing an `Identifier` avro datastore
- replacing `skip_extracted_without_meta` with `skip_extracted` defined as the `transformers_metadataextraction_skip_extracted` subworkflow in order to get all the matched entries from the cache (which will be the subject of removal)
- [potentially optional] extracting `id` fields (conveying checksums) from the `ExtractedDocumentMetadata` matched records retrieved from the cache
- defining a new transformer script (could be defined in PIG just to have all the scripts aligned and written in the same language) responsible for preparing a new, updated version of the cache (see the sketch after this list). It should have the following API:
  - input: existing cache datastore version to be updated
  - input: matched records coming from cache (or just an `Identifier` datastore with an `id` field conveying the checksums to be removed from the existing version of the cache)
  - output: new version of the cache with all the matched entries removed
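
The new transformer script could boil down to a simple anti-join. Below is a minimal PIG sketch, assuming the cache and the identifier datastores are plain Avro files matched on their `id` fields; the `$input_cache`, `$input_id` and `$output_cache` parameter names are placeholders, not existing cache_builder properties.

```pig
-- Minimal sketch of the cache-update transformer; parameter names are placeholders
-- and AvroStorage usage assumes the datastores are regular Avro files with embedded schemas.

cache = LOAD '$input_cache' USING AvroStorage();    -- existing ExtractedDocumentMetadata cache
to_remove = LOAD '$input_id' USING AvroStorage();   -- Identifier records: 'id' holds checksums to drop

-- anti-join: keep only those cache entries whose id has no match in the removal list
grouped = COGROUP cache BY id, to_remove BY id;
retained = FILTER grouped BY IsEmpty(to_remove);
new_cache = FOREACH retained GENERATE FLATTEN(cache);

STORE new_cache INTO '$output_cache' USING AvroStorage();
```

The COGROUP/IsEmpty pattern is just one way to express the anti-join; a LEFT OUTER JOIN followed by a null filter would work equally well.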
Alternatively, we could consider running cache_builder in an "overwrite" mode. This would require significant changes to the already existing cache_builder workflow, so it would somewhat over-complicate it; on the other hand, it would simplify the cache update process to running a single updated workflow instead of two (the to-be-introduced cache_remover and the already existing cache_builder).