Devise an automated procedure to export plaintexts

Open marekhorst opened this issue 7 years ago • 0 comments

As a result of this procedure we would like to obtain a datastore of documentId and text pairs. It should be possible to export this datastore to an external Hadoop cluster.
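A minimal sketch of what such a datastore could look like, assuming each record simply pairs a document identifier with its plaintext and is written as one JSON object per line. The `DocumentText` name, the field names and the JSON Lines encoding are assumptions made for illustration only; the actual IIS datastore format is not specified in this issue.

```python
import json
from dataclasses import dataclass, asdict
from typing import Iterable, Iterator


@dataclass(frozen=True)
class DocumentText:
    """Hypothetical record: one exported plaintext keyed by document id."""
    document_id: str
    text: str


def dump_records(records: Iterable[DocumentText]) -> Iterator[str]:
    """Serialize records as JSON Lines (one JSON object per line)."""
    for record in records:
        yield json.dumps(asdict(record), ensure_ascii=False)


def load_records(lines: Iterable[str]) -> Iterator[DocumentText]:
    """Parse JSON Lines back into records, skipping blank lines."""
    for line in lines:
        if line.strip():
            yield DocumentText(**json.loads(line))
```

A line-oriented format is used here only because it is easy to split and copy between clusters; any splittable container format would serve the same purpose.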

This could be achieved by defining a workflow that binds the already available IIS modules:

  • iis-wf-import
  • iis-wf-metadataextraction
  • iis-wf-ingest-pmc
  • iis-wf-transformers

The workflow itself could be defined inside the iis-wf-metadataextraction module.

Technically, it will extract data from the underlying PDF and XML-PMC caches. We could reuse already existing workflows and supplement them with an exporting module that produces the outcome in the desired format. The dedicated workflow could be based on the already existing primary_import workflow definition.
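The combination of the two text sources could be sketched as below. This is only an illustration: the `merge_caches` helper is hypothetical, both caches are assumed to yield (document id, text) pairs, and joining texts that share an identifier is an assumed policy, since the issue does not say how such collisions should be handled.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

Pair = Tuple[str, str]  # (document_id, text)


def merge_caches(pdf_texts: Iterable[Pair], pmc_texts: Iterable[Pair]) -> Iterator[Pair]:
    """Union of both caches, one output pair per document id.

    Assumed policy: when the same id appears more than once, the texts are
    joined with a newline, PDF texts first, then XML-PMC texts.
    """
    merged: Dict[str, List[str]] = {}
    for source in (pdf_texts, pmc_texts):
        for doc_id, text in source:
            merged.setdefault(doc_id, []).append(text)
    for doc_id, texts in merged.items():
        yield doc_id, "\n".join(texts)
```

On a real cluster this step would run as a distributed job rather than in memory; the sketch only shows the intended shape of the data flow.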

We should support the following input parameters:

  • match_content_with_metadata - flag indicating that contents should be filtered against metadata entries retrieved from InformationSpace. Publication identifiers will be replaced by deduplicated ones, so only contents that have a metadata representation will be exported. When disabled, hbase_dump_location may be omitted.
  • hbase_dump_location - InfoSpace dump location (may point to a remote cluster), required for filtering contents against their metadata representatives and mapping original identifiers to deduplicated ones
  • objectstore_service_location - ObjectStore service location
  • approved_objectstores_csv - predefined set of ObjectStores to be handled
  • ingest_pmc_cache_location - XML PMC texts cache location
  • metadataextraction_cache_location - PDF texts cache location
  • metadataextraction_excluded_checksums - set of excluded PDF checksums; should be defined in the config-default.xml file placed on the IIS cluster instead of being provided at runtime
  • output_remote_location - desired output location, may point to an external cluster (e.g. DM)
  • reports_external_path - local IIS cluster path where the processing metrics should be stored
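The dependency between match_content_with_metadata and hbase_dump_location, and the filtering it enables, could be sketched as follows. Both helpers are hypothetical. Treating every other parameter as mandatory is an assumption (metadataextraction_excluded_checksums is left out because it is expected to come from config-default.xml), and `id_mapping` stands in for an original-to-deduplicated identifier mapping assumed to be derived from the InfoSpace dump.

```python
from typing import Iterable, Iterator, Mapping, Tuple

# Assumed to be mandatory on every run; the issue does not state this explicitly.
REQUIRED = (
    "objectstore_service_location",
    "approved_objectstores_csv",
    "ingest_pmc_cache_location",
    "metadataextraction_cache_location",
    "output_remote_location",
    "reports_external_path",
)


def validate_params(params: Mapping[str, str]) -> None:
    """Raise ValueError listing missing parameters, if any."""
    missing = [name for name in REQUIRED if not params.get(name)]
    matching = params.get("match_content_with_metadata", "false").lower() == "true"
    # The dump location is only needed when metadata matching is enabled.
    if matching and not params.get("hbase_dump_location"):
        missing.append("hbase_dump_location")
    if missing:
        raise ValueError("missing parameters: " + ", ".join(missing))


def filter_and_remap(
    contents: Iterable[Tuple[str, str]],
    id_mapping: Mapping[str, str],
) -> Iterator[Tuple[str, str]]:
    """Keep only contents with a metadata record and swap in deduplicated ids."""
    for original_id, text in contents:
        dedup_id = id_mapping.get(original_id)
        if dedup_id is not None:
            yield dedup_id, text
```

With matching disabled, `filter_and_remap` would simply be skipped and the original identifiers exported unchanged.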

marekhorst, Oct 10 '18 12:10