Devise an automated procedure to export plaintexts
As a result of this procedure we would like to obtain a datastore of documentId and text pairs. It should be possible to export such a datastore to an external Hadoop cluster.
This could be achieved by defining a workflow that binds the already available IIS modules:
- iis-wf-import
- iis-wf-metadataextraction
- iis-wf-ingest-pmc
- iis-wf-transformers
and we could define it inside the iis-wf-metadataextraction module.
Technically, it will extract data from the underlying PDF and XML-PMC caches. We could reuse the already existing workflows:
- import_infospace for importing the identifiers deduplication mapping
- importer_content_url_chain for importing the content URLs required by the metadataextraction and pmc_ingestion submodules
- metadataextraction_cache to handle both the PDF and XML-PMC caches
- transformers_metadataextraction_documenttext for plaintext extraction out of ExtractedDocumentMetadata cached records
and supplement them with an exporting module producing the outcome in the desired format. The dedicated workflow could be based on the already existing primary_import workflow definition.
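To illustrate what the exporting module could produce, here is a minimal Python sketch that writes documentId and text pairs as one JSON object per line. The output format is not decided in this issue, so JSON Lines is only an assumption, and the function name export_pairs is hypothetical rather than part of IIS.

```python
import io
import json
from typing import Iterable, TextIO, Tuple


def export_pairs(pairs: Iterable[Tuple[str, str]], out: TextIO) -> int:
    """Write one JSON object per (documentId, text) pair, one per line.

    Returns the number of records written, which could feed a processing
    metric. JSON Lines is an assumed format, not one fixed by this issue.
    """
    count = 0
    for document_id, text in pairs:
        record = {"documentId": document_id, "text": text}
        out.write(json.dumps(record) + "\n")
        count += 1
    return count


if __name__ == "__main__":
    # Hypothetical sample data; real input would come from the cached records.
    buffer = io.StringIO()
    written = export_pairs([("doc-1", "first text"), ("doc-2", "second text")], buffer)
    print(written)
    print(buffer.getvalue(), end="")
```

A line-oriented format like this keeps every record independent, so the output can be split across files on the target cluster without any extra framing.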
We should support the following input parameters:
- match_content_with_metadata: flag indicating that contents should be filtered against metadata entries retrieved from InformationSpace. Publication identifiers will be replaced by deduplicated ones. This way only contents having a metadata representation will be exported. When disabled, hbase_dump_location may be omitted.
- hbase_dump_location: InfoSpace dump location (may point to a remote cluster), required for filtering contents against their metadata representatives and for mapping original identifiers into deduplicated ones
- objectstore_service_location: ObjectStore service location
- approved_objectstores_csv: predefined set of ObjectStores to be handled
- ingest_pmc_cache_location: XML PMC texts cache location
- metadataextraction_cache_location: PDF texts cache location
- metadataextraction_excluded_checksums: set of excluded PDF checksums; should be defined in the config-default.xml file placed on the IIS cluster instead of being provided at runtime
- output_remote_location: desired output location, may point to an external cluster (e.g. DM)
- reports_external_path: local IIS cluster path where the processing metrics should be stored
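The dependency between match_content_with_metadata and hbase_dump_location, and the filtering it controls, can be sketched in Python as follows. This is only an illustration: ExportParams, validate and filter_and_deduplicate are hypothetical names, and the deduplication mapping is assumed to be a dictionary from original to deduplicated identifiers that contains only documents with a metadata representation.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, Optional, Tuple


@dataclass
class ExportParams:
    # Hypothetical container; field names mirror two of the listed parameters.
    match_content_with_metadata: bool
    hbase_dump_location: Optional[str] = None


def validate(params: ExportParams) -> None:
    """Require the InfoSpace dump only when metadata matching is enabled."""
    if params.match_content_with_metadata and not params.hbase_dump_location:
        raise ValueError(
            "hbase_dump_location is required when "
            "match_content_with_metadata is enabled"
        )


def filter_and_deduplicate(
    texts: Iterable[Tuple[str, str]],
    dedup_mapping: Dict[str, str],
    match_content_with_metadata: bool,
) -> Iterator[Tuple[str, str]]:
    """Yield (documentId, text) pairs according to the flag.

    Assumption: dedup_mapping holds only identifiers that have a metadata
    representation, so a missing key means the content is filtered out.
    """
    for document_id, text in texts:
        if not match_content_with_metadata:
            yield document_id, text  # export everything, identifiers unchanged
        elif document_id in dedup_mapping:
            yield dedup_mapping[document_id], text  # deduplicated identifier
```

With the flag disabled, every pair passes through with its original identifier, which is why the dump location is not needed in that case.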