Marek Horst
Currently the information generated by the `AffOrgMatchVoterStrengthEstimatorAndTest` is logged at the `TRACE` level, which is not enabled by default. This makes working on re-estimation of the voters' strength rather...
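One way to surface this output without raising the global log level would be a per-logger override, assuming the module uses a standard Log4j properties configuration (the logger name below is illustrative, not the actual package path):

```properties
# Hypothetical log4j.properties fragment: enable TRACE output only for the
# voter-strength estimator class, leaving the root logger level untouched.
log4j.logger.eu.dnetlib.iis.AffOrgMatchVoterStrengthEstimatorAndTest=TRACE
```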
It turned out that if the first attempt of the `CachedWebCrawlerJob` failed due to a shuffle service connectivity issue: ``` 2024-11-16 22:38:14,911 [shuffle-client-6-1] ERROR org.apache.spark.network.client.TransportResponseHandler - Still have 1 requests outstanding when...
This could be considered a #1475 follow-up, because citation matching was the last module still written in Spark 1.6. The following properties are defined in `workflow.xml` files: ``` spark2ExtraListeners com.cloudera.spark.lineage.NavigatorAppListener spark 2.*...
Originally requested in: https://support.openaire.eu/issues/10757. The goal is to integrate the Data Availability Statement (DAS) text-mining module for the Uppsala (SciLifeLab) tender.
The TEI record produced by Grobid includes, apart from the publication metadata, the version of Grobid responsible for creating a given TEI XML record: ``` GROBID - A machine...
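For orientation, in Grobid's TEI output this version information sits in the header's `<encodingDesc>` section; a minimal illustrative fragment (attribute values are placeholders, not taken from an actual record):

```xml
<teiHeader>
  <encodingDesc>
    <appInfo>
      <!-- Grobid records its own version and run timestamp here -->
      <application version="X.Y.Z" ident="GROBID" when="...">
        <desc>GROBID - A machine...</desc>
      </application>
    </appInfo>
  </encodingDesc>
</teiHeader>
```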
Avoid placing temporary errors related to communication with Grobid as permanent faults in the cache
During extensive tests it turned out that all Grobid communication-related errors are stored as `Fault`s in the cache, which causes the empty metadata extracted for a given PDF to be permanently stored...
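A minimal sketch of the intended distinction, under the assumption that communication failures can be recognized from the exception cause chain (class and method names below are hypothetical, not the actual IIS API):

```java
import java.net.ConnectException;
import java.net.SocketTimeoutException;

// Hypothetical sketch: decide whether a Grobid extraction failure should be
// cached as a permanent Fault or left uncached so the PDF is retried later.
public class FaultCachePolicy {

    /**
     * Returns true only for errors that describe the PDF itself,
     * not the transient state of the Grobid service.
     */
    public static boolean isCacheable(Throwable error) {
        // Walk the cause chain looking for transient network failures.
        for (Throwable t = error; t != null; t = t.getCause()) {
            if (t instanceof ConnectException || t instanceof SocketTimeoutException) {
                return false; // temporary connectivity issue: do not cache
            }
        }
        return true; // assume a genuine, reproducible extraction fault
    }
}
```

With such a guard in place, only reproducible extraction faults would end up persisted, while connectivity hiccups would leave the cache untouched.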
Currently the `TeiToExtractedDocumentMetadataTransformer`, working on top of the Grobid TEI XML output, parses the authors defined in the bibliographic reference section by traversing the XML `author` subelement: ``` Biosynthesis of...
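The traversal can be sketched as follows; this is an illustrative stand-alone DOM version (namespace handling omitted), not the actual transformer code:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Illustrative sketch of traversing <author> subelements in a TEI fragment;
// the real TeiToExtractedDocumentMetadataTransformer logic differs.
public class TeiAuthorSketch {

    public static List<String> extractAuthors(String teiXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(teiXml)));
            List<String> authors = new ArrayList<>();
            NodeList nodes = doc.getElementsByTagName("author");
            for (int i = 0; i < nodes.getLength(); i++) {
                Element author = (Element) nodes.item(i);
                // Combine forename and surname into a single display name.
                String forename = textOf(author, "forename");
                String surname = textOf(author, "surname");
                authors.add((forename + " " + surname).trim());
            }
            return authors;
        } catch (Exception e) {
            throw new RuntimeException("TEI parsing failed", e);
        }
    }

    private static String textOf(Element parent, String tag) {
        NodeList list = parent.getElementsByTagName(tag);
        return list.getLength() > 0 ? list.item(0).getTextContent() : "";
    }
}
```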
Since context profiles were removed from the D-Net Information System, we can completely remove the legacy ISLookup-based concepts importer and make the newly introduced streaming-API-based importer a...
This is a #1560 follow-up. Grobid-based metadata extraction needs to set an appropriate `extractedBy` field value. Important remark: exception handling for Grobid-based metadata extraction results in setting an empty record....
Originally requested in Redmine: https://support.openaire.eu/issues/9871#note-10 The idea is to implement and integrate a workflow responsible for: * reading HTML landing pages from tar.gz packages stored by the PDF Aggregation System...