Marek Horst

Results 82 issues of Marek Horst

This feature should improve the performance of madis scripts and was requested on redmine: [#4177#note-2](https://issue.openaire.research-infrastructures.eu/issues/4177#note-2). Currently, python commands defined in:

* `workflow.xml` files
* `*DBBuilder.java` Java code
* [classify_documents.sh](https://github.com/openaire/iis/blob/master/iis-wf/iis-wf-documentsclassification/src/main/resources/eu/dnetlib/iis/wf/documentsclassification/oozie_app/lib/scripts/classify_documents.sh) script

rely on...

activity: impl
functionality: referenceextraction

We should consider moving the `mapreduce.task.timeout` configuration out of the `workflow.xml` file into an externally defined `config-default.xml` file, to allow easy reconfiguration without modifying any IIS code. In the...

Originally reported on redmine: [#1756#note-181](https://issue.openaire.research-infrastructures.eu/issues/1756#note-181). Apparently the web crawler module, which is responsible for providing HTML pages describing software, needs to be supplemented with JavaScript code execution. After inspecting HTML pages...

activity: impl
functionality: referenceextraction

As a result of this procedure we would like to obtain a datastore with `documentId` and `text` pairs. It should be possible to export such a datastore to an external Hadoop cluster. This...

activity: impl
functionality: export
functionality: metadataextraction

Integrate the #842 pull request. Originally requested in redmine: [#3269#note-11](https://issue.openaire.research-infrastructures.eu/issues/3269#note-11).

activity: impl
functionality: referenceextraction

Originally requested in redmine ticket: [#3162](https://issue.openaire.research-infrastructures.eu/issues/3162). We should introduce a blacklisting mechanism as part of the IIS exporting phase. This would affect the outcome of the following inference modules: * research initiatives...

activity: impl
functionality: export

This task is mostly about running the first experiments involving Grobid. We could start by implementing a Grobid-based equivalent of `pl.edu.icm.cermine.ContentExtractor` for metadata and plaintext extraction and make it possible to run...

functionality: import

This was originally described in redmine: https://support.openaire.eu/issues/5203#note-29. While importing another batch of origins from the SH endpoint, it turned out the process fails after receiving `SocketTimeoutException` even though the timeout was...

activity: bug
functionality: import

Apart from handling `SocketTimeoutException` properly by allowing the retry mechanism to kick in, I added two unit tests proving this mechanism works as expected. Adding one error log line...
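
The general idea of letting a retry mechanism handle `SocketTimeoutException` can be sketched roughly as below. This is only an illustration and not the actual IIS code: the class name `RetrySketch`, the method `callWithRetry` and the parameters `maxRetries` and `delayMillis` are all hypothetical.

```java
import java.net.SocketTimeoutException;
import java.util.concurrent.Callable;

public class RetrySketch {

    /**
     * Runs the action and retries it when it fails with SocketTimeoutException.
     * Hypothetical helper: at most maxRetries additional attempts are made,
     * with a fixed pause of delayMillis between attempts.
     */
    static <T> T callWithRetry(Callable<T> action, int maxRetries, long delayMillis) throws Exception {
        int failures = 0;
        while (true) {
            try {
                return action.call();
            } catch (SocketTimeoutException e) {
                if (failures >= maxRetries) {
                    // Retries exhausted: propagate the timeout to the caller.
                    throw e;
                }
                failures++;
                // One error log line per timed-out attempt.
                System.err.println("Timeout on attempt " + failures + ", retrying: " + e.getMessage());
                Thread.sleep(delayMillis);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated endpoint: times out twice, then succeeds.
        String result = callWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new SocketTimeoutException("Read timed out");
            }
            return "ok";
        }, 5, 10L);
        System.out.println(result + " after " + calls[0] + " calls");
    }
}
```

A unit test for such a mechanism would typically use a stub that times out a fixed number of times, as the simulated endpoint in `main` does, and then check both the returned value and the number of calls made.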

Integrate the #1473 pull request.

functionality: referenceextraction