Rewrite citation matching algorithm in Spark 2.4
The citation matching algorithm is currently written for Spark 1.6, as part of the CoAnSys module:
https://github.com/CeON/CoAnSys/tree/master/citation-matching/citation-matching-core-code
We should rewrite the code for Spark 2.4 (used by all the other Spark modules in IIS) so that we can set timeouts (such as spark.shuffle.registration.timeout, which is available since Spark 2.3) and take advantage of performance improvements.
At present, the citation matching algorithm cannot be run on a graph of the current size because of timeouts related to the shuffle server.
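
For illustration, here is a minimal sketch of how the shuffle registration settings could be raised once the code runs on Spark 2.3 or later. The object name, the application name and the concrete values are assumptions made for this example, not tuned recommendations. `spark.shuffle.registration.maxAttempts` is a related Spark setting that is not mentioned above.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Hypothetical entry point; the real job would live in the rewritten module.
object CitationMatchingConfSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("citation-matching-sketch") // illustrative name
      // Timeout in milliseconds for registering with the external shuffle service.
      // The value is an assumption; it would need tuning on the actual cluster.
      .set("spark.shuffle.registration.timeout", "120000")
      // Number of registration attempts before giving up (illustrative value).
      .set("spark.shuffle.registration.maxAttempts", "5")

    // The master is expected to be supplied by spark-submit.
    val spark = SparkSession.builder().config(conf).getOrCreate()
    try {
      // Print the effective setting so it can be checked in the driver log.
      println(spark.sparkContext.getConf.get("spark.shuffle.registration.timeout"))
    } finally {
      spark.stop()
    }
  }
}
```

The same settings could instead be passed at submission time with `--conf` options, which avoids hard-coding cluster-specific values in the job.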