Marek Horst
Dear Dominika, this is somewhat related to issue #32, reported some time ago. I have just found yet another file blocking CERMINE execution: https://arxiv.org/pdf/1804.09018.pdf where setting timeout parameter (#7...
Originally reported in: https://github.com/openaire/iis/issues/1326 The document similarity algorithm fails when run on a non-deduplicated OpenAIRE Graph containing 300M publications (the deduplicated graph included 200M). After an in-depth inspection covered by...
Setting the `mapreduce.task.timeout` value to `7200000` (milliseconds, i.e. 2 hours) in `sim1-postprocess-s1-e1-filter-sims.pig`.
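A minimal sketch of how such a setting is typically expressed in a Pig script, assuming the standard Pig `SET` statement is used to pass the Hadoop property to the jobs the script launches. The placement at the top of the script is an assumption, not something confirmed by the issue.

```pig
-- Assumed placement: near the top of sim1-postprocess-s1-e1-filter-sims.pig,
-- before any statement that triggers a MapReduce job.
-- mapreduce.task.timeout is expressed in milliseconds: 7200000 ms = 2 hours.
SET mapreduce.task.timeout 7200000;
```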
After comparing two ranking scripts, [document-similarity-s1-rank_filter.pig](https://github.com/CeON/CoAnSys/blob/f0373f0416c93d3cff3a6ea137da09fba1481cad/document-similarity/document-similarity-logic/src/main/pig/document-similarity-s1-rank_filter.pig) and [document-similarity-s1-ship-rank_filter.pig](https://github.com/CeON/CoAnSys/blob/f0373f0416c93d3cff3a6ea137da09fba1481cad/document-similarity/document-similarity-logic/src/main/pig/document-similarity-s1-ship-rank_filter.pig), and inspecting the [document-similarity-s1-rank_filter.pig](https://github.com/CeON/CoAnSys/blob/f0373f0416c93d3cff3a6ea137da09fba1481cad/document-similarity/document-similarity-logic/src/main/pig/document-similarity-s1-rank_filter.pig) script more closely, it seems that `removal_least_used` is used improperly: it should be compared against the number...
This issue was originally reported in https://github.com/openaire/iis/issues/927 but since it requires changes in a CoAnSys Pig script I am reporting it once again here. Problems related to the Pig RANK operation were mitigated...
Some time ago an alternative approach to the `rank` operation was introduced: https://github.com/CeON/CoAnSys/blob/298863befc2f0e3a96b25a9ee53f6b53b41090a6/document-similarity/document-similarity-logic/src/main/pig/document-similarity-s1-ship-rank_filter.pig involving a custom rank operation written in the `rank.py` script, introduced in commit 318d88ce7509b366c5428ac22135bc05421ad088. An alternative Oozie execution path could...
After inspecting several Pig scripts, it seems there are plenty of redundant HDFS writes followed by immediate reads, which could be optimized. Apparently this was used for debugging purposes so...
I am working on IIS dependencies cleanup and I realised we have one special exclusion case (among 2-3 others): excluding Hadoop libraries from CoAnSys dependencies used in IIS to prevent...