Reduce IIS execution time
Overview
This is an umbrella issue tracking the general task of reducing IIS execution time.
Description
IIS execution time currently ranges up to two days. Most of this time is probably spent executing madis scripts; revising mining performance is tracked in a dedicated ticket (4177). The goal of this issue is to change IIS workflows so that mining time is reduced with the current versions of the madis scripts.
Solutions
Caching
This solution should bring major gains in reducing execution time. The idea is to decouple the execution of long-running subworkflows from the execution of the parent workflow. Subworkflows could be executed asynchronously, with their results cached. The parent workflow would then pick up the cached results and use them for further processing.
One of the first candidates for caching is TARA reference extraction, which takes up to 20 h.
Spark2
Spark2 brings new optimizations to the execution of DataFrame and Dataset jobs. The idea is to rewrite Spark RDD jobs to use Datasets/DataFrames and to take advantage of optimization opportunities such as custom aggregators. This solution can bring some gains, but their actual size may depend on the details of each job. Some experimentation is probably needed to select the jobs that would actually benefit from a move to a Dataset/DataFrame implementation.
For clarification: caching solutions in IIS are not intended to be run asynchronously. Caching is intended to be a side effect of the normal execution of a job or workflow. The first execution should build the cache of results and processed data; subsequent executions should pick up the cached data and filter the input so that the job/workflow runs only on new data.
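The cache-as-side-effect behaviour described above can be illustrated with a minimal, language-neutral sketch. It is not the IIS implementation: the names `content_key` and `run_with_cache`, the JSON file used as cache storage, and the content hash used as cache key are all assumptions made for illustration only.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def content_key(record: dict) -> str:
    """Hypothetical cache key: a stable hash of the record's content."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def run_with_cache(records, process, cache_path: Path):
    """Run `process` only on records missing from the cache.

    Returns the results for all records (cached and fresh) and the
    number of records that actually had to be processed in this run.
    """
    # Load results cached by earlier executions, if any exist.
    cache = json.loads(cache_path.read_text()) if cache_path.exists() else {}

    # Filter the input so that only previously unseen records are processed.
    new_records = [r for r in records if content_key(r) not in cache]
    for record in new_records:
        cache[content_key(record)] = process(record)

    # Updating the cache is a side effect of the normal run.
    cache_path.write_text(json.dumps(cache))

    results = [cache[content_key(r)] for r in records]
    return results, len(new_records)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        cache_file = Path(tmp) / "cache.json"
        docs = [{"id": 1, "text": "a"}, {"id": 2, "text": "b"}]
        # First execution: the cache is empty, so every record is processed.
        print(run_with_cache(docs, lambda r: r["text"].upper(), cache_file))
        # Second execution: only the newly added record is processed.
        docs.append({"id": 3, "text": "c"})
        print(run_with_cache(docs, lambda r: r["text"].upper(), cache_file))
```

In this sketch the expensive step (`process`) stands in for a long-running job such as reference extraction. Keying the cache on content rather than on a record identifier is one possible design choice: a record whose content changes is treated as new data and reprocessed.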
TARA project reference extraction caching:
- issue #1099 - a sub-issue of this issue with detailed information regarding caching
- PR #1098 - implementation of TARA project reference extraction with cache