Horizon Lite: Improve the performance and functionality of the batch-based indexer.
Context
There are several necessary improvements to the existing map-reduce batch job for index creation:
- Poor performance: the performance of `reduce` is low when the target/source index is remote, for example S3 (jobs don't complete, running forever and churning slowly in the account/tx merging routines).
- Low visibility on performance: there's a lack of visibility into I/O rates due to missing metrics and logging.
- Lack of flexibility: the `reduce` job operates on all modules, even if the map job only specified one module.
Suggestions
- In the tx index merge routine, perform a query against the 'source' index to check whether a map job's output has a `tx/` folder, and skip iterating all 255 tx prefixes if it does not. (This happens when map was configured to not include `transactions` in its `MODULES`.) See the first sketch after this list.
- We can change the entire map/reduce flow to use a shared persistent volume across all workers, then upload the volume to the remote store once at the end (see the sync sketch after this list):
  - have all `map` jobs write to a single on-disk volume or source of storage,
  - have the `reduce` jobs merge them together into the same on-disk source,
  - have a final step upload/sync that disk to the remote `target` index.
- On account index merging, pre-download all of the 'source' index's map job account summary files and load them into a map of `job_id:account_id -> true/false`; the worker -> account -> read-all-map-jobs-for-account loop can then check for account presence first and avoid iterative network trips to the remote 'source' index that would return empty responses anyway (see the presence-map sketch after this list).
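A minimal sketch of the first suggestion, assuming a hypothetical `IndexStore` wrapper around the remote 'source' index; the real Horizon Lite interfaces and names may differ:

```go
package indexer

import (
	"context"
	"fmt"
)

// NumTxPrefixes mirrors the 255 per-prefix tx files the map jobs produce.
const NumTxPrefixes = 255

// IndexStore is a stand-in for whatever abstraction wraps the remote
// 'source' index (S3, file, etc.).
type IndexStore interface {
	// FolderExists reports whether the given map job wrote any output
	// under the named folder (e.g. "tx/").
	FolderExists(ctx context.Context, jobID, folder string) (bool, error)
	ReadTxPrefix(ctx context.Context, jobID string, prefix byte) ([]byte, error)
}

func mergeTxIndex(ctx context.Context, store IndexStore, jobID string) error {
	// One cheap query up front instead of 255 guaranteed-empty remote reads
	// when the map job was configured without the transactions module.
	ok, err := store.FolderExists(ctx, jobID, "tx/")
	if err != nil {
		return fmt.Errorf("checking tx/ folder for job %s: %w", jobID, err)
	}
	if !ok {
		return nil // map job didn't include transactions in its MODULES
	}

	for prefix := 0; prefix < NumTxPrefixes; prefix++ {
		data, err := store.ReadTxPrefix(ctx, jobID, byte(prefix))
		if err != nil {
			return err
		}
		_ = data // merge into the target index here
	}
	return nil
}
```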
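For the shared-volume flow, only the final sync step touches the network. A sketch of that step, assuming a hypothetical `Uploader` client (in practice this could be the AWS SDK or an `aws s3 sync` invocation):

```go
package indexer

import (
	"context"
	"io/fs"
	"path/filepath"
)

type Uploader interface {
	// Upload copies one local file to the same relative key in the
	// remote 'target' index.
	Upload(ctx context.Context, localPath, remoteKey string) error
}

// syncVolume walks the shared on-disk volume once, after all map and reduce
// work has finished locally, and pushes every file to the remote target
// index. All intermediate merging stayed on local disk, so this is the only
// bulk network transfer in the whole flow.
func syncVolume(ctx context.Context, up Uploader, localRoot string) error {
	return filepath.WalkDir(localRoot, func(path string, d fs.DirEntry, err error) error {
		if err != nil || d.IsDir() {
			return err
		}
		rel, err := filepath.Rel(localRoot, path)
		if err != nil {
			return err
		}
		return up.Upload(ctx, path, filepath.ToSlash(rel))
	})
}
```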
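And a sketch of the account-presence prefetch, assuming each map job wrote a small summary file listing the accounts it touched; `SummaryReader` and the key format are illustrative, not the real code:

```go
package indexer

import "context"

type SummaryReader interface {
	// AccountIDs returns every account the given map job produced output
	// for, read from its pre-downloaded summary file.
	AccountIDs(ctx context.Context, jobID string) ([]string, error)
}

// buildPresence loads all summaries once, up front, into a
// job_id:account_id lookup table so the merge loop never issues a remote
// read that is known to come back empty.
func buildPresence(ctx context.Context, r SummaryReader, jobIDs []string) (map[string]bool, error) {
	present := make(map[string]bool)
	for _, job := range jobIDs {
		accounts, err := r.AccountIDs(ctx, job)
		if err != nil {
			return nil, err
		}
		for _, acct := range accounts {
			present[job+":"+acct] = true
		}
	}
	return present, nil
}

// In the worker -> account -> read-all-map-jobs-for-account loop:
//
//	if !present[job+":"+acct] {
//	    continue // skip the network round trip entirely
//	}
```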
Acceptance Criteria
It's entirely possible that this task can/should be broken down into many sub-tasks based on the above suggestions, but the general criteria for completion should be:
- [ ] Add more metrics output, such as upload times, on both the map and reduce jobs (see the timing sketch after this list).
- [ ] The reduce job does not do unnecessary work if the `map` job did not apply all modules - per the first suggestion above.
- [ ] The performance of the reduce batch job is significantly improved - per all three suggestions.
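A minimal sketch of the kind of timing output the first criterion asks for, wrapping an upload in a duration log; a real implementation would presumably use the project's existing logging/metrics stack rather than the standard library logger:

```go
package indexer

import (
	"context"
	"log"
	"time"
)

// timedUpload wraps any upload func so both the map and reduce jobs report
// how long each remote write took and its effective throughput.
func timedUpload(ctx context.Context, key string, size int64, upload func(context.Context) error) error {
	start := time.Now()
	err := upload(ctx)
	elapsed := time.Since(start)
	log.Printf("upload key=%s bytes=%d took=%s rate=%.2f MiB/s err=%v",
		key, size, elapsed, float64(size)/elapsed.Seconds()/(1<<20), err)
	return err
}
```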
We should also consider using something other than S3, since we may not end up using S3 in production (for cost reasons).
@Shaptic @2opremio, I re-worded the acceptance criteria per the scrum feedback to make this ticket's scope S3-agnostic and more about optimizing regardless of the 'target' index's interface (S3, file, others..)