mindbender
Single-threaded jq can be a bottleneck for ES indexing
We could backport this change to parallelize indexing: https://github.com/HazyResearch/mindbender/commit/bc869e855b62104928506d15611fb2329c786b12
It simply uses parallel instead of split. These improvements for a backport would be great:
- check for presence of parallel
- configurable parallelization params
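The two improvements above could look roughly like the following. This is only a sketch, not what the linked commit does: the environment variable names INDEX_JOBS and INDEX_BLOCK and the function name are hypothetical, and the jq command is a placeholder. The parallel flags used (--pipe, -j, --block) are standard GNU parallel options.

```python
import os
import shutil


def build_indexing_command(worker_cmd, env=os.environ, which=shutil.which):
    """Return an argv that fans stdin out to several copies of worker_cmd.

    worker_cmd: argv list for the per-chunk command (e.g. a jq filter).
    INDEX_JOBS and INDEX_BLOCK are hypothetical environment variable names.
    """
    jobs = int(env.get("INDEX_JOBS", os.cpu_count() or 1))
    block = env.get("INDEX_BLOCK", "10M")
    if which("parallel") is None:
        # GNU parallel is not installed: fall back to the single-threaded command.
        return list(worker_cmd)
    # --pipe splits stdin into blocks and feeds each block to one worker,
    # -j caps the number of concurrent workers, --block sets the chunk size.
    return ["parallel", "--pipe", "-j", str(jobs), "--block", block] + list(worker_cmd)
```

With INDEX_JOBS=4 and parallel on the PATH, `build_indexing_command(["jq", "-c", "."])` yields `parallel --pipe -j 4 --block 10M jq -c .`; without parallel it degrades to plain `jq -c .`.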
Here is another indexing speed optimization: https://github.com/HazyResearch/mindbender/commit/8d4169ab6784236f21e3caf7c794830f54b66357
Thanks for the suggestions! Yeah, I was anticipating we'd need parallel indexing pretty soon. I had a bad experience with GNU parallel (it was unstable, bloated, and its CLI changed too much across versions), but I will backport these soon, maybe using the more familiar xargs or embedding an exact version of parallel.
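For the xargs route, one possible shape is to split the input into chunk files first and let xargs run one worker per file. A minimal sketch, assuming the chunk file names arrive on stdin; the function name and the worker script name are hypothetical. Note that -P is supported by GNU and BSD xargs but is not part of POSIX.

```python
def build_xargs_command(worker_cmd, jobs):
    """Return an argv that runs worker_cmd once per file name read from stdin.

    worker_cmd: argv list of the per-file command; xargs appends the file name.
    jobs: maximum number of workers to run at the same time.
    """
    if jobs < 1:
        raise ValueError("jobs must be at least 1")
    # -P sets the maximum number of concurrent processes,
    # -n 1 passes exactly one file name to each invocation.
    return ["xargs", "-P", str(jobs), "-n", "1"] + list(worker_cmd)
```

For example, `build_xargs_command(["./index-one-chunk.sh"], 4)` gives `xargs -P 4 -n 1 ./index-one-chunk.sh`, which would be fed a list of chunk paths, one per line.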
Side question: After parallelizing, is there any sign of ES being the new bottleneck? Would adding more nodes to the ES cluster help? The keep-elasticsearch-during currently launches an isolated single-node ES server, but we could enhance it and introduce a subcommand like mindbender search join-cluster to make it easy to scale out.
No, ES seems to have a very flexible thread pool scheme in one node and can saturate all cores. I suspect that even if there is only one shard, it's still able to saturate all cores. If hardware is the bottleneck, then yeah, we could add new node support.
I see. Sounds like the cluster size should be decided by the query-time latency requirements.
Another key performance knob is ES_HEAP_SIZE: https://www.elastic.co/guide/en/elasticsearch/guide/current/heap-sizing.html
But ES's default heap size is 0.25-1G. We may want to use a different default.
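One way to pick a different default is the rule of thumb from the heap-sizing guide linked above: give the heap about half of the machine's RAM and stay below 32 GB. A minimal sketch under those assumptions; the function name is hypothetical and the 31 GB cap is just a conservative choice for "below 32 GB".

```python
def suggest_es_heap_size(total_ram_mb, cap_mb=31 * 1024):
    """Suggest an ES_HEAP_SIZE value (e.g. "4096m") for a machine.

    total_ram_mb: physical memory in megabytes.
    cap_mb: upper bound for the heap; 31 GB is an assumed conservative cap.
    """
    if total_ram_mb <= 0:
        raise ValueError("total_ram_mb must be positive")
    # Half of RAM goes to the heap, the rest is left for the OS file cache.
    heap_mb = min(total_ram_mb // 2, cap_mb)
    return f"{heap_mb}m"
```

The launcher could then export the result only when the user has not already set ES_HEAP_SIZE, so an explicit setting always wins over the computed default.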