
Running with Parallelism on an AWS c5.4xlarge Instance Leads to OOM Errors

ShikharJ opened this issue on Apr 17, 2021 · 2 comments

Hey everyone,

This is an issue I faced while trying to generate the QPS vs Recall plots for ann-benchmarks: when building the algorithms with parallelism = 3, the available memory gets divided equally between the Docker containers running the three algorithms.

A typical AWS c5.4xlarge instance comes with 30 GB of total RAM. However, when building on top of the ann-benchmarks infrastructure, the maximum RAM availability I've seen is close to 20 GB. Hence 20 / 3 ≈ 6.67 GB is the RAM ultimately available to each algorithm while building and running.
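For concreteness, here is a minimal sketch of how such a per-container budget could be computed; it uses psutil and is only illustrative of the behaviour described above, not the exact logic in run.py:

```python
import psutil

def per_container_mem_limit(parallelism: int, margin_bytes: int = 500 * 2**20) -> int:
    """Split the currently available RAM evenly across `parallelism`
    containers, keeping a small safety margin for the host."""
    available = psutil.virtual_memory().available  # ~20 GB on the instance above
    return (available - margin_bytes) // parallelism

# With ~20 GB available and parallelism=3, each container
# gets roughly 6.5 GB instead of the full 20 GB.
print(per_container_mem_limit(parallelism=3))
```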

This leads to an Out-Of-Memory (OOM) error, wherein the Docker daemon kills the container running an algorithm's build if the required memory exceeds the available memory. Hence certain configurations of a memory-heavy algorithm never build to completion, and the QPS vs Recall plot ends up significantly different from the scenario where every algorithm gets to use the full 20 GB of RAM (in simpler words, where everything is run sequentially).
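For reference, the cap itself comes from Docker's per-container memory limit. Here's a hedged sketch of the mechanism using docker-py (`mem_limit` is a real docker-py parameter; the image name and command below are hypothetical, not the actual ann-benchmarks invocation):

```python
import docker

client = docker.from_env()

# A container started with a hard mem_limit is SIGKILLed as soon as its
# build/run exceeds the cap, which surfaces as exit code 137 (128 + 9).
container = client.containers.run(
    "ann-benchmarks-hnswlib",    # hypothetical image name
    "python3 run_algorithm.py",  # hypothetical command
    mem_limit="6g",              # the ~6.67 GB per-container budget
    detach=True,
)
```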

For this reason, I believe the QPS vs Recall plots are slightly misleading in their current state, since certain build configs don't even run to completion, which distorts the final Pareto-optimal plot. I believe this also explains why so many dataset plots don't contain results from the full set of algorithms (one example being this). Attaching a screenshot for reference.

[Screenshot: OOM-killed containers]

Exit code 137 indicates a container killed by the OOM killer (128 + SIGKILL). Command: sudo python3 run.py --dataset glove-100-angular --parallelism 3
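One way to confirm which containers were OOM-killed, rather than failed for some other reason, is to inspect the exit status and the OOMKilled flag that Docker records. A sketch using docker-py:

```python
import docker

client = docker.from_env()

# 137 = 128 + SIGKILL(9); Docker also records an explicit OOMKilled flag.
for c in client.containers.list(all=True, filters={"status": "exited"}):
    state = c.attrs["State"]
    if state["ExitCode"] == 137 or state.get("OOMKilled"):
        print(f"{c.name}: killed by the OOM killer")
```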

ShikharJ avatar Apr 17 '21 19:04 ShikharJ

Thanks so much, @ShikharJ. I think this is indeed a huge problem. What do you think, @erikbern? It seems the r6g-type instances might be better for running the benchmark in the future?

maumueller avatar Apr 19 '21 16:04 maumueller

Last time I ran the benchmarks, some fraction of the containers crashed because of OOM issues, and I personally don't think that's a super big deal. I'd rather we stay liberal with the algo definitions so that a larger range of parameter combinations can be included.

But if 50% of all algos crash, then maybe we can run on a bigger instance type next time; I'm open to that.

erikbern avatar Apr 27 '21 02:04 erikbern