[Bug] Memory leak in QueryFrontend when all user requests miss the SLO
To reproduce this bug, follow the steps below.
Step 1: Start the hello-world sample (cache_size=0, slo_micros=100)
from clipper_admin import ClipperConnection, DockerContainerManager
from clipper_admin.deployers import python as python_deployer
def feature_sum(xs):
    return [str(sum(x)) for x in xs]
clipper_conn = ClipperConnection(DockerContainerManager())
clipper_conn.start_clipper(cache_size=0) # Disable PredictionCache
clipper_conn.register_application(name="hello-world", input_type="doubles", default_output="-1.0", slo_micros=100)
python_deployer.deploy_python_closure(clipper_conn, name="sum-model", version=1, input_type="doubles", func=feature_sum)
clipper_conn.link_model_to_app(app_name="hello-world", model_name="sum-model")
Step 2: Log in to the QueryFrontend container
$ docker exec -it <QueryFrontend_Container_ID> /bin/bash
Step 3: Check the current memory usage (approx. 19 MiB)
root@6ddeb1430e8b:/# cat /sys/fs/cgroup/memory/memory.stat | grep -w rss
rss 20230144
Step 4: Send many requests to the QueryFrontend
import requests, json, numpy as np
headers = {"Content-type": "application/json"}
for i in range(10000):
    requests.post("http://localhost:1337/hello-world/predict", headers=headers, data=json.dumps({"input": list(np.random.random(10))})).json()
Step 5: Check that all the requests missed the SLO (every response is the default output)
.....
{u'use_default': True, u'output': -1.0, u'query_id': 9996, u'default_explanation': u'Failed to retrieve a prediction response within the specified latency SLO'}
{u'use_default': True, u'output': -1.0, u'query_id': 9997, u'default_explanation': u'Failed to retrieve a prediction response within the specified latency SLO'}
{u'use_default': True, u'output': -1.0, u'query_id': 9998, u'default_explanation': u'Failed to retrieve a prediction response within the specified latency SLO'}
{u'use_default': True, u'output': -1.0, u'query_id': 9999, u'default_explanation': u'Failed to retrieve a prediction response within the specified latency SLO'}
>>>
Step 6: Check the current memory usage again (approx. 39 MiB)
root@6ddeb1430e8b:/# cat /sys/fs/cgroup/memory/memory.stat | grep -w rss
rss 41529344
I think the TaskExecutor's PredictionCache grows continuously because only PredictionCache::insert_entry() is ever called, never PredictionCache::evict_entries().
Eviction should be triggered from put() when the cache buffer is full, but here the entries keep accumulating even though the cache size is 0.
https://github.com/ucbrise/clipper/blob/80c97d27a38d60caaebb2a1ae6a995dd7ff1c82d/src/libclipper/src/task_executor.cpp#L78
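To make the suspected pattern concrete, here is a minimal, self-contained sketch. It is not Clipper's actual implementation; the class and the insert_entry()/put()/evict_entries() names only mirror the ones mentioned above. It shows how a cache that inserts an entry for every query but only evicts from put() grows without bound when put() is never reached, even with a cache size of 0.

#include <cstddef>
#include <iostream>
#include <string>
#include <unordered_map>

// Toy stand-in for the suspected PredictionCache behaviour (illustrative only):
// insert_entry() adds an entry for every query, while evict_entries() only
// runs from put(). If put() is never called, nothing is ever reclaimed,
// regardless of the configured cache size.
class ToyCache {
 public:
  explicit ToyCache(std::size_t max_size) : max_size_(max_size) {}

  void insert_entry(long key) {
    entries_.emplace(key, "pending");  // placeholder until put() fills it
  }

  void put(long key, const std::string& output) {
    entries_[key] = output;
    evict_entries();  // eviction only happens on the put() path
  }

  std::size_t size() const { return entries_.size(); }

 private:
  void evict_entries() {
    while (entries_.size() > max_size_) {
      entries_.erase(entries_.begin());
    }
  }

  std::size_t max_size_;
  std::unordered_map<long, std::string> entries_;
};

int main() {
  ToyCache cache(0);  // cache_size=0, as in the repro above
  for (long query_id = 0; query_id < 10000; ++query_id) {
    cache.insert_entry(query_id);
    // put() never runs because the task is dropped before execution,
    // so evict_entries() never runs either and the map keeps growing.
  }
  std::cout << "entries: " << cache.size() << std::endl;  // prints 10000
  return 0;
}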
I’ll investigate it.
@simon-mo : Our team has hit this error case in a real environment, and I'm continuing to debug it. Here is what I found:
- A new task is added to the queue.
  [info] [TASKEXE...] Adding task to queue. QueryID: 32, model: sum-model:1
- This code removes the new task, so the batch returned by get_batch() has zero size (a toy illustration follows this list). https://github.com/ucbrise/clipper/blob/e0b12200ac8b3aeadd8a0ce037e6a96790b4eaab/src/libclipper/include/clipper/task_executor.hpp#L146
- So we get this error message:
  [error] [TASKEXE...] ModelQueue returned empty batch for model sum-model:1, replica 1f5fa694c905
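As a toy illustration of the second point above (Task and get_batch() here are simplified stand-ins of my own, not Clipper's internals), a 100-microsecond SLO means the task's deadline has usually already elapsed by the time the batching thread runs, so every task is dropped and the returned batch is empty.

#include <chrono>
#include <deque>
#include <iostream>
#include <thread>
#include <vector>

using Clock = std::chrono::system_clock;

// Toy task: just an id and a deadline derived from the SLO.
struct Task {
  long query_id;
  Clock::time_point deadline;
};

// Toy version of get_batch(): tasks whose deadline has already elapsed are
// popped before the batch is assembled, mirroring the effect of
// remove_tasks_with_elapsed_deadlines().
std::vector<Task> get_batch(std::deque<Task>& queue) {
  auto now = Clock::now();
  while (!queue.empty() && queue.front().deadline <= now) {
    queue.pop_front();  // dropped: this task will never be executed
  }
  std::vector<Task> batch(queue.begin(), queue.end());
  queue.clear();
  return batch;
}

int main() {
  std::deque<Task> queue;
  // slo_micros=100: the deadline is only 100 us away when the task is queued.
  queue.push_back({32, Clock::now() + std::chrono::microseconds(100)});

  // By the time the batching thread gets to the queue, the deadline is gone.
  std::this_thread::sleep_for(std::chrono::milliseconds(1));

  std::cout << "batch size: " << get_batch(queue).size() << std::endl;  // 0
  return 0;
}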
PredictionCache::put() was never called because the new task was never executed.
I suggest that when we remove tasks with elapsed deadlines, we also remove the related <key, CacheEntry> item from the PredictionCache.
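A rough, self-contained sketch of that suggestion (the Task fields, the cache map, and the cleanup call inside remove_tasks_with_elapsed_deadlines() are toy versions of my own, not Clipper's actual code): when a task is dropped for missing its deadline, the pending cache entry created for it is erased too, so the cache stays bounded.

#include <chrono>
#include <deque>
#include <iostream>
#include <string>
#include <unordered_map>

using Clock = std::chrono::system_clock;

struct Task {
  long cache_key;  // hypothetical: the key fetch() used for the cache entry
  Clock::time_point deadline;
};

// Pending entries keyed the same way the PredictionCache would be.
std::unordered_map<long, std::string> cache;

// Sketch of the suggestion: when a task's deadline has elapsed, erase the
// <key, CacheEntry> pair that was created for it instead of leaving it behind.
void remove_tasks_with_elapsed_deadlines(std::deque<Task>& queue) {
  auto now = Clock::now();
  while (!queue.empty() && queue.front().deadline <= now) {
    cache.erase(queue.front().cache_key);  // hypothetical cleanup step
    queue.pop_front();
  }
}

int main() {
  std::deque<Task> queue;
  for (long key = 0; key < 10000; ++key) {
    cache.emplace(key, "pending");          // fetch() creates a pending entry
    queue.push_back({key, Clock::now()});   // deadline already elapsed
  }
  remove_tasks_with_elapsed_deadlines(queue);
  std::cout << "cache entries left: " << cache.size() << std::endl;  // 0
  return 0;
}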
@simon-mo : This patch is just a workaround.
diff --git a/src/libclipper/include/clipper/task_executor.hpp b/src/libclipper/include/clipper/task_executor.hpp
index 3f4f4710..840363fb 100644
--- a/src/libclipper/include/clipper/task_executor.hpp
+++ b/src/libclipper/include/clipper/task_executor.hpp
@@ -145,10 +145,12 @@ class ModelQueue {
       std::shared_ptr<ModelContainer> requesting_container,
       std::function<BatchSizeInfo(Deadline)> &&get_batch_size) {
     std::unique_lock<std::mutex> lock(queue_mutex_);
-    remove_tasks_with_elapsed_deadlines();
+    // - Clipper issue : https://github.com/ucbrise/clipper/issues/549
+    // remove_tasks_with_elapsed_deadlines();
     queue_not_empty_condition_.wait(
         lock, [this]() { return !queue_.empty() || !valid_; });
-    remove_tasks_with_elapsed_deadlines();
+    // - Clipper issue : https://github.com/ucbrise/clipper/issues/549
+    // remove_tasks_with_elapsed_deadlines();
     std::vector<PredictTask> batch;
     if (requesting_container->is_active() && valid_) {
@withsmilo Is this issue handled in one of your recent PRs?
@rkooo567 No! I'm just applying the workaround patch to Clipper for now because this issue is complicated. After completing my open PRs, I will dig into it.