clipper
QueryFrontend's memory usage increases continuously
Over the past week I have repeatedly deployed many different versions of the same model to our cluster using Clipper. As a result, QueryFrontend's memory usage has increased continuously (95MB -> 250MB). I think QueryFrontend's ActiveContainers list is growing quickly because old ModelContainers are never deleted from it.
Is this intentional?
It's intended behavior but not desired behavior. #378 detects when old containers are no longer connected and GCs them from the active containers list, which should address this issue.
@dcrankshaw : Thanks for your reply. For serving stability, I think this issue is critical. We will apply some workaround fixes until #378 is merged.
I would like to share some test results with you. My Clipper's QueryFrontend now keeps only the latest ModelContainer in the ActiveContainers list (containers_), but its memory usage still increases continuously.
What do you think the cause is? I think RPCService's connections_containers_map is growing continuously. Thank you!
That's a good point. @amogkam is going to look into cleaning up the connections map now that #378 got merged.
@withsmilo #378 should actually remove inactive containers from both the active_containers list and connections_containers_map.
If the problem is persistent even with this PR merged, then I can dig into it further, but for now it looks like everything should be good.
@dcrankshaw @amogkam : Thank you very much for the quick review and merge of #378. However, after merging #378 into my branch, I found the following new problems.
1. QueryFrontend sometimes crashed with a segmentation fault.
How about you? I cannot be certain whether this problem also exists on Clipper's develop branch, because my branch differs slightly from it. To fix the crash, I found this SO post and modified RPCService::check_container_activity() as shown below, and the crash finally disappeared.
void RPCService::check_container_activity(
    std::unordered_map<std::vector<uint8_t>, ConnectedContainerInfo,
                       std::function<size_t(const std::vector<uint8_t> &vec)>>
        &connections_containers_map) {
  std::chrono::system_clock::time_point current_time =
      std::chrono::system_clock::now();
  // Keys of stale entries are collected here and erased only after the
  // loop, so the map is never modified while it is being iterated.
  std::vector<std::vector<uint8_t>> needs_removing;
  // Iterate by const reference to avoid copying each key/value pair.
  for (const auto &entry : connections_containers_map) {
    const auto &container_info = entry.second;
    if (std::chrono::duration_cast<std::chrono::milliseconds>(
            current_time - std::get<2>(container_info))
            .count() > CONTAINER_ACTIVITY_TIMEOUT_MILLS) {
      /** If the time elapsed since we last received anything from the
          container exceeds the threshold, call the
          inactive_container_callback_. */
      VersionedModelId vm = std::get<0>(container_info);
      int replica_id = std::get<1>(container_info);
      GarbageCollectionThreadPool::submit_job(inactive_container_callback_, vm,
                                              replica_id);
      log_info(LOGGING_TAG_RPC, "lost contact with a container");
      needs_removing.push_back(entry.first);
    }
  }
  for (const auto &key : needs_removing) {
    connections_containers_map.erase(key);
  }
}
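The function above defers the erase until the loop has finished. Assuming the crash really came from erasing map entries while a range-based for loop was still walking the map (the thread does not show the pre-fix code), an alternative is to iterate explicitly and continue from the iterator that `erase` returns. The sketch below is not Clipper code: `ContainerInfo`, `remove_stale`, and the `int` key are hypothetical stand-ins chosen to keep the example self-contained.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for ConnectedContainerInfo; only the timestamp
// of the last message matters for this sketch.
struct ContainerInfo {
  std::string model_name;
  std::chrono::steady_clock::time_point last_activity;
};

// Erases every entry whose last activity is older than `timeout` and
// returns how many entries were removed.
std::size_t remove_stale(std::unordered_map<int, ContainerInfo> &connections,
                         std::chrono::steady_clock::time_point now,
                         std::chrono::milliseconds timeout) {
  assert(timeout.count() >= 0);
  std::size_t removed = 0;
  for (auto it = connections.begin(); it != connections.end();) {
    if (now - it->second.last_activity > timeout) {
      // erase() invalidates only the erased iterator and returns the
      // next valid one, so the loop can continue safely.
      it = connections.erase(it);
      ++removed;
    } else {
      ++it;
    }
  }
  return removed;
}
```

Both approaches avoid using an invalidated iterator. The deferred erase is easier to combine with work that must happen per stale entry (such as submitting the callback), while the iterator form avoids copying keys.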
2. Useless VersionedModelId entries remain in active_containers.
Let's remove them as shown below.
void ActiveContainers::remove_container(VersionedModelId model,
                                        std::string replica_id) {
  ....
  assert(containers_[model].size() == initialSize - 1);
  // Once the last replica is gone, drop the now-empty per-model entry.
  if (containers_[model].size() == 0) {
    log_info_formatted(
        LOGGING_TAG_CONTAINERS,
        "All containers of model: {}, version: {} are removed. Remove itself",
        model.get_name(), model.get_id());
    containers_.erase(model);
  }
  log_active_containers();
}
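One detail worth keeping in mind with the snippet above: `operator[]` on a `std::unordered_map` inserts a default-constructed value when the key is absent, so a lookup for a model that was never registered would itself create an empty entry. A minimal sketch of the same cleanup using `find`, which never inserts, is shown below. It is an illustration only: `Container`, `ReplicaMap`, `ContainerTable`, and `remove_replica` are hypothetical names, and the model is keyed by a plain string rather than `VersionedModelId`.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <unordered_map>

struct Container {};  // hypothetical stand-in for ModelContainer

// Hypothetical simplification: a "name:version" string replaces
// VersionedModelId as the outer key.
using ReplicaMap = std::map<std::string, std::shared_ptr<Container>>;
using ContainerTable = std::unordered_map<std::string, ReplicaMap>;

// Removes one replica and, if it was the last one for its model, the
// model's entry as well. Returns false if nothing was removed.
bool remove_replica(ContainerTable &table, const std::string &model,
                    const std::string &replica_id) {
  auto model_it = table.find(model);  // find() never inserts
  if (model_it == table.end()) {
    return false;
  }
  ReplicaMap &replicas = model_it->second;
  const std::size_t initial_size = replicas.size();
  if (replicas.erase(replica_id) == 0) {
    return false;
  }
  assert(replicas.size() == initial_size - 1);
  if (replicas.empty()) {
    table.erase(model_it);  // drop the now-empty per-model entry
  }
  return true;
}
```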
3. After calling GarbageCollectionThreadPool::submit_job, QueryFrontend's CPU usage spikes (up to 99%).
Is this a bug in GarbageCollectionThreadPool?
4. QueryFrontend's memory usage STILL increases continuously.
I found that the following data structures need to be cleaned up.
- ActiveContainers
  - ~~containers_ : std::unordered_map<VersionedModelId, std::map<std::string, std::shared_ptr<ModelContainer>>>~~
  - batch_sizes_ : std::unordered_map<VersionedModelId, int>
- TaskExecutor
  - model_queues_ : std::unordered_map<VersionedModelId, std::shared_ptr<ModelQueue>>
  - model_metrics_ : std::unordered_map<VersionedModelId, ModelMetrics>
- RPCService
  - ~~connections_containers_map : std::unordered_map<std::vector<uint8_t>, std::pair<VersionedModelId, string>, std::function<size_t(const std::vector<uint8_t> &vec)>>~~
- TaskExecutionThreadPool
  - queues_ : std::unordered_map<size_t, ThreadSafeQueue<std::unique_ptr<IThreadTask>>>
  - threads_ : std::unordered_map<size_t, std::thread>

and I am digging into others...
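The common thread in the list above is that several maps are keyed by model, and each of them keeps its entry after the model's last container is gone. One way to keep such maps in step, sketched here under heavy simplification, is a single cleanup routine that erases the key from every per-model map at once. Everything in this block is hypothetical (`PerModelState`, `forget_model`, string keys, empty `ModelQueue` and `ModelMetrics` structs); in Clipper the maps live in different classes and would need their own locking, and any worker thread would have to be stopped and joined before its entry is erased.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <unordered_map>

struct ModelQueue {};    // hypothetical stand-in
struct ModelMetrics {};  // hypothetical stand-in

// Hypothetical aggregate of per-model bookkeeping, keyed by a
// "name:version" string instead of VersionedModelId.
struct PerModelState {
  std::unordered_map<std::string, int> batch_sizes;
  std::unordered_map<std::string, std::shared_ptr<ModelQueue>> model_queues;
  std::unordered_map<std::string, ModelMetrics> model_metrics;

  // Erases `model` from every map and returns the number of entries
  // that were actually removed (0 to 3).
  std::size_t forget_model(const std::string &model) {
    const std::size_t removed = batch_sizes.erase(model) +
                                model_queues.erase(model) +
                                model_metrics.erase(model);
    // After cleanup no map may still reference the model.
    assert(batch_sizes.count(model) == 0);
    assert(model_queues.count(model) == 0);
    assert(model_metrics.count(model) == 0);
    return removed;
  }
};
```

Calling such a routine from the same place that removes the last container of a model would keep the auxiliary maps from outliving the containers they describe.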
@withsmilo Thank you for the detailed response and for digging into this issue! The changes that you made make sense to me. Could you add these changes to check_container_activity and remove_container to your PR?
@amogkam :
My PR currently solves only 3 (the GarbageCollectionThreadPool bug), but I will add code to it to solve 1 (the check_container_activity bug) and 2 (the remove_container bug).
@amogkam : Sorry for the delay. I pushed some commits to my PR. QueryFrontend's memory usage STILL increases continuously, so I am continuing to dig into this issue.