QueryFrontend's memory usage increases continuously

Open withsmilo opened this issue 6 years ago • 9 comments

I have repeatedly deployed many different versions of the same model to our cluster using Clipper for a week. As a result, QueryFrontend's memory usage increased continuously (95MB -> 250MB). I think QueryFrontend's ActiveContainers list is growing quickly, but old ModelContainers are never deleted from it.

Is this intentional?

withsmilo avatar Apr 13 '18 00:04 withsmilo

It's intended behavior but not desired behavior. #378 detects when old containers are no longer connected and GCs them from the active containers list, which should address this issue.

dcrankshaw avatar Apr 15 '18 05:04 dcrankshaw

@dcrankshaw : Thanks for your reply. For serving-side stability, I think this issue is critical. We will apply some workaround fixes until #378 is merged.

withsmilo avatar Apr 15 '18 13:04 withsmilo

I would like to share some test results with you. My Clipper's QueryFrontend keeps only the latest ModelContainer in the ActiveContainers list (containers_), but QueryFrontend's memory usage still increases continuously.

Where do you think the cause is? ==> I think that RPCService's connections_containers_map keeps growing. Thank you!

withsmilo avatar Apr 18 '18 13:04 withsmilo

That's a good point. @amogkam is going to look into cleaning up the connections map now that #378 got merged.

dcrankshaw avatar Apr 20 '18 20:04 dcrankshaw

@withsmilo #378 should actually remove inactive containers from both the active_containers list and connections_containers_map.

If the problem persists even with this PR merged, I can dig into it further, but for now it looks like everything should be good.

amogkam avatar Apr 20 '18 21:04 amogkam

@dcrankshaw @amogkam : Thank you very much for the quick review and merge of PR #378. However, after merging #378 into my branch, I found the following new problems.

1. QueryFrontend sometimes crashed with a critical segmentation fault.

Have you seen this too? I cannot be certain whether this problem exists on the Clipper develop branch, because my branch differs slightly from it. Anyway, to fix this crash, I found a Stack Overflow post, modified the RPCService::check_container_activity() function as below, and the crash finally disappeared.

void RPCService::check_container_activity(
    std::unordered_map<std::vector<uint8_t>, ConnectedContainerInfo,
                       std::function<size_t(const std::vector<uint8_t> &vec)>>
        &connections_containers_map) {
  std::chrono::system_clock::time_point current_time =
      std::chrono::system_clock::now();

  std::vector<std::vector<uint8_t>> needs_removing;
  for (const auto &entry : connections_containers_map) {
    const auto &container_info = entry.second;
    if (std::chrono::duration_cast<std::chrono::milliseconds>(
            current_time - std::get<2>(container_info))
            .count() > CONTAINER_ACTIVITY_TIMEOUT_MILLS) {
      // If the time elapsed since we last received a message from this
      // container exceeds the threshold, submit the
      // inactive_container_callback_ and mark the entry for removal.
      VersionedModelId vm = std::get<0>(container_info);
      int replica_id = std::get<1>(container_info);
      GarbageCollectionThreadPool::submit_job(inactive_container_callback_, vm,
                                              replica_id);

      log_info(LOGGING_TAG_RPC, "lost contact with a container");
      needs_removing.push_back(entry.first);
    }
  }
  // Erase after the loop so we never invalidate iterators mid-iteration.
  for (const auto &key : needs_removing) {
    connections_containers_map.erase(key);
  }
}

2. Useless VersionedModelId entries remain in active_containers.

Let's remove them as below.

void ActiveContainers::remove_container(VersionedModelId model,
                                        std::string replica_id) {
....
  assert(containers_[model].size() == initialSize - 1);

  if (containers_[model].size() == 0) {
    log_info_formatted(
        LOGGING_TAG_CONTAINERS,
        "All containers of model: {}, version: {} are removed. Remove itself",
        model.get_name(), model.get_id());
    containers_.erase(model);
  }

  log_active_containers();
}

3. After calling GarbageCollectionThreadPool::submit_job, QueryFrontend's CPU usage spikes dramatically (up to 99%).

Is this GarbageCollectionThreadPool's bug?

4. QueryFrontend's memory usage STILL increases continuously.

I found that the following data structures need to be cleaned up.

  • ActiveContainers
    • ~~containers_ : std::unordered_map<VersionedModelId, std::map<std::string, std::shared_ptr<ModelContainer>>>~~
    • batch_sizes_ : std::unordered_map<VersionedModelId, int>
  • TaskExecutor
    • model_queues_ : std::unordered_map<VersionedModelId, std::shared_ptr<ModelQueue>>
    • model_metrics_ : std::unordered_map<VersionedModelId, ModelMetrics>
  • RPCService
    • ~~connections_containers_map : std::unordered_map<std::vector<uint8_t>, std::pair<VersionedModelId, string>, std::function<size_t(const std::vector<uint8_t> &vec)>>~~
  • TaskExecutionThreadPool
    • queues_ : std::unordered_map<size_t, ThreadSafeQueue<std::unique_ptr<IThreadTask>>>
    • threads_ : std::unordered_map<size_t, std::thread>

and I am digging into the others...

withsmilo avatar Apr 21 '18 17:04 withsmilo

@withsmilo Thank you for the detailed response and for digging into this issue! The changes that you made make sense to me. Could you add these changes in check_container_activity and remove_container to your PR?

amogkam avatar Apr 25 '18 07:04 amogkam

@amogkam : My current PR solves only issue 3 (the GarbageCollectionThreadPool bug), but I will add code to it to fix issue 1 (the check_container_activity bug) and issue 2 (the remove_container bug) as well.

withsmilo avatar Apr 25 '18 15:04 withsmilo

@amogkam : Sorry for the late reply. I pushed some commits to my PR. I found that QueryFrontend's memory usage STILL increases continuously, so I am digging into this issue.

withsmilo avatar May 14 '18 02:05 withsmilo