
Persistent CAGRA kernel

Open achirkin opened this issue 1 year ago • 6 comments

An experimental version of the single-CTA CAGRA kernel that runs persistently while allowing many CPU threads to submit queries in small batches very efficiently.

[Plots: CAGRA throughput @ Recall = 0.94, n_queries = 1; CAGRA throughput @ Recall = 0.94, n_queries = 10]

API

In the current implementation, the public API does not change. An extra parameter persistent is added to ann::cagra::search_params (only valid when algo == SINGLE_CTA). The persistent kernel is managed by a global runner object in a shared_ptr; the first CPU thread to call the kernel spawns the runner, and subsequent calls/threads only update a global "heartbeat" atomic variable (runner_base_t::last_touch). When there has been no heartbeat for the last few seconds (kLiveInterval), the runner shuts down the kernel and cleans up the associated resources.
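For illustration, the runner lifecycle roughly follows a get-or-spawn pattern like the sketch below. This is a simplified stand-in, not the actual cuVS code: runner_t, get_runner, and the watchdog are placeholders; only last_touch and kLiveInterval correspond to names used above.

#include <atomic>
#include <chrono>
#include <memory>
#include <mutex>

// Placeholder runner type; only the heartbeat member mirrors the real
// runner_base_t::last_touch mentioned above.
struct runner_t {
  std::atomic<std::chrono::steady_clock::rep> last_touch{};
  runner_t() { /* spawn the persistent kernel on a dedicated thread/stream */ }
  ~runner_t() { /* ask the kernel to stop and release its resources */ }
};

constexpr auto kLiveInterval = std::chrono::milliseconds(2000);

inline auto get_runner() -> std::shared_ptr<runner_t>
{
  static std::mutex guard;
  static std::weak_ptr<runner_t> weak;
  std::lock_guard<std::mutex> lock(guard);
  auto r = weak.lock();
  if (!r) {  // the first calling thread spawns the runner
    r    = std::make_shared<runner_t>();
    weak = r;
  }
  // Every call refreshes the heartbeat; a watchdog (not shown) drops the last
  // shared_ptr once (now - last_touch) exceeds kLiveInterval, which shuts the
  // kernel down via the destructor.
  r->last_touch.store(std::chrono::steady_clock::now().time_since_epoch().count(),
                      std::memory_order_relaxed);
  return r;
}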

An alternative solution would be to control the kernel explicitly, in a client-server style. This would be more controllable, but would require significant re-thinking of the RAFT/cuVS API.

Synchronization behavior and CUDA streams

The kernel is managed in a dedicated thread & a non-blocking stream; it's independent of any other (i.e. calling) threads.

Although we pass a CUDA stream to the search function to preserve the API, this CUDA stream is never used; in fact, no CUDA API calls happen in the calling thread. All communication between the host calling thread and the GPU workers happens via atomic variables.

The search function blocks the CPU thread, i.e. it waits until the results are back before returning.

Exceptions and safety

The kernel runner object is stored in a shared pointer, so it provides the usual smart-pointer safety guarantees. For example, if a C++ exception is raised in the runner thread, the kernel is stopped when the runner (i.e. the last shared pointer) is destroyed.

It's hard to detect if something goes wrong with the kernel or the CUDA context. If the kernel does not return the results to the calling thread within the configured kernel lifetime (persistent_lifetime), the calling thread abandons the request and throws an exception. The intended behavior is that all components can shut down gracefully and independently within the configured kernel lifetime.
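The calling-thread side of this can be pictured roughly as a wait-with-deadline loop (a minimal sketch with made-up names; the real code communicates through the job descriptors rather than a bare flag):

#include <atomic>
#include <chrono>
#include <stdexcept>
#include <thread>

// The GPU worker flips `done` (e.g. via a host-visible atomic) once the results
// for this request have been written back.
inline void wait_for_results(const std::atomic<bool>& done, float persistent_lifetime)
{
  auto deadline = std::chrono::steady_clock::now() +
                  std::chrono::duration<float>(persistent_lifetime);
  while (!done.load(std::memory_order_acquire)) {
    if (std::chrono::steady_clock::now() > deadline) {
      // Abandon the request; the kernel is expected to shut down gracefully
      // within the same lifetime on its own side.
      throw std::runtime_error("persistent kernel did not respond in time");
    }
    std::this_thread::sleep_for(std::chrono::microseconds(50));
  }
}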

Integration notes

lightweight_uvector

RMM memory resources and device buffers are not zero-cost, even when the allocation size is zero (a common pattern for conditionally used buffers). They make at least a couple of cudaGetDevice calls during initialization. Normally, this overhead is negligible. However, when the number of concurrent threads is large (hundreds of threads), any CUDA call can become a bottleneck, because a single mutex guards a critical section somewhere in the driver.

To work around this, I introduce a lightweight_uvector in /detail/cagra/search_plan.cuh for several buffers used in the CAGRA kernels. This is a stripped-down, "multi-device-unsafe" version of rmm::uvector: it does not check during resize/destruction whether the current device has changed since construction. We may consider putting this in a common folder to use across other RAFT/cuVS algorithms.
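As a rough illustration of the idea (not the actual lightweight_uvector; this stand-in uses the plain CUDA runtime allocator instead of RMM and does not preserve contents on resize):

#include <cuda_runtime_api.h>
#include <cstddef>

// A trivially small device buffer: no cudaGetDevice bookkeeping, no check that
// the current device still matches the one used at construction time.
template <typename T>
struct lightweight_uvector_sketch {
  T* data       = nullptr;
  std::size_t n = 0;

  void resize(std::size_t new_n, cudaStream_t stream)
  {
    if (new_n == n) { return; }
    if (data != nullptr) { cudaFreeAsync(data, stream); }
    data = nullptr;
    n    = new_n;
    if (n > 0) { cudaMallocAsync(reinterpret_cast<void**>(&data), n * sizeof(T), stream); }
  }

  ~lightweight_uvector_sketch()
  {
    // NB: assumes the current device is still the one the buffer lives on.
    if (data != nullptr) { cudaFree(data); }
  }
};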

Shared resource queues / ring buffers

resource_queue_t is an atomic-counter-based ring buffer used to distribute the worker resources (CTAs) and the pre-allocated job descriptors across CPU I/O threads. We may consider putting it in a common public namespace in raft if we envision more uses for it.
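The idea can be sketched roughly as follows (an illustrative stand-in, not the actual resource_queue_t; it assumes the number of handles in flight never exceeds the queue capacity and that the sentinel is never a valid handle):

#include <array>
#include <atomic>
#include <cstdint>

// Fixed-capacity MPMC ring buffer of small integer handles (e.g. worker/CTA ids
// or job-descriptor indices) shared between many CPU I/O threads.
// Capacity must be a power of two so that the 32-bit counters can wrap freely.
template <uint32_t Capacity>
struct resource_queue_sketch {
  static_assert((Capacity & (Capacity - 1)) == 0, "Capacity must be a power of two");
  static constexpr uint32_t kEmpty = 0xffffffffu;  // sentinel: slot not filled yet

  std::array<std::atomic<uint32_t>, Capacity> buf;
  std::atomic<uint32_t> head{0};  // total number of pushes so far
  std::atomic<uint32_t> tail{0};  // total number of pops so far

  resource_queue_sketch()
  {
    for (auto& x : buf) { x.store(kEmpty, std::memory_order_relaxed); }
  }

  void push(uint32_t value)
  {
    auto slot = head.fetch_add(1, std::memory_order_acq_rel) % Capacity;
    buf[slot].store(value, std::memory_order_release);
  }

  // Claims the next slot and spins until its value has been published.
  uint32_t pop()
  {
    auto slot = tail.fetch_add(1, std::memory_order_acq_rel) % Capacity;
    uint32_t value;
    do {
      value = buf[slot].exchange(kEmpty, std::memory_order_acq_rel);
    } while (value == kEmpty);
    return value;
  }
};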

Persistent runner structs

launcher_t and persistent_runner_base_t look like they could be abstracted away from the CAGRA kernel and reused in other algorithms. The code in its current state, however, is not ready for this.

Adjusted benchmarks

  1. I introduced a global temporary buffer for keeping intermediate results (e.g. neighbor candidates before refinement). This is needed to avoid unnecessary allocations alongside the persistent kernel (and it also positively affects the performance of the original, non-persistent implementation).
  2. I adjusted cuvs common benchmark utils to avoid extra d2h copies and syncs during refinement.

achirkin avatar Jul 08 '24 14:07 achirkin

Update on benchmarks: after a bit of algorithm tweaking, the QPS/sleep behavior now seems rather stable for both the single-query case and other batch sizes. I added the plots to the PR description. Even with 8k threads on a 32-core machine, the system stays responsive and the cumulative CPU load stays at 90%-100% (i.e. all threads don't suddenly go to sleep as before). To strain the system even more, I reduced the CTA size to 64 threads (i.e. gridDim > 2k), which increased the maximum QPS by 2x.

So far I've tested it with a few different configurations (H100/A10/RTX3090Ti), and it's more or less stable. With this, I'm not sure which compile-time constants in the code should be exposed. Here's the summary:

// Artificially reduce the grid size by a constant < 1. Ideally we shouldn't need this, but this may be useful to make the system a bit more responsive (e.g. on workstations where the GPU is used for other purposes as well).
double kDeviceUsage = 1.0;
// The time to live for the kernel: if no calls in this time happened, the kernel stops.
// It may be useful to expose this in the production setting to balance high-percentile-latency vs CPU/GPU load.
auto kLiveInterval = std::chrono::milliseconds(2000);
// The size of the jobs queue; it should not be smaller than the max expected thread count.
// It could be moved to the search parameters, but:
//    1. All arrays would have to become std::vectors, and a few compile-time checks would no longer be possible
//    2. There's no harm in setting this to a large value, such as 8192
uint32_t kMaxJobsNum = 8192;
// The size of the worker queue = the max number of CTAs.
// The same two arguments apply.
uint32_t kMaxWorkersNum           = 4096;
// How many CTAs/workers can be held by one IO thread at the same time & starting with how many
// the IO thread "should be more eager" to return them. I believe these should not be exposed, as their
// performance impact is unknown.
uint32_t kMaxWorkersPerThread     = 256;
uint32_t kSoftMaxWorkersPerThread = 16;
// The default time to sleep. Maybe it makes sense to expose this, but maybe not.
auto kDefaultLatency = std::chrono::nanoseconds(50000);
// This is the base for computing the maximum time a thread is allowed to sleep.
// It should be a reasonable multiple of the default latency.
auto kMaxExpectedLatency = kDefaultLatency * std::max<std::uint32_t>(10, kMaxJobsNum / 128);

// -----------------------------------------------------------------------
// The rest are constants local to functions, which, I think, are too specific for anyone to adequately use or change

// Exponential moving average constant for tracking the configuration-specific expected latency
  inline ~launcher_t() noexcept  // NOLINT
  {
    // bookkeeping: update the expected latency to wait more efficiently later
    constexpr size_t kWindow = 100;  // moving average memory
    expected_latency         = std::min<std::chrono::nanoseconds>(
      ((kWindow - 1) * expected_latency + now - start) / kWindow, kMaxExpectedLatency);
  }

  [[nodiscard]] inline auto sleep_limit() const
  {
    constexpr auto kMinWakeTime  = std::chrono::nanoseconds(10000);
    constexpr double kSleepLimit = 0.6;
    return start + expected_latency * kSleepLimit - kMinWakeTime;
  }

  [[nodiscard]] inline auto calc_pause_factor(uint32_t n_queries) const -> uint32_t
  {
    constexpr uint32_t kMultiplier = 10;
    return kMultiplier * raft::div_rounding_up_safe(n_queries, idle_worker_ids.capacity());
  }

  inline void pause()
  {
    // Don't sleep for this many attempts, hoping for a smoother run
    constexpr auto kSpinLimit = 3;
    // It doesn't make much sense to sleep less than this
    constexpr auto kPauseTimeMin = std::chrono::nanoseconds(1000);
    // Bound the sleeping time
    constexpr auto kPauseTimeMax = std::chrono::nanoseconds(50000);
    ...
  }

achirkin avatar Jul 19 '24 18:07 achirkin

Thanks, Tamas, for the review round. I've exposed the two parameters you suggested, so these are the new parameters now:

  /** Whether to use the persistent version of the kernel (only SINGLE_CTA is supported a.t.m.) */
  bool persistent = false;
  /** Persistent kernel: time in seconds before the kernel stops if no requests received. */
  float persistent_lifetime = 2;
  /** Reduce the grid size of the persistent kernel artificially. */
  float persistent_device_usage = 1.0;
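For reference, a rough usage sketch with these parameters could look like the following. The dataset/index setup is elided and the search call signature is abbreviated from memory, so treat the exact argument types and enum names as illustrative rather than authoritative:

#include <cuvs/neighbors/cagra.hpp>

#include <raft/core/device_mdspan.hpp>
#include <raft/core/resources.hpp>

#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// `res`, `index`, and the per-thread query/output views are assumed to be
// prepared elsewhere.
void run_persistent_search(
  raft::resources const& res,
  cuvs::neighbors::cagra::index<float, uint32_t> const& index,
  std::vector<raft::device_matrix_view<const float, int64_t>> const& queries,
  std::vector<raft::device_matrix_view<uint32_t, int64_t>> const& neighbors,
  std::vector<raft::device_matrix_view<float, int64_t>> const& distances)
{
  cuvs::neighbors::cagra::search_params params;
  params.algo                    = cuvs::neighbors::cagra::search_algo::SINGLE_CTA;
  params.persistent              = true;  // route all requests to one persistent kernel
  params.persistent_lifetime     = 2;     // seconds without requests before the kernel stops
  params.persistent_device_usage = 1.0;   // fraction of the GPU occupied by the kernel

  // Many CPU threads submit small batches concurrently; they all feed the same
  // persistent kernel, and each call blocks until its own results are ready.
  std::vector<std::thread> workers;
  for (std::size_t i = 0; i < queries.size(); i++) {
    workers.emplace_back([&, i] {
      cuvs::neighbors::cagra::search(res, params, index, queries[i], neighbors[i], distances[i]);
    });
  }
  for (auto& t : workers) { t.join(); }
}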

achirkin avatar Jul 24 '24 13:07 achirkin

+1 to @tfeher's request for more documentation. We should be very clear and concise about why the feature exists, how to enable it, what side effects to expect, how to use it (with examples), and what to look out for.

cjnolet avatar Jul 29 '24 20:07 cjnolet

Thanks for the suggestion, @cjnolet and @tfeher. I've added a comprehensive example of how to use the persistent kernel in examples/cpp/src/cagra_persistent_example.cu.

achirkin avatar Jul 31 '24 14:07 achirkin

  • Library size in a recent build is 875 MB https://downloads.rapids.ai/ci/cuvs/pull-request/254/91f48f8/cuda11_x86_64.compile_lib.html
  • Library size after this PR 1.2 GB https://downloads.rapids.ai/ci/cuvs/pull-request/215/04c2745/cuda12_x86_64.compile_lib.html

tfeher avatar Jul 31 '24 16:07 tfeher

Binary size increase is almost 60MB: 620 -> 679 MB

However, it seems like we still have some unused search instances! e.g.

neighbors/detail/cagra/search_single_cta_float_uint64.cu
neighbors/detail/cagra/search_single_cta_half_uint64.cu

These are compiled, but not used, because there's no corresponding neighbors/cagra_search_xxx.cu for them.

achirkin avatar Sep 26 '24 15:09 achirkin

/merge

cjnolet avatar Sep 27 '24 01:09 cjnolet

@achirkin can you remove the unused instantiations in a follow up PR?

cjnolet avatar Sep 27 '24 01:09 cjnolet

Sure, I will remove the unused instantiations.

Also, I've only got partial benchmark results because the machine was shut down by the runtime limit, but the results I did get suggest there's no performance drop since the last results in August: https://docs.google.com/spreadsheets/d/1oe58EjBzKd1m3PTFsK7ON0ZsqQH0fzSxFhpMhWy8neg/edit?gid=17032142#gid=17032142

achirkin avatar Sep 27 '24 08:09 achirkin

@cjnolet I've removed these instances in #264, which is ready for review now. #264 uses many of the currently unused instances, but in total it leads to a small reduction in binary size.

achirkin avatar Sep 27 '24 11:09 achirkin