cuvs icon indicating copy to clipboard operation
cuvs copied to clipboard

Optimize euclidean distance in host refine phase

Open anstellaire opened this issue 10 months ago • 9 comments

Issue

Original code (below) generated serial assembly and used strictly-ordered fadda instruction on ARM with gcc & clang. That resulted in suboptimal performance.

for (size_t k = 0; k < dim; k++) {
  distance += DC::template eval<DistanceT>(query[k], row[k]);
}

Proposed solution

This PR provides euclidean distance optimized with partial vector sum (below), that helps vectorization but loses strcictly-ordered compliance.

template <typename DC, typename DistanceT, typename DataT>
DistanceT euclidean_distance_squared_generic(DataT const* a, DataT const* b, size_t n) {
  size_t constexpr max_vreg_len = 512 / (8 * sizeof(DistanceT));

  // max_vreg_len is a power of two
  size_t n_rounded = n & (0xFFFFFFFF ^ (max_vreg_len - 1));
  DistanceT distance[max_vreg_len] = {0};

  for (size_t i = 0; i < n_rounded; i += max_vreg_len) {
    for (size_t j = 0; j < max_vreg_len; ++j) {
      distance[j] += DC::template eval<DistanceT>(a[i + j], b[i + j]);
    }
  }

  for (size_t i = n_rounded; i < n; ++i) {
    distance[i] += DC::template eval<DistanceT>(a[i], b[i]);
  }

  for (size_t i = 1; i < max_vreg_len; ++i) {
    distance[0] += distance[i];
  }

  return distance[0];
}

In addition, it has an implementation with NEON intrinsics which provides further speedup on certain test cases (can be removed if arch-specific code is undesired).

Results

image

anstellaire avatar Feb 13 '25 13:02 anstellaire

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

copy-pr-bot[bot] avatar Feb 13 '25 13:02 copy-pr-bot[bot]

/ok to test

cjnolet avatar Feb 13 '25 16:02 cjnolet

/ok to test

UPD: @cjnolet, seems like CI is triggered only by repository members, could you please do it one more time? I changed formatting with clang-format.

anstellaire avatar Feb 14 '25 09:02 anstellaire

/ok to test

lowener avatar Feb 19 '25 12:02 lowener

You have changed the distance computation for the large batch size case, but did not change for the small batch case (which is handled in a separate branch here). Is this because your benchmarks have shown no improvement for the small batch case?

Correct, on small batch size I saw a minor performance degradation, so I decided to apply optimization only to a large batch.

anstellaire avatar Mar 21 '25 09:03 anstellaire

/ok to test

tfeher avatar Mar 22 '25 09:03 tfeher

/ok to test

tfeher avatar Mar 22 '25 10:03 tfeher

/ok to test 2de39ef

tfeher avatar May 02 '25 10:05 tfeher

/ok to test 18fe20f

tfeher avatar May 02 '25 13:05 tfeher

/merge

tfeher avatar May 05 '25 09:05 tfeher