cccl icon indicating copy to clipboard operation
cccl copied to clipboard

[BUG]: stdpar multicore scan tests fail at runtime after commit 1238f287f4f654bff12e4d4a29b7dfa1c40a12fe

Open zkhatami opened this issue 2 months ago • 10 comments

Is this a duplicate?

  • [x] I confirmed there appear to be no duplicate issues for this bug and that I agree to the Code of Conduct

Type of Bug

Runtime Error

Component

Thrust

Describe the bug

stdpar multicore scan tests have started failing in correctness after the following commit: 1238f287f4f654bff12e4d4a29b7dfa1c40a12fe For this example:

#include <execution>
#include <algorithm>
#include <numeric>
#include <cassert>

constexpr int N = 10;

int main() {
  int* data = new int[N];
  int* output = new int[N];
  std::iota(data, data + N, 0);
  auto r0 = std::exclusive_scan(std::execution::par, data, data + N, output, 10);
  assert(r0 == output + N);
  delete[] output;
  delete[] data;
}

The program fails at runtime with the following assertion error:

int main(): Assertion `r0 == output + N' failed.

How to Reproduce

nvc++ -stdpar=multicore -fast --c++17 scan.cpp

Expected behavior

The program should pass the assertion (r0 == output + N) and complete successfully.

Reproduction link

No response

Operating System

No response

nvidia-smi output

No response

NVCC version

No response

zkhatami avatar Oct 22 '25 17:10 zkhatami

Hi @zkhatami!

Thanks for submitting this issue - the CCCL team has been notified and we'll get back to you as soon as we can! In the mean time, feel free to add any relevant information to this issue.

github-actions[bot] avatar Oct 22 '25 17:10 github-actions[bot]

This line is wrong

return result;

should be

return result + n;

dkolsen-pgi avatar Oct 22 '25 17:10 dkolsen-pgi

The fix doesn’t seem to be comprehensive enough, as this case still fails:

#include <algorithm>
#include <execution>
#include <functional>
#include <numeric>
#include <vector>
#include <random>
#include <cassert>

int main() {
  std::mt19937 gen(1729);
  std::vector<unsigned int> s(1729);
  std::generate(s.begin(), s.end(), std::ref(gen));
  std::vector<unsigned int> d(1729);
  assert(d.end() == inclusive_scan(std::execution::par, s.cbegin(), s.cend(), d.begin(), std::multiplies<>{}));
  std::partial_sum(s.cbegin(), s.cend(), s.begin(), std::multiplies<>{});
  assert(s == d);
}
nvc++  -stdpar=multicore --c++17 test.cpp ; ./a.out
a.out: parallel_inclusive_scan.cpp:16: int main(): Assertion `s == d' failed.

zkhatami avatar Nov 03 '25 18:11 zkhatami

@bernhardmgruber would you please take a look?

zkhatami avatar Nov 03 '25 18:11 zkhatami

i think here as what i understand it is unsigned int{} which gets initialized to 0 producing an error.

maybe, if we use first element as the init value instead, it would work?

charan-003 avatar Nov 09 '25 08:11 charan-003

@charan-003 is absolutely right. You cannot assume that accum_t{} is a valid initial value for anything. If the element type is int and the binary operation is multiplication, then using accum_t{} will incorrectly set the entire result to zero. In fact you can't even assume that accum_t{} is well formed. There is no guarantee that the accumulator type has a default constructor.

@charan-003 is also correct that you can fix step 1 by using the first element as the initial value and reducing over the remaining elements. But the accum_t{} in step 2 is still a problem, and that one is harder to fix.

dkolsen-pgi avatar Nov 09 '25 17:11 dkolsen-pgi

@dkolsen-pgi thanks for the reply.

I made few changes in this PR,I tested both test cases from the issue, and they're passing:

Test 1:

#include <execution>
#include <algorithm>
#include <numeric>
#include <cassert>

constexpr int N = 10;

int main() {
  int* data = new int[N];
  int* output = new int[N];
  std::iota(data, data + N, 0);
  auto r0 = std::exclusive_scan(std::execution::par, data, data + N, output, 10);
  assert(r0 == output + N);
  delete[] output;
  delete[] data;
}

Test 2:

#include <algorithm>
#include <execution>
#include <functional>
#include <numeric>
#include <vector>
#include <random>
#include <cassert>

int main() {
  std::mt19937 gen(1729);
  std::vector<unsigned int> s(1729);
  std::generate(s.begin(), s.end(), std::ref(gen));
  std::vector<unsigned int> d(1729);
  assert(d.end() == inclusive_scan(std::execution::par, s.cbegin(), s.cend(), d.begin(), std::multiplies<>{}));
  std::partial_sum(s.cbegin(), s.cend(), s.begin(), std::multiplies<>{});
  assert(s == d);
}

Just wondering if there are any other test cases I should consider, or if something else is missing

charan-003 avatar Nov 10 '25 01:11 charan-003

Just wondering if there are any other test cases I should consider, or if something else is missing

The original change overhauled the implementation of the scan algorithms in the Thrust OpenMP back end. Parallel scan implementations can be very tricky with nasty edge cases. To be confident that everything is correct requires thorough testing: inclusive_scan, transform_inclusive_scan (both of them with and without an initial value), exclusive_scan, and transform_exclusive_scan. Test with a non-commutative binary operator such as string concatenation. Test with a type whose default constructor gives the object a poison value to catch if default values are used anywhere. Test with the input and output being in the same location. (That can be particularly tricky for exclusive_scan, but I think this implementation gets that right.)

(I recently implemented parallel scan algorithms with OpenACC. It was quite challenging. A thorough test suite was extremely helpful.)

dkolsen-pgi avatar Nov 10 '25 02:11 dkolsen-pgi

@charan-003 I added a few more tests to #6560. Please add further tests as you see fit!

bernhardmgruber avatar Nov 10 '25 09:11 bernhardmgruber

@dkolsen-pgi @bernhardmgruber Thanks for the detailed feedback! I've added tests covering all the cases you mentioned.

Added:

  • inclusive_scan and exclusive_scan (with and without init)
  • transform_inclusive_scan and transform_exclusive_scan
  • Non-additive operations (multiplication operator)
  • In-place scans (input == output)
  • Custom type with poison default constructor to catch uninitialized values
  • Large arrays (10k elements), boundary cases (1024 at threshold), and small arrays (2 elements)

Happy to make more changes as required :))

charan-003 avatar Nov 12 '25 09:11 charan-003