Moving tasks to sublocales doesn't work in the library mode

Open e-kayrakli opened this issue 8 months ago • 2 comments

Consider the following pieces of code:

udf2.chpl:

module Foo {

  extern proc chpl_task_getRequestedSubloc(): int(32);
  extern proc printf(s...);

  export proc add_int32(ref result: [] int(32), const ref a: [] int(32),
                        const ref b: [] int(32)) {
    printf("before on\n");
    on here.gpus[0] { // `sync begin on` is needed
      printf("subloc : %d\n", chpl_task_getRequestedSubloc()); // prints -2 without `sync begin on`
      @assertOnGpu
      foreach i in 0..2 {
        result[i] = a[i] + b[i];
      }
    }
  }
} // Foo

test_udf2.cpp:

#include <iostream>

#include "udf2.h"

int main(int argc, char **argv) {
  chpl_library_init(argc, argv);
  chpl__init_Foo(0, 0);

  int32_t x[] = {5, 1, 4}; // these should be allocated on the GPU
  int32_t y[] = {5, 7, 8}; // but that's irrelevant for the issue here
  int32_t result[3];

  chpl_external_array col_result{result, 3, nullptr};
  chpl_external_array col_a{x, 3, nullptr};
  chpl_external_array col_b{y, 3, nullptr};

  add_int32(&col_result, &col_a, &col_b);

  for (int i = 0; i < 3; i++) {
    std::cout << "result[" << i << "] = " << result[i] << std::endl;
  }

  chpl_library_finalize();
  return 0;
}

When add_int32 is invoked, it runs directly on the same pthread as the C++ application. In other words, there's no qthread to speak of. Probably more accurately, and sticking with the qthread lingo, we are running on a qthread shepherd. Our tasking layer seems to handle that case relatively well (for example, chpl_task_getSerial works in that condition, which lets us run many of our parallel iterators), but adjusting sublocales is where it falls short. This is noted in a comment in the code, along with another potential issue with that approach.

This is currently a "silent failure": moving a task from the host to a device is a no-op, and we don't get a warning or error for it. In the short term, it seems like we should fix that. @assertOnGpu is another relevant tool for diagnosing the issue.

The workaround seems to be using sync begin on instead, because we need to create a qthread so that we can move it to the GPU sublocale. I believe that's the long-term solution to the problem as well: we should be able to tell that the task is immovable, trigger a sync begin, and do the on inside the created task instead.
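
For reference, here is a minimal sketch of udf2.chpl with that workaround applied. It assumes the only change needed is replacing `on` with `sync begin on`; everything else is copied from the reproducer. Depending on the compiler version, an explicit task intent such as `with (ref result)` may also be required, which I have not verified here.

```chapel
module Foo {

  extern proc chpl_task_getRequestedSubloc(): int(32);
  extern proc printf(s...);

  export proc add_int32(ref result: [] int(32), const ref a: [] int(32),
                        const ref b: [] int(32)) {
    printf("before on\n");
    // `begin` creates a new task (a qthread) that can be moved to the GPU
    // sublocale; `sync` makes the caller wait until that task has finished.
    sync begin on here.gpus[0] {
      // With a plain `on`, this printed -2 in the reproducer.
      printf("subloc : %d\n", chpl_task_getRequestedSubloc());
      @assertOnGpu
      foreach i in 0..2 {
        result[i] = a[i] + b[i];
      }
    }
  }
} // Foo
```

The C++ driver stays unchanged, since the `sync` keeps add_int32 blocking from the caller's point of view.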


A fuller reproducer is in https://github.com/tpn/chpl-gpu-test. Also note that https://github.com/chapel-lang/chapel/issues/25222 is relevant.

e-kayrakli, Jun 11 '24 22:06