
Issue with ManagedArray::Operator[] and RAJA

Open rcarson3 opened this issue 3 years ago • 3 comments

@davidbeckingsale and others

I recently ran into a case for an application I'm working on where I had the initial chai::ExecutionSpace set to chai::ExecutionSpace::GPU for a chai::ManagedArray, and then ran it within a RAJA forall loop on the CPU to initialize the data for some unit tests. Later on, when trying to access the data on the host using ManagedArray::operator[], I would get a segfault. My understanding, after talking with @davidbeckingsale offline, is that this should work, as the data should by then be allocated on the host/CPU.

I've included an MWE reproducing the issue.

One other issue I noted: doing a lambda capture by reference for the host RAJA loop also resulted in a segfault. I've included that below as well, as a commented-out section of code. I will note that this sort of lambda capture works fine if the original ExecutionSpace is set to the CPU.

I'm using the following hashes for CHAI, RAJA, and Umpire:
CHAI: df3e8e0f1b4350622bb9a68cfc7a23edc16e4bb9
RAJA: 3047fa720132d19ee143b1fcdacaa72971f5988c (v0.13.0 tagged release)
Umpire: 447f4640eff7b8f39d3c59404f3b03629b90c021 (v4.1.2 tagged release)

Additional information: Compiled on rzansel with gcc/7.3.1, cuda/10.1.243, and cmake/3.14.5

#include "RAJA/RAJA.hpp"

#include "umpire/strategy/DynamicPool.hpp"
#include "umpire/Allocator.hpp"
#include "umpire/ResourceManager.hpp"

#include "chai/config.hpp"
#include "chai/ExecutionSpaces.hpp"
#include "chai/ManagedArray.hpp"

#include <iostream>

int main()
{
  auto& rm = umpire::ResourceManager::getInstance();
  auto host_allocator = rm.getAllocator("HOST");
#ifdef __CUDACC__
  auto device_allocator = rm.makeAllocator<umpire::strategy::DynamicPool>(
      "DEVICE_pool", rm.getAllocator("DEVICE"));
#endif

  const int size = 5000;

  chai::ManagedArray<double> array(size,
      std::initializer_list<chai::ExecutionSpace>{chai::CPU
#if defined(CHAI_ENABLE_CUDA) || defined(CHAI_ENABLE_HIP)
          , chai::GPU
#endif
      },
      std::initializer_list<umpire::Allocator>{host_allocator
#if defined(CHAI_ENABLE_CUDA) || defined(CHAI_ENABLE_HIP)
          , device_allocator
#endif
      },
      chai::ExecutionSpace::GPU);

  std::cout << "Running GPU runs" << std::endl;
  // This works
  RAJA::forall<RAJA::cuda_exec<256>>(RAJA::RangeSegment(0, size),
      [=] __device__ (int i) {
        array[i] = i;
      });

  std::cout << "Running CPU runs" << std::endl;
  // This should work but fails
  // RAJA::forall<RAJA::seq_exec>(RAJA::RangeSegment(0, size),
  //     [&] (int i) {
  //       array[i] = i;
  //     });
  // This works
  RAJA::forall<RAJA::seq_exec>(RAJA::RangeSegment(0, size),
      [=] (int i) {
        array[i] = i;
      });

  std::cout << "Printing out data" << std::endl;
  // These work
  // std::cout << array.data(chai::ExecutionSpace::CPU)[0] << std::endl;
  // std::cout << array.data()[0] << std::endl;
  // This should work since we last ran things on the CPU but fails
  std::cout << array[0] << std::endl;

  array.free();
  return 0;
}

rcarson3 avatar Jul 28 '21 17:07 rcarson3

Hey Robert, CHAI requires lambdas to capture managed arrays by value – it's the resulting copy constructor call that ensures data is allocated (if necessary) and moved (if necessary) to the execution space the loop runs in. Any use of operator[] outside of a RAJA loop context is essentially undefined behavior: CHAI makes no guarantee that the pointer dereferenced by the [] is valid in such a case, and even if it is valid, CHAI makes no guarantee it points to live, correct data anywhere except from within a RAJA context.

The .data() calls and .pick() calls are a way to ensure that you are referencing valid data outside of a RAJA context, as you’ve observed.

On ALE3D we use CARE::host_device_ptr (which wraps chai::ManagedArray) and some clang-query checks to provide compile-time checks that ensure we do not do this. -Peter


robinson96 avatar Jul 28 '21 17:07 robinson96

@robinson96 thanks for the quick response. I was unaware of the capture-by-value requirement for the internal data transfer to work. The same goes for the [] operator being undefined behavior outside of RAJA loops. I will definitely be careful about this in my application codes and will need to update some libraries regarding the above lambda capture issue.

If y'all would be open to it, I can open a PR with some small documentation changes noting these requirements.

rcarson3 avatar Jul 28 '21 18:07 rcarson3

That sounds like a good idea to me. -Peter


robinson96 avatar Jul 28 '21 20:07 robinson96