oneAPI-spec Device API for random number generation

oneMKL has a host API for generating random numbers on the device, but no device API that can be called from inside a kernel. The problem is that we sometimes do not know how many random numbers a kernel will need before we launch it. Even when we do know, the number of random numbers needed might be too large to fit in memory.

Could oneAPI provide a device API for random number generation?

May 15 '20 00:05 WeiqunZhang

@mkrainiuk: Can you look at this?

May 15 '20 00:05 rscohn2

Hi @WeiqunZhang, yes, we plan to provide a device API for selected oneMKL domains, including random number generators. We have done some prototyping and are currently discussing the API. If you have any ideas about what API would best fit your kernels, please share them with us.

May 15 '20 01:05 mkrainiuk

summon @marius-cornea

May 15 '20 01:05 mkrainiuk

Something like the code below would work for us.

cl::sycl::queue queue(...);
std::size_t NSTATES = ...; // # of states in the engine
onemkl::rng::philox4x32x10 engine(queue, NSTATES, SEED);

auto NTHREADS = NSTATES; // one work-item per engine state
queue.submit([&] (cl::sycl::handler& h)
{
    // non-owning handle to the engine states (made-up type, see below)
    onemkl::rng::accessor<onemkl::rng::philox4x32x10> engine_accessor(engine);
    h.parallel_for(cl::sycl::range<1>(NTHREADS),
    [=] (cl::sycl::item<1> item)
    {
        onemkl::rng::uniform<double> distr(0.0, 1.0); // uniform random double [0,1)
        auto r = distr(engine_accessor[item.get_linear_id()]);
    });
});

onemkl::rng::accessor is something I made up. I imagine it might be hard to capture philox4x32x10 by value on the device because of resource ownership issues. So the accessor could be something like a struct holding a non-owning pointer, which can be freely copied.

Also, we usually use an ordered queue. In that case, we would also like to be able to construct an accessor object outside the command group scope (i.e., outside queue.submit()). Being able to do this is actually very important to us.
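To make the "non-owning pointer" idea concrete, here is a minimal host-only sketch in plain C++. It is not oneMKL or SYCL code: toy_state, toy_engine, toy_accessor and next_uniform are hypothetical names, and the generator is a placeholder LCG rather than Philox. It only shows an owning object next to a freely copyable view of its states.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <type_traits>
#include <vector>

// Hypothetical per-stream state (a real Philox state would hold a key and a counter).
struct toy_state {
    std::uint64_t s;
};

// Owning side: allocates and frees the states. Stands in for the engine.
class toy_engine {
public:
    toy_engine(std::size_t nstates, std::uint64_t seed) : states_(nstates) {
        for (std::size_t i = 0; i < nstates; ++i) states_[i].s = seed + i;
    }
    toy_state* data() { return states_.data(); }
    std::size_t size() const { return states_.size(); }
private:
    std::vector<toy_state> states_;
};

// Non-owning side: just a pointer and a length, so copies are cheap
// and all copies refer to the same states.
struct toy_accessor {
    toy_state* ptr;
    std::size_t n;
    explicit toy_accessor(toy_engine& e) : ptr(e.data()), n(e.size()) {}
    toy_state& operator[](std::size_t i) const {
        assert(i < n);
        return ptr[i];
    }
};

// The view can be copied bit by bit, unlike the owning engine.
static_assert(std::is_trivially_copyable<toy_accessor>::value,
              "the view must be trivially copyable");

// Placeholder generator (an LCG, not Philox): advances the state
// and returns a double in [0,1).
inline double next_uniform(toy_state& st) {
    st.s = st.s * 6364136223846793005ULL + 1442695040888963407ULL;
    return static_cast<double>(st.s >> 11) * (1.0 / 9007199254740992.0);
}
```

Because the view owns nothing, it could be constructed once, outside any command group, and then captured by value in as many lambdas as needed. The engine must outlive every copy of the view.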

In the first code example above, we assume the number NSTATES passed to the constructor of the engine is the same as NTHREADS, the number of threads used to launch the device kernel. Ideally we would like this to be relaxed so that NTHREADS only has to be a multiple of NSTATES, because we may launch a kernel with many more threads than the maximum number of threads the hardware can run simultaneously. If NSTATES must equal NTHREADS, it would waste a lot of memory and we would have to recreate the engine (which is probably very expensive) whenever we launch a second kernel with a different number of threads.

To get around that in our CUDA code, we have implemented a mutex so that we can launch kernels with different numbers of threads using the same engine. The threads in a group that need random numbers call lock; the others do not. If one thread locks it, the whole group owns it. These threads can then generate any number of random numbers. Finally they unlock the mutex to let another group use it. Those that call lock must call unlock, but not all threads in the group have to go through the lock-generate-unlock path. If oneMKL can provide such a mutex, that would be perfect. So the perfect API for us would be

cl::sycl::queue queue(...);   // ordered queue
std::size_t NSTATES = ...; // # of states in the engine
onemkl::rng::philox4x32x10 engine(queue, NSTATES, SEED);

// constructed outside the command group scope
onemkl::rng::accessor<onemkl::rng::philox4x32x10> engine_accessor(engine);

auto NTHREADS = n*NSTATES; // n: some positive integer
queue.submit([&] (cl::sycl::handler& h)
{
    h.parallel_for(cl::sycl::range<1>(NTHREADS),
    [=] (cl::sycl::item<1> item)
    {
        if (item.get_local_linear_id() < 5) {
            onemkl::rng::scoped_mutex mutx(engine_accessor);
            onemkl::rng::uniform<double> distr(0.0, 1.0); // uniform random double [0,1)
            auto r = distr(engine_accessor[item.get_linear_id()]);
            // some threads might generate more random numbers
            // some might use a different distribution
        }
    });
});
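For readers who want to see the lock-generate-unlock pattern in something that compiles today, here is a rough host-side analogy in standard C++. It is not device code and not the CUDA implementation mentioned above: each std::thread stands in for one thread group, and slot, scoped_slot_lock, next_uniform and run_groups are hypothetical names. The choice to take any free state, rather than a fixed one, is an assumption of this sketch, and the generator is a placeholder LCG rather than Philox.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// One RNG state plus a flag saying whether some group currently owns it.
struct slot {
    std::atomic<bool> busy{false};
    std::uint64_t state{0};
};

// Scoped lock: the constructor spins until it claims a free slot,
// the destructor releases it so another group can reuse the state.
class scoped_slot_lock {
public:
    scoped_slot_lock(std::vector<slot>& pool, std::size_t start) : pool_(pool) {
        assert(!pool.empty());
        std::size_t i = start % pool_.size();
        for (;;) {
            bool expected = false;
            if (pool_[i].busy.compare_exchange_weak(expected, true,
                                                    std::memory_order_acquire)) {
                idx_ = i;
                return;
            }
            i = (i + 1) % pool_.size();  // try the next slot
            std::this_thread::yield();
        }
    }
    ~scoped_slot_lock() {
        pool_[idx_].busy.store(false, std::memory_order_release);
    }
    scoped_slot_lock(const scoped_slot_lock&) = delete;
    scoped_slot_lock& operator=(const scoped_slot_lock&) = delete;

    std::uint64_t& state() { return pool_[idx_].state; }

private:
    std::vector<slot>& pool_;
    std::size_t idx_ = 0;
};

// Placeholder generator (an LCG, not Philox): returns a double in [0,1).
inline double next_uniform(std::uint64_t& s) {
    s = s * 6364136223846793005ULL + 1442695040888963407ULL;
    return static_cast<double>(s >> 11) * (1.0 / 9007199254740992.0);
}

// Runs ngroups "groups" over only nstates states and returns how many
// of the generated numbers fell in [0,1).
inline std::size_t run_groups(std::size_t nstates, std::size_t ngroups,
                              std::size_t draws_per_group) {
    std::vector<slot> pool(nstates);
    for (std::size_t i = 0; i < nstates; ++i) pool[i].state = 1234 + i;

    std::atomic<std::size_t> in_range{0};
    std::vector<std::thread> groups;
    for (std::size_t g = 0; g < ngroups; ++g) {
        groups.emplace_back([&pool, &in_range, g, draws_per_group] {
            scoped_slot_lock lock(pool, g);  // lock
            for (std::size_t d = 0; d < draws_per_group; ++d) {
                double r = next_uniform(lock.state());  // generate
                if (r >= 0.0 && r < 1.0) in_range.fetch_add(1);
            }
        });  // unlock happens when the guard goes out of scope
    }
    for (auto& t : groups) t.join();
    return in_range.load();
}
```

Calling run_groups(4, 16, 10) runs sixteen groups over four states. The point of the pattern is that the number of states is fixed when the pool is created, while the number of groups can change from launch to launch.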

May 15 '20 04:05 WeiqunZhang