stdgpu add `occupied(n)` for unordered set and map

Add bool occupied(index n) function for unordered_set and unordered_map. So that we can:

struct AllocateNewBlocks {
  AllocateNewBlocks(stdgpu::unorderd_map<xx> block_map,
                    BlockBuffer buffer_buffer,
                    stdgpu::unorderd_set<xx> not_exist_block_indices)
      : buffer_buffer(buffer_buffer),
        block_map(block_map),
        not_exist_block_indices(not_exist_block_indices) {}

  __device__ void operator()(const stdgpu::index_t index) {
    if (!not_exist_block_indices.occupied(index)) { // Used here.
      return;
    }
    const BlockIndex block_index = *(not_exist_block_indices.begin() + index);
    if (const auto& [iter, is_inserted] = block_map.emplace(block_index, 0); is_inserted) {
      iter->second = buffer_view.AllocateBlock();
    }
  }

  BlockBuffer buffer_buffer;
  stdgpu::unorderd_map<xx> block_map;
  stdgpu::unorderd_set<xx> not_exist_block_indices;
}; 


stdgpu::for_each_index(thrust::cuda::par.on(stream()),
                       not_exist_block_indices().max_size(),
                       AllocateNewBlocks(block_map, block_buffer, not_exist_block_indices);

It skips the need for getting device_range.

Aug 08 '24 16:08 tanzby

@stotko help to review this PR, thanks

Aug 08 '24 16:08 tanzby

Codecov Report

Attention: Patch coverage is 0% with 4 lines in your changes missing coverage. Please review.

Project coverage is 97.19%. Comparing base (3b7d712) to head (87fe7d7). Report is 22 commits behind head on master.

Files with missing lines	Patch %	Lines
src/stdgpu/impl/unordered_map_detail.cuh	0.00%	2 Missing :warning:
src/stdgpu/impl/unordered_set_detail.cuh	0.00%	2 Missing :warning:

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #427      +/-   ##
==========================================
- Coverage   97.34%   97.19%   -0.16%     
==========================================
  Files          32       32              
  Lines        2524     2528       +4     
==========================================
  Hits         2457     2457              
- Misses         67       71       +4

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

🚨 Try these New Features:

Flaky Tests Detection - Detect and resolve failed and flaky tests

Aug 08 '24 16:08 codecov[bot]

Add bool occupied(index n) function for unordered_set and unordered_map. So that we can:

struct AllocateNewBlocks {
  AllocateNewBlocks(stdgpu::unorderd_map<xx> block_map,
                    BlockBuffer buffer_buffer,
                    stdgpu::unorderd_set<xx> not_exist_block_indices)
      : buffer_buffer(buffer_buffer),
        block_map(block_map),
        not_exist_block_indices(not_exist_block_indices) {}

  __device__ void operator()(const stdgpu::index_t index) {
    if (!not_exist_block_indices.occupied(index)) { // Used here.
      return;
    }
    const BlockIndex block_index = *(not_exist_block_indices.begin() + index);
    if (const auto& [iter, is_inserted] = block_map.emplace(block_index, 0); is_inserted) {
      iter->second = buffer_view.AllocateBlock();
    }
  }

  BlockBuffer buffer_buffer;
  stdgpu::unorderd_map<xx> block_map;
  stdgpu::unorderd_set<xx> not_exist_block_indices;
}; 


stdgpu::for_each_index(thrust::cuda::par.on(stream()),
                       not_exist_block_indices().max_size(),
                       AllocateNewBlocks(block_map, block_buffer, not_exist_block_indices);

It skips the need for getting device_range.

Thanks for working on this. However, I believe that exposing the occupied function is not the right way to move forward since this function really meant as an implementation detail of the base container. Even exposing begin(), for symmetry with end() (required for find()), already gives more access to the internals than typically needed.

While the use case you mentioned is fine, it may suffer from bad performance since the load factor of unordered_map/unordered_set is typically low and thus the container is only sparsely filled, which would lead to a low thread utilization in your kernel where many threads of a warp immediately return in the if statement. That is the reason for having device_range() as it allows to densely pack all occupied values.

As mentioned in #423, adding (host-only) overloads with an additional stream argument for the load() and store() function of atomic would be better as it addresses the actual underlying problem. More concretely, I see two options here:

Also implement a stream-aware host-to-device memcpy function: Clean, but not directly straightforward to do as the stream is a template class and the internal memory management system is intentionally strongly decoupled from the rest of the library.
Simulate the memcpy with a "no-op" transform_reduce_index to let thrust do the work for us: More like a workaround and might be inefficient in terms of performance.

Aug 12 '24 13:08 stotko

The underlying issue has been fixed in #450.

Nov 20 '24 10:11 stotko