[BUG]: static_multimap<> insert() not working as expected

Open PramodShenoy opened this issue 1 year ago • 36 comments

Is this a duplicate?

  • [X] I confirmed there appear to be no duplicate issues for this bug (https://github.com/NVIDIA/cuCollections/issues)

Type of Bug

Runtime Error

Describe the bug

I have a kernel that tries to insert unique keys and values into the map. On querying the size of the map on the host using get_size(), I observe it is always equal to the block size divided by the cooperative group size. It seems only one entry is being inserted into the map per cooperative group. The behavior is similar even when there are collisions.

Here is the host code -

#include <cuco/static_multimap.cuh>

#include <thrust/device_vector.h>
#include <thrust/host_vector.h>

#include <cooperative_groups.h>

#include <cmath>
#include <cstddef>
#include <iostream>

// The insertMap kernel is defined in the "How to Reproduce" section below.
__global__ void insertMap(cuco::static_multimap<int, int>::device_mutable_view dv);

int main(int argc, char const *argv[])
{
    using Key   = int;
    using Value = int;

    // Empty slots are represented by reserved "sentinel" values. These values should be selected such
    // that they never occur in your input data.
    Key constexpr empty_key_sentinel     = -1;
    Value constexpr empty_value_sentinel = -1;

    // Number of key/value pairs to be inserted
    std::size_t constexpr num_keys = 50000;

    // Compute capacity based on a 50% load factor
    auto constexpr load_factor = 0.5;
    std::size_t const capacity = std::ceil(num_keys / load_factor);

    // Constructs a map with "capacity" slots using -1 and -1 as the empty key/value sentinels.
    cuco::static_multimap<Key, Value> map{capacity,
                                          cuco::sentinel::empty_key{empty_key_sentinel},
                                          cuco::sentinel::empty_value{empty_value_sentinel}};

    auto device_mutable_view = map.get_device_mutable_view();
    insertMap<<<dim3(1, 1, 1), dim3(128, 1, 1)>>>(device_mutable_view);
    cudaDeviceSynchronize();
    std::cout << "Map size = " << map.get_size() << std::endl;

    // Query a single key (1) and retrieve all of its matching pairs.
    thrust::device_vector<int> inputs;
    inputs.push_back(1);
    std::size_t const size = map.count(inputs.begin(), inputs.end());
    std::cout << size << std::endl;
    thrust::device_vector<cuco::pair<int, int>> res(size);
    map.retrieve(inputs.begin(), inputs.end(), res.begin());
    thrust::host_vector<cuco::pair<int, int>> h_res(res);

    for (std::size_t i = 0; i < size; i++)
        std::cout << h_res[i].first << " " << h_res[i].second << "\n";

    return 0;
}

How to Reproduce

Example kernel 1 - all unique keys and values

__global__ void insertMap(cuco::static_multimap<int, int>::device_mutable_view dv) {
    namespace cg = cooperative_groups;
    unsigned int tid = threadIdx.x;
    auto g = cg::tiled_partition<8U>(cg::this_thread_block());
    cuco::pair<int, int> p;
    p.first  = tid;
    p.second = tid;
    // printf("inserting key %d value = %d\n", p.first, p.second);
    dv.insert(g, p);
}

Example Kernel 2 - Single key with unique values

__global__ void insertMap(cuco::static_multimap<int, int>::device_mutable_view dv) {
    namespace cg = cooperative_groups;
    unsigned int tid = threadIdx.x;
    auto g = cg::tiled_partition<8U>(cg::this_thread_block());
    cuco::pair<int, int> p;
    p.first  = 1;
    p.second = tid;
    // printf("inserting key %d value = %d\n", p.first, p.second);
    dv.insert(g, p);
}

Example Kernel 3 - Grouped keys with unique values

__global__ void insertMap(cuco::static_multimap<int, int>::device_mutable_view dv) {
    namespace cg = cooperative_groups;
    unsigned int tid = threadIdx.x;
    auto g = cg::tiled_partition<8U>(cg::this_thread_block());
    cuco::pair<int, int> p;
    p.first  = tid % 10;
    p.second = tid;
    // printf("inserting key %d value = %d\n", p.first, p.second);
    dv.insert(g, p);
}

In all the above cases, map.get_size() on the host returns 16.

Expected behavior

In all the above cases, the map size is expected to equal the number of threads, i.e. 128.

Reproduction link

No response

Operating System

No response

nvidia-smi output

Mon Nov 7 20:01:22 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.27.04    Driver Version: 460.27.04    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro RTX 6000     Off  | 00000000:3B:00.0 Off |                  Off |
| 31%   24C    P0    27W / 260W |      0MiB / 24220MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

NVCC version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Nov_30_19:08:53_PST_2020
Cuda compilation tools, release 11.2, V11.2.67
Build cuda_11.2.r11.2/compiler.29373293_0

PramodShenoy avatar Nov 07 '22 20:11 PramodShenoy

@PramodShenoy Thanks for reporting this.

~~get_size will not work properly if device-view insert is directly invoked by users. Currently, it's the user's responsibility to update the size counter if they use device-view insert.~~ Sorry, you are talking about the multimap. Let me look into this and respond to you shortly.

PointKernel avatar Nov 07 '22 20:11 PointKernel

@PramodShenoy To insert n keys with the CG algorithm, we need n * CGSize threads. The implementation is designed so that, with CG size = 8, threads 0 through 7 all return together once any thread in the group successfully finishes an insertion. The same holds for {thread 8, ..., thread 15}, {thread 16, ..., thread 23}, etc. A simple way to remember this: 8 threads per hash map entry. As a result, with 128 total threads and CG size 8, you are always inserting 128 / 8 = 16 keys. That's also why the loop stride here is (gridDim.x * block_size) / tile_size as opposed to gridDim.x * block_size.
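For illustration, here is a minimal sketch (not from the issue itself) of a kernel following this pattern: each CG of 8 threads inserts its own pair and strides over the input, so every one of num_pairs entries gets inserted -

// Hedged sketch, assuming the same device_mutable_view API as above:
// one pair per CG of 8 threads, CG-strided loop over num_pairs entries.
__global__ void insertMapPerCG(cuco::static_multimap<int, int>::device_mutable_view dv,
                               int num_pairs)
{
    namespace cg = cooperative_groups;
    auto g = cg::tiled_partition<8U>(cg::this_thread_block());

    int const cg_id  = static_cast<int>((blockIdx.x * blockDim.x + threadIdx.x) / 8);  // one pair per CG
    int const stride = static_cast<int>((gridDim.x * blockDim.x) / 8);                 // number of CGs in the grid

    for (int i = cg_id; i < num_pairs; i += stride) {
        cuco::pair<int, int> p;
        p.first  = i;
        p.second = i;
        dv.insert(g, p);
    }
}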

Make sense?

PointKernel avatar Nov 07 '22 23:11 PointKernel

Thanks @PointKernel for the explanation! This makes sense now. In our case, if we want each thread block to insert 128 keys, we would need 128 * CGSize threads per thread block. So when should we use a thread block size of 128 with CGSize = 1, and when should we use a thread block size of 1024 with CGSize = 8? Is there any workload pattern that you have observed before?

bwyogatama avatar Nov 07 '22 23:11 bwyogatama

The optimal CG size really depends on your use case. The general guideline is to use a large CG (size 4 or 8) if occupancy (> 50%) or multiplicity is high, and a small one otherwise. CG size = 1 can be beneficial if the occupancy is really low (< 10%).

Yours to discover.

PointKernel avatar Nov 07 '22 23:11 PointKernel

Can you explain what occupancy and multiplicity mean here?

bwyogatama avatar Nov 07 '22 23:11 bwyogatama

Thanks @PointKernel! Can you please confirm whether this is logically correct if we need to override the Cooperative Group size -

cuco::static_multimap<Key, Value,
                      cuda::thread_scope::thread_scope_device,
                      cuco::cuda_allocator<char>,
                      cuco::double_hashing<4, cuco::detail::MurmurHash3_32<Key>, cuco::detail::MurmurHash3_32<Key>>>
    map{capacity,
        cuco::sentinel::empty_key{empty_key_sentinel},
        cuco::sentinel::empty_value{empty_value_sentinel}};

PramodShenoy avatar Nov 07 '22 23:11 PramodShenoy

Can you explain what occupancy and multiplicity mean here?

Oh, if the hash map capacity is 100 and you insert 60 keys, the occupancy is 60%. If the 60 keys are a sequence from 0 to 59 (all unique), the multiplicity is 1. But if they are all the same key, the multiplicity is 60.

PointKernel avatar Nov 07 '22 23:11 PointKernel

Thanks @PointKernel! Can you please confirm whether this is logically correct if we need to override the Cooperative Group size -

cuco::static_multimap<Key, Value,
                      cuda::thread_scope::thread_scope_device,
                      cuco::cuda_allocator<char>,
                      cuco::double_hashing<4, cuco::detail::MurmurHash3_32<Key>, cuco::detail::MurmurHash3_32<Key>>>
    map{capacity,
        cuco::sentinel::empty_key{empty_key_sentinel},
        cuco::sentinel::empty_value{empty_value_sentinel}};

This should work just fine. Note that we need to use the same CG size consistently for all operations of a hash map. Using CG size = 2 for insert and CG size = 4 for retrieve will definitely crash at runtime.
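For example, a minimal device-side sketch (map_type is an assumed alias for the multimap type quoted above, not part of the original thread) where the tile size matches the CG size of 4 baked into the probe sequence -

// Hedged sketch: the tile size used on the device must equal the CG size (4)
// given to cuco::double_hashing above. map_type is an assumed type alias.
__global__ void insertWithCG4(map_type::device_mutable_view dv)
{
    namespace cg = cooperative_groups;
    auto g = cg::tiled_partition<4U>(cg::this_thread_block());  // 4 == probe-sequence CG size

    cuco::pair<int, int> p;
    p.first  = static_cast<int>(threadIdx.x);
    p.second = static_cast<int>(threadIdx.x);
    dv.insert(g, p);
}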

PointKernel avatar Nov 07 '22 23:11 PointKernel

Understood. Thanks a lot!

PramodShenoy avatar Nov 07 '22 23:11 PramodShenoy

Hello @PointKernel
Sorry to continue on this closed thread, but I think it is related to the use case. What is the expected behavior of the multimap if static_multimap::device_mutable_view::insert() is called in a loop in the kernel, as is done for block-wide functions?

My observation is that duplicate values are being inserted. Is there a defined behavior for this case?

PramodShenoy avatar Nov 17 '22 21:11 PramodShenoy

My observation is that duplicate values are being inserted. Is there a defined behavior for this case?

That's correct. All key-value pairs will be inserted into the multimap as long as it still has empty slots. If you expect only unique key-value pairs to be present in the map, cuco::static_map is what you are looking for.

PointKernel avatar Nov 17 '22 21:11 PointKernel

However, the same key-value pair should not be inserted twice, right? I am seeing the same key-value pair inserted twice, and those are the only entries in the cuco::static_multimap<>.

PramodShenoy avatar Nov 17 '22 21:11 PramodShenoy

However, the same key-value pair should not be inserted twice, right? I am seeing the same key-value pair inserted twice, and those are the only entries in the cuco::static_multimap<>.

If you call device_mutable_view::insert twice with the same key/value, then the key/value pair will appear twice in the multimap.

This is the important difference between the static_map and static_multimap.

jrhemstad avatar Nov 17 '22 21:11 jrhemstad

https://godbolt.org/z/esqYPK98T

Take std::unordered_multimap for example: multimaps in general don't guarantee uniqueness.
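A quick host-side illustration of the same point (plain C++, analogous to the godbolt example) -

// std::unordered_multimap happily stores the exact same key-value pair twice.
#include <iostream>
#include <unordered_map>

int main()
{
    std::unordered_multimap<int, int> m;
    m.insert({13, 42});
    m.insert({13, 42});                     // duplicate key *and* value is allowed
    std::cout << m.count(13) << std::endl;  // prints 2
    return 0;
}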

PointKernel avatar Nov 17 '22 21:11 PointKernel

Thank you @PointKernel and @jrhemstad, I just realized that. I was under the impression that std::multimap<> ensures key-value pair uniqueness.

Do you have any recommendations to tackle this problem -

We have some block-wide functions where each thread in the block handles some distinct values. How can we ensure that ALL of these values are inserted into the map? For example, if we have a block of size 128 with each thread handling 4 items, we would want all 512 items to be inserted into the map. Do you have any recommendations for this kind of problem?

PramodShenoy avatar Nov 17 '22 21:11 PramodShenoy

we would want all 512 items to be inserted in the map

Are all 512 keys unique or not?

jrhemstad avatar Nov 17 '22 21:11 jrhemstad

The keys may not be unique, but the key-value pairs we insert are always unique. We would not want the same key-value pair inserted multiple times either, as that would affect a lot of other stuff.

PramodShenoy avatar Nov 17 '22 22:11 PramodShenoy

It sounds like you want to treat the entire key/value pair as if it were the key. In other words, it sounds like you want a set instead of a map.

jrhemstad avatar Nov 17 '22 22:11 jrhemstad

In an ideal scenario, we would want to store a key and a list of values associated with it. If we use a set, then we would have to store the key-value pair as a single entity, right?

PramodShenoy avatar Nov 17 '22 23:11 PramodShenoy

Hm, I'm confused.

If you want to store multiple values associated with the same key, then you'd want a multimap. But you also want to avoid inserting identical key/value pairs (but not identical keys)?

So if you were to insert the following pairs:

{13, 42}
{13, 1}
{9, 18}
{13, 42}

What would you want the contents of the map to be?

jrhemstad avatar Nov 18 '22 00:11 jrhemstad

Hi Jake,

This is Bobbi. Basically, we want to implement a hash join and try to use cuCollections to implement the hashing part. What we have is a pair of join key and row ID of the table. The join key is not guaranteed to be unique. The row ID is guaranteed to be unique. So for example, if we have:

{Join Key, Row ID}
{13, 42}
{13, 1}
{9, 18}
{13, 3}

We would want the contents of the hash table to be:

9  -> 18
13 -> 1, 3, 42

Can we achieve this with cuCollections? If not, what is the recommended approach for this?

Thanks

bwyogatama avatar Nov 18 '22 00:11 bwyogatama

Hm, I'm confused.

If you want to store multiple values associated with the same key, then you'd want a multimap. But you also want to avoid inserting identical key/value pairs (but not identical keys)?

So if you were to insert the following pairs:

{13, 42}
{13, 1}
{9, 18}
{13, 42}

What would you want the contents of the map to be?

In this case, we would want only one entry of {13, 42}; otherwise the duplicates would mess up other operations. Also, in the case of Bobbi's example, I think we would be fine if we can insert repeating keys separately like this:

9  -> 18
13 -> 1
13 -> 3
13 -> 42

PramodShenoy avatar Nov 18 '22 04:11 PramodShenoy

Basically, we want to implement a hash join

That would indeed require a multimap. A simple hash join (unpartitioned inner equi-join) A⋈B with cuco works as follows (a sketch follows the list):

    1. Build a cuco::static_multimap using table A: map.insert(A.begin(), A.end())
    2. Use the keys of table B and retrieve all of their matching values from A: map.retrieve(B.keys.begin(), B.keys.end(), result)
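A minimal host-side sketch of these two steps (a_pairs and b_keys are illustrative names, not part of cuco) -

// Hedged sketch of an unpartitioned hash join with cuco's bulk APIs.
// a_pairs holds table A's {join_key, row_id} pairs; b_keys holds table B's join keys.
#include <cuco/static_multimap.cuh>
#include <thrust/device_vector.h>

#include <cstddef>

void hash_join(thrust::device_vector<cuco::pair<int, int>> const& a_pairs,
               thrust::device_vector<int> const& b_keys)
{
    std::size_t const capacity = 2 * a_pairs.size();  // ~50% load factor
    cuco::static_multimap<int, int> map{capacity,
                                        cuco::sentinel::empty_key{-1},
                                        cuco::sentinel::empty_value{-1}};

    // 1. Build: insert all of table A's pairs.
    map.insert(a_pairs.begin(), a_pairs.end());

    // 2. Probe: count matches for table B's keys, then retrieve every matching pair.
    auto const num_matches = map.count(b_keys.begin(), b_keys.end());
    thrust::device_vector<cuco::pair<int, int>> result(num_matches);
    map.retrieve(b_keys.begin(), b_keys.end(), result.begin());
}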

sleeepyjack avatar Nov 18 '22 14:11 sleeepyjack

I think we would be fine if we can insert repeating keys separately like this:

9  -> 18
13 -> 1
13 -> 3
13 -> 42

Since cuco doesn't support a set yet, you can use a cuco::static_map, take the pair of join key and row ID as the hash map key, and discard the value part. A custom key-equal is required then.
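A rough sketch of what such a key-equal could look like (illustrative only; whether a pair type can actually serve as a static_map key depends on cuco's packing/atomic requirements) -

// Hedged sketch: treat {join_key, row_id} as the key and compare both fields.
struct pair_equal {
    __host__ __device__ bool operator()(cuco::pair<int, int> const& lhs,
                                        cuco::pair<int, int> const& rhs) const noexcept
    {
        return lhs.first == rhs.first && lhs.second == rhs.second;
    }
};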

PointKernel avatar Nov 18 '22 14:11 PointKernel

Thanks a lot for the suggestions!

Our system basically uses CUB-like block-wide functions with 128 threads, each thread handling 4 items. This is used for functions like loading a partition of data, applying predicates, etc. In terms of cooperative groups of size 8, we would have 16 groups, meaning 16 insertions into our hash table. Do you think there's a way we could do 512 insertions into the map without loading the selected portion of the dataset separately into memory?

The current implementation (without cuco) does something like this: each block loads a tile of the dataset into a shared array and applies predicates (eq, neq, gt, etc.). If a row is selected, we insert it into the hash table for the subsequent join. In case all 512 items per block are selected, this block would do 512 insertions, which would only be 16/32 via cuco::static_multimap<> when using CG size 8/4 respectively.
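For concreteness, a rough sketch of what we are trying to do (array names and the selected flags are assumptions on our side) -

// Hedged sketch: a block of 128 threads owns 512 items (4 per thread).
// Each CG of 8 threads strides over the block's 512 items so every selected
// item gets its own insert; 16 CGs thus cover all 512 insertions in a loop.
namespace cg = cooperative_groups;

constexpr int BLOCK_THREADS    = 128;
constexpr int ITEMS_PER_THREAD = 4;
constexpr int CG_SIZE          = 8;

__global__ void insertSelected(cuco::static_multimap<int, int>::device_mutable_view dv,
                               int const* keys, int const* values, bool const* selected)
{
    auto block = cg::this_thread_block();
    auto tile  = cg::tiled_partition<CG_SIZE>(block);

    constexpr int TILE_ITEMS = BLOCK_THREADS * ITEMS_PER_THREAD;  // 512 items per block
    int const block_offset   = blockIdx.x * TILE_ITEMS;
    int const num_cgs        = BLOCK_THREADS / CG_SIZE;           // 16 CGs per block
    int const cg_rank        = block.thread_rank() / CG_SIZE;

    for (int i = cg_rank; i < TILE_ITEMS; i += num_cgs) {
        int const idx = block_offset + i;
        if (selected[idx]) {                                      // all 8 threads in the CG agree
            cuco::pair<int, int> p;
            p.first  = keys[idx];
            p.second = values[idx];
            dv.insert(tile, p);
        }
    }
}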

PramodShenoy avatar Nov 18 '22 16:11 PramodShenoy

What we have is a pair of join key and row ID of the table. The join key is not guaranteed to be unique. The row ID is guaranteed to be unique.

Ah, okay. Now this makes more sense. So this is very similar to how cuDF's hash join works.

We build a hash table where the key is the hash of the row and the value is the row index. We insert every {row_hash, row_index} pair. There's no need to worry about duplicates because, like you said, every row_index is guaranteed to be unique.

Later, in the probing step, we use a special function, static_multimap::pair_retrieve, that considers the entire pair for equality. It uses a custom equality function that first compares the row_hash and, if those match, then does a full lexicographic comparison of the rows referenced by the row_index values.
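As an illustration of that two-step equality (names and the rows_are_equal helper are hypothetical, not cuDF's actual code) -

// Hedged sketch of a probe-side equality functor: cheap hash comparison first,
// full row comparison (by row index) only on a hash match.
struct row_equal {
    __device__ bool operator()(cuco::pair<unsigned int, int> const& probe,
                               cuco::pair<unsigned int, int> const& build) const
    {
        if (probe.first != build.first) { return false; }   // row_hash mismatch
        return rows_are_equal(probe.second, build.second);  // hypothetical lexicographic row compare
    }
};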

jrhemstad avatar Nov 18 '22 22:11 jrhemstad

Hi @jrhemstad @PointKernel ,

So it seems that it's possible to achieve this with cuCollections. I have a few questions as I try to understand the design.

  1. So in the multimap it seems that (by default) 8 threads in a group hash the same key into the hash table, and once one of these 8 threads manages to insert, the other 7 threads return. Yunsong mentioned that this improves performance as opposed to having each thread hash a different key into the hash table. Can someone explain to me why this improves performance?
  2. I'm trying to understand Jake's example here. What is row_hash in this case? Is it the join key or something else? And why do we need to match the entire pair for equality in the hash join? Shouldn't we only match the join key?

bwyogatama avatar Nov 22 '22 01:11 bwyogatama

  1. Check out @sleeepyjack 's paper on cooperative insertion/probing: https://arxiv.org/pdf/2009.07914.pdf
  2. row_hash is the hash value of the row. The strategy I described is necessary to support join criteria spanning multiple columns. If you're joining on multiple columns, then you can't fit the entire row into the key of the hash table. If you're just joining on a single column of ints, then you can avoid this complexity.

jrhemstad avatar Nov 22 '22 02:11 jrhemstad

Hi @PointKernel @jrhemstad

I was trying to use the device_view<> version of retrieve() and need some help. Is there an example usage of the API that I can refer to? It seems we need to use 2 cooperative groups - one for probing and another for flushing to the output buffer; is this the right understanding? Some more questions from the declaration in the header file -

  • What is the flushing_cg_counter and what type are we expecting?
  • Should num_matches be of std::atomic<> type?
  • There are 2 output parameters - output_buffer and output_begin. How are they different and is it necessary for them to be declared in shared memory?

Also, will retrieve() populate one item per CG, similar to insert?

Thanks a lot for your help!

PramodShenoy avatar Nov 25 '22 00:11 PramodShenoy

Is there an example usage of the API that I can refer to?

You could refer to the retrieve kernel for the use of device_view::retrieve. The retrieve kernel is invoked here.

It seems we need to use 2 cooperative groups - one for probing and another for flushing to the output buffer

Depending on the size of the key/value pairs:

  • If they are no bigger than 8 bytes, the implementation uses a per-warp shared memory buffer to write the output, i.e. the flushing CG (a full warp, size 32) is always larger than the probing CG (which could be of size 2, 4, 8, etc.), so multiple probing CGs write into the same warp-wide buffer.
  • Otherwise, the probing CG size and flushing CG size are the same, and each CG writes into its own buffer.

We introduced this complexity mainly for performance reasons, since performance for small pairs is more likely bounded by the flushing operation (writing the buffer, updating atomics, etc.). Thus we do a per-warp flush as opposed to a per-CG one, to reduce the number of flushing operations and improve performance.

  • What is the flushing_cg_counter and what type are we expecting?
  • Should num_matches be of std::atomic<> type?
  • There are 2 output parameters - output_buffer and output_begin. How are they different and is it necessary for them to be declared in shared memory?

Please check the retrieve kernel implementation.

Also, will retrieve() populate one item per CG, similar to insert?

Yes

PointKernel avatar Nov 25 '22 17:11 PointKernel