Yunsong Wang

Results 178 comments of Yunsong Wang

About 50% slower for `find` :scream:

I cannot reproduce the performance regression with my local RTX8000: ``` # static_set_find_unique_occupancy ## [0] Quadro RTX 8000 | Key | Distribution | Occupancy | Ref Time | Ref Noise...

> That would be a simple `if(std::distance(input_begin, input_end) == 0) return;`?

@sleeepyjack Yeah, an early exit like that in a variadic template.

Related to asynchronous size computation #102 @esoha-nvidia Thanks for reporting this. We are aware of this issue and it will be addressed during our refactoring work #110.

Update: this is still an experimental feature that requires defining `LIBCUDACXX_ENABLE_EXPERIMENTAL_MEMORY_RESOURCE`: https://godbolt.org/z/jfWsjarTz

ok to test

We do provide performance guidance in the probing sequence doc, e.g.:

- https://github.com/NVIDIA/cuCollections/blob/4bdf6063de349be7af8da987cea743aa88e28470/include/cuco/probe_sequences.cuh#L26-L28
- https://github.com/NVIDIA/cuCollections/blob/4bdf6063de349be7af8da987cea743aa88e28470/include/cuco/probe_sequences.cuh#L52-L55

Having a performance tuning section in `README` doesn't seem right.

@PramodShenoy Thanks for reporting this. ~~`get_size` will not work properly if device-view `insert` is directly invoked by users. Currently, it's the user's responsibility to update `size` counter if they use...

@PramodShenoy To insert `n` keys with the CG algorithm, we need `n * CGSize` threads. The implementation is designed in a way that if CG size = 8, 8 threads...
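As a rough illustration of the `n * CGSize` thread count mentioned above, the launch size can be worked out like this. The helper names and the block size are assumptions chosen for the example, not part of the library.

```cpp
#include <cstddef>

// Hypothetical helper: total threads needed when each key is handled by one
// cooperative group of `cg_size` threads.
constexpr std::size_t threads_needed(std::size_t num_keys, std::size_t cg_size)
{
  return num_keys * cg_size;
}

// Hypothetical helper: number of thread blocks, rounding up so that every
// key gets a full group. `block_size` is an assumed launch parameter.
constexpr std::size_t grid_size(std::size_t num_keys,
                                std::size_t cg_size,
                                std::size_t block_size)
{
  return (threads_needed(num_keys, cg_size) + block_size - 1) / block_size;
}

// Example: 1000 keys with CG size 8 need 8000 threads, which fit in
// 63 blocks of 128 threads (63 * 128 = 8064).
static_assert(threads_needed(1000, 8) == 8000, "n * CGSize threads");
static_assert(grid_size(1000, 8, 128) == 63, "ceil(8000 / 128)");
```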

The optimal CG size really depends on your use case. The general guideline is to use a large CG (size `4` or `8`) if occupancy (> 50%) or multiplicity is...