Adds DeviceBatchMemcpy algorithm and tests
Algorithm Overview
The `DeviceBatchMemcpy` takes `N` input buffers and `N` output buffers and copies `buffer_size[i]` bytes from the `i`-th input buffer to the `i`-th output buffer. If any input buffer aliases memory from any output buffer, the behavior is undefined. If any output buffer aliases memory of another output buffer, the behavior is undefined. Input buffers can alias one another.
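The semantics above can be pinned down with a host-side reference (an illustrative sketch, not the CUB implementation; the function name is made up for this example):

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Host-side reference of the batched-memcpy semantics: copy
// buffer_size[i] bytes from the i-th input buffer to the i-th output
// buffer. Input and output buffers must not alias each other; input
// buffers may alias one another.
inline void batch_memcpy_reference(const std::vector<const void*>& in,
                                   const std::vector<void*>& out,
                                   const std::vector<std::size_t>& buffer_size)
{
  for (std::size_t i = 0; i < buffer_size.size(); ++i) {
    std::memcpy(out[i], in[i], buffer_size[i]);
  }
}
```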
Implementation Details
We distinguish each buffer by its size and assign it to one of three size classes:

- Thread-level buffer (TLEV buffer). A buffer that is processed by one or more threads but not a whole warp (e.g., up to 32 bytes).
- Warp-level buffer (WLEV buffer). A buffer that is processed by a whole warp (e.g., above 32 bytes but only up to 1024 bytes).
- Block-level buffer (BLEV buffer). A buffer that is processed by one or more thread blocks. The number of thread blocks assigned to such a buffer is proportional to its size (e.g., all buffers above 1024 bytes).
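The classification can be sketched as a simple thresholding function (the 32 B and 1024 B cut-offs are the example values from the text, not necessarily the tuned constants):

```cpp
#include <cassert>
#include <cstddef>

// Size classes as described above; thresholds are the example values
// from the text (up to 32 B -> TLEV, up to 1024 B -> WLEV, else BLEV).
enum class SizeClass { TLEV, WLEV, BLEV };

inline SizeClass classify(std::size_t buffer_size)
{
  if (buffer_size <= 32)   return SizeClass::TLEV;
  if (buffer_size <= 1024) return SizeClass::WLEV;
  return SizeClass::BLEV;
}
```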
Step 1: Partitioning Buffers by Size
- Each thread block loads a tile of `buffer_size[i]`.
- Threads compute a three-bin histogram over their assigned `buffer_size[ITEMS_PER_THREAD]` chunk, binning buffers by the size class they fall into.
- An exclusive prefix sum is computed over the histograms. The prefix sum's aggregate reflects the number of buffers that fall into each size class. The prefix sum of each thread corresponds to the relative offset within each partition.
- Scatter the buffers into their partition. For each buffer, we scatter the tuple `{tile_buffer_id, buffer_size}`, where `tile_buffer_id` is the buffer id relative to the tile (i.e., from the interval `[0, TILE_SIZE)`), and `buffer_size` is only defined for buffers that belong to the `tlev` partition, in which case it corresponds to the buffer's size (number of bytes).
Before partitioning:

tile_buffer_id | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7
---|---|---|---|---|---|---|---|---
tile_buffer_sizes | 3 | 37 | 17 | 4 | 9 | 4242 | 11 | 2000

After partitioning (T = TLEV, W = WLEV, B = BLEV):

size class | T | T | T | T | T | W | B | B
---|---|---|---|---|---|---|---|---
tile_buffer_id | 0 | 2 | 3 | 4 | 6 | 1 | 5 | 7
tile_buffer_size | 3 | 17 | 4 | 9 | 11 | - | - | -
Note that the partitioning does not need to be stable. Stability may be desirable if we expect neighbouring buffers to hold neighbouring byte segments.
After the partitioning, each partition represents all the buffers that belong to the respective size class (i.e., one of `TLEV`, `WLEV`, `BLEV`). Depending on the size class, a different logic is applied. We process each partition separately.
Step 2.a: Copying TLEV Buffers
Usually, TLEV buffers are buffers of only a few bytes. Vectorised loads and stores do not really pay off here, as there are only few bytes that can actually be read from a four-byte-aligned address. It does not pay off to have two different code paths for (a) loading individual bytes from non-aligned addresses and (b) doing vectorised loads from aligned addresses.
Instead, we use the `BlockRunLengthDecode` algorithm to both (a) coalesce reads and writes and (b) load-balance the number of bytes copied by each thread. Specifically, we are able to assign neighbouring bytes to neighbouring threads.
The following tables illustrate how the first 8 bytes from the TLEV buffers are assigned to threads.

TLEV partition (input to the run-length decode):

tile_buffer_id | 0 | 2 | 3 | 4 | 6
---|---|---|---|---|---
tile_buffer_size | 3 | 17 | 4 | 9 | 11

After the run-length decode [1]:

thread | t0 | t1 | t2 | t3 | t4 | t5 | t6 | t7
---|---|---|---|---|---|---|---|---
buffer_id | 0 | 0 | 0 | 2 | 2 | 2 | 2 | 2
byte_of_buffer | 0 | 1 | 2 | 0 | 1 | 2 | 3 | 4
[1] We use `BlockRunLengthDecode` with the `tile_buffer_id` as the "`unique_items`" and each buffer's size as the respective run's length. The result from the run-length decode yields the assignment from threads to buffers, along with the specific byte from that buffer.
Step 2.b: Copying WLEV Buffers
A full warp is assigned to each WLEV buffer. Loads from the input buffer are vectorised (aliased to a wider data type), loading 4, 8, or even 16 bytes at a time, starting from the input buffer's first address that is aligned to the aliased data type. The implementation of the vectorised copy is based on @gaohao95's (thanks!) string gather improvement in https://github.com/rapidsai/cudf/pull/7980/files
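The alignment handling can be sketched on the host like this (illustrative only; the GPU version distributes the chunks across a warp, and `memcpy` stands in for the aliased loads/stores to keep the sketch portable and free of strict-aliasing issues):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Copy the unaligned head byte-wise up to the first 4-byte-aligned input
// address, then the bulk in 4-byte chunks, then the tail byte-wise.
inline void vectorised_copy(unsigned char* dst, const unsigned char* src,
                            std::size_t n)
{
  std::size_t head = (4 - reinterpret_cast<std::uintptr_t>(src) % 4) % 4;
  if (head > n) head = n;
  for (std::size_t i = 0; i < head; ++i) dst[i] = src[i];

  std::size_t body = (n - head) / 4 * 4;
  for (std::size_t i = head; i < head + body; i += 4) {
    std::uint32_t chunk;
    std::memcpy(&chunk, src + i, 4);  // 4-byte "aliased" load
    std::memcpy(dst + i, &chunk, 4);  // dst may still be unaligned
  }
  for (std::size_t i = head + body; i < n; ++i) dst[i] = src[i];
}
```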
I think we want to have the vectorised copy as a reusable component. But I wanted to coordinate on what exactly that would look like first. Should this be (a) a warp-/block-level copy or should we (b) separate it into a warp-&block-level vectorised load (which will also have the async copy, maybe) and a warp-&block-level vectorised store?
Step 2.c: Enqueueing BLEV Buffers
These are buffers that may be very large. We want to avoid a scenario where there's potentially one very large buffer that a single thread block is copying while other thread blocks are sitting idle. To avoid this, BLEV buffers will be put into a queue that will be picked up in a subsequent kernel. In the subsequent kernel, the number of thread blocks getting assigned to each buffer is proportional to the buffer's size.
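The proportional assignment boils down to a ceiling division (a sketch; `tile_size` is an illustrative value, not the tuned CUB constant):

```cpp
#include <cassert>
#include <cstddef>

// Number of tiles (and hence thread blocks) assigned to an enqueued
// BLEV buffer: proportional to its size, rounded up so every byte is
// covered.
inline std::size_t num_tiles_for(std::size_t buffer_size,
                                 std::size_t tile_size = 1024)
{
  return (buffer_size + tile_size - 1) / tile_size;  // ceil division
}
```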
Thanks for the feedback and the preliminary evaluation, @senior-zero 👍
Fundamentally, our ideas are quite similar. You do a three-way partition on all the problems. I proposed to have a kernel-fused version of the "three-way partitioning" that is fused with the implementation for copying small and medium buffers. The goal being that we can solve small and medium buffers straight in the kernel instead of having to write them into a "queue" first and later read them back in. I wanted to circumvent the extra reads of the problems' sizes and writing their id out, as well as another extra read of the partitioned id. This definitely makes the implementation more complex and, I totally agree, I'm not sure if that complexity is worth it.
When I had conceived this, I assumed the "worst case" scenario. In theory, let's assume these type sizes: `buffer_src`, `buffer_dst`, and `buffer_size` are each 4 bytes. The average buffer size is 4 bytes. For N buffers, the fused version incurs (4 + 4 + 4 + 2 * 4) * N memory transfers. If we did a preliminary three-way partition upfront, it would be (4 + 4) * N + (4 + 4 + 4 + 4 + 2 * 4) * N. So 20 bytes per buffer versus 32 bytes per buffer: basically an extra read of `buffer_size`, a write of `buffer_id`, and another read of `buffer_id`.
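Spelling out the arithmetic from that comparison (same assumptions: 4-byte pointers and sizes, 4-byte average buffer, so the payload costs 2 * 4 bytes for one read plus one write):

```cpp
#include <cassert>
#include <cstddef>

// Fused: read src ptr, dst ptr, size once, plus the payload copy.
constexpr std::size_t fused_bytes_per_buffer = 4 + 4 + 4 + 2 * 4;

// Upfront partition: an extra pass that reads the size and writes the
// buffer id, then the main pass additionally re-reads the buffer id.
constexpr std::size_t partitioned_bytes_per_buffer =
    (4 + 4) + (4 + 4 + 4 + 4 + 2 * 4);
```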
Now, we also see that, unfortunately, we cannot sustain anywhere near peak memory bandwidth for such tiny buffers. So the question is whether we want to take the theoretical model into consideration at all.
I see three decisions we need to make: (1) I think whether to kernel-fuse or to not kernel-fuse is the key decision we have to make. We'll probably need an apples-to-apples comparison that has identical implementations for the "small" buffer logic to see the performance difference. I'll try to evaluate this in the coming days. Then we can make an educated decision about code complexity versus performance.
(2) What I also like is using atomics for the scheduling/load-balancing of large buffers. The performance drop you see going from 1KB to 2KB buffers is a combination of a configuration discrepancy (my bad) and general performance regression when the tile size (or "task" size, i.e., the most granular unit getting assigned to thread blocks) is too small. The binary search seems to dominate in that case. I also want to see if streaming reads and writes will alleviate this. So we'll also need to compare these two mechanisms and factor out other side effects too.
(3) What is left is the actual implementation of how we're copying small buffers, medium buffers, and large buffers, respectively. I think it is easy to exchange one for the other. Once we've figured out the former two decisions, this will be easy.
So I would proceed in that order. Does that sound good?
As for:
> uses two times less memory
That can easily be done for the kernel-fused version too, right? It's just a matter of trading memory for more coalesced accesses. I.e., I'm materialising the buffer's source and destination pointers for large buffers instead of having the indirection. I'm also fine to have indirection in this particular case.
> is faster in some cases
I'm all in for fast 😁 We just need to have a more differentiated and elaborate evaluation to track down where the difference actually comes from.
> almost fits into the screen 😄
💯
> uses existing facilities.
I'm all in for using existing building blocks. The problem is that I didn't assume the pointers to be aligned and so had to devise special treatment to be able to vectorise some loads/stores. If we can get the performance from existing building blocks, let's go for that. Otherwise let's make it a reusable building block.
I've long wanted a `cuda::memcpy` that would handle runtime-determined alignment as well as take a CG parameter to use multiple threads to perform the copy. That seems like the best place to put such a building block, as it could have widespread applicability.
I'm currently gathering results of a few more benchmarks that hopefully will help us make an informed decision about which of the scheduling mechanisms to pursue (preliminary three-way partition vs. single-pass prefix scan-based). I'll post the results shortly.
In the meanwhile, PR https://github.com/NVIDIA/cub/pull/354, on which this PR builds, should be ready for review.
FYI, I'm starting the 1.15 RC next week so I'm bumping this to 1.16. I'll try to get to NVIDIA/cccl#1006 before the release.
So I ran the first batch of benchmarks. I'll add more throughout the week.
Methodology
- Benchmarks ran on V100
- We allocate two large buffers on device memory: one for the input, one for the output
- We generate an array of `buffer_sizes`. Buffer sizes are uniform random in the interval `[<Min. buffer size>, <max. buffer size>]`.
- We generate an offsets array for the input buffer batch, which will alias into the input memory allocation, and an offsets array for the output buffer batch, which will alias into the output memory allocation.
- These offsets can be generated in one of two ways (depending on the experiment):
  - `CONSECUTIVE` (`C`): `offset[0] = 0; offset[i] = offset[i-1] + buffer_sizes[i-1];`
  - `SHUFFLE` (`S`): the offsets are "somewhat" similar to `CONSECUTIVE`, but then the offsets are shuffled. That makes sure that bytes of `buffer[i]` are at a different location than `buffer[j]` for `i != j`.
- Further, offsets and sizes are made to comply with a configurable `AtomicT`. That is, offsets will be aligned to integer multiples of `AtomicT`, and `buffer_sizes` will likewise be integer multiples of `AtomicT`.
- The charts label the achieved memory throughput on the y-axis (i.e., all the required memory transfers, such as reading buffer sizes, reading buffer offsets, reading the bytes to be copied, and writing the bytes to be copied, divided by the total run time).
- The charts label the input on the x-axis: `<INPUT-OFFSET-GEN>_<OUTPUT-OFFSET-GEN>_<Min. buffer size>_<max. buffer size>`
  - For instance, the label `C_S_1_8` means:
    - `C`: the input will be consecutive buffers
    - `S`: the output buffers are shuffled ("random" writes)
    - `1`: the minimum buffer size is 1
    - `8`: the maximum buffer size is 8
- Generally, we compared different aspects of the three-way partition implementation (TWP) (see here) versus the single-pass prefix scan-based implementation (SPPS) (this PR, originally).
- This is the branch that the benchmarks are run on:
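The two offset-generation modes described above can be sketched as follows (illustrative only; the function name and seed are made up for this example):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// CONSECUTIVE lays the buffers out back to back via an exclusive prefix
// sum over the sizes; SHUFFLE additionally permutes the offsets so that
// buffer i and buffer j land in unrelated locations.
inline std::vector<std::size_t>
make_offsets(const std::vector<std::size_t>& buffer_sizes, bool shuffle,
             unsigned seed = 42)
{
  std::vector<std::size_t> offsets(buffer_sizes.size(), 0);
  for (std::size_t i = 1; i < offsets.size(); ++i)
    offsets[i] = offsets[i - 1] + buffer_sizes[i - 1];
  if (shuffle) {
    std::mt19937 rng(seed);
    std::shuffle(offsets.begin(), offsets.end(), rng);
  }
  return offsets;
}
```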
Compilation example / details

```shell
nvcc -DTWP_TLEV_CPY=0 -DLD_VECTORIZED=0 -DATOMIC_CPY_TYPE=uint8_t -DTLEV_ALIAS_TYPE=uint8_t -DWLEV_MIN_SIZE=17000 -DBLEV_MIN_SIZE=17000 -Xptxas -v -lineinfo --generate-code arch=compute_70,code=sm_70 -DTHRUST_IGNORE_CUB_VERSION_CHECK -I<your-thrust-path> -I<your-cub-path> test_device_batch_memcpy.cu -o test_memcpy && ./test_memcpy
```
- `TWP_TLEV_CPY`: whether to use TWP's small buffer copying logic inside of SPPS
- `LD_VECTORIZED`: whether to enable CUB vectorized loads inside TWP's copy logic
- `ATOMIC_CPY_TYPE`: buffers will be aligned and their size an integer multiple of this type
- `TLEV_ALIAS_TYPE`: the data type being copied. This may not exceed `ATOMIC_CPY_TYPE`
Copying of small buffers logic

- For these tests, the thresholds for medium (aka "WLEV") and large (aka "BLEV") buffers were set so high that all buffers would be copied by the `copy small buffer` or `copy TLEV buffer` logic, respectively.
- The benchmarks serve two purposes:
  - identify which implementation to choose for the `copy small buffer` logic
  - get an initial idea of the "scheduling overhead" (the "scheduling" being the logic that partitions the buffers into "small", "medium", and "large" buffers)
- Both implementations were adapted and made configurable to perform aliased loads of `TLEV_ALIAS_TYPE`. Various `TLEV_ALIAS_TYPE`s were tested.
- The "copy small buffer logic" from TWP was ported into SPPS. This allowed us to factor out performance differences due to scheduling differences. Similarly, it allowed comparing the scheduling overhead.
No Aliased Loads, No Buffer Size Variance
using AtomicT=uint8_t; using TLEV_ALIAS_TYPE=uint8_t;
Data
Min. buffer size: | max. buffer size: | in_gen: | out_gen: | src size: | dst size: | sizes size: | data_size: | total: | duration (SPPS): | BW (SPPS): | duration (TWP): | BW (TWP): | relative performance |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | 2 | CONSECUTIVE | CONSECUTIVE | 0.357914 | 0.357914 | 0.357914 | 0.333333 | 1.333333 | 4.621500 | 309.781000 | 9.954430 | 143.821000 | 46.43% |
4 | 4 | CONSECUTIVE | CONSECUTIVE | 0.214748 | 0.214748 | 0.214748 | 0.400000 | 1.000000 | 3.467490 | 309.660000 | 6.699330 | 160.276000 | 51.76% |
8 | 8 | CONSECUTIVE | CONSECUTIVE | 0.119305 | 0.119305 | 0.119305 | 0.444444 | 0.777778 | 2.687550 | 310.741000 | 3.803490 | 219.570000 | 70.66% |
16 | 16 | CONSECUTIVE | CONSECUTIVE | 0.063161 | 0.063161 | 0.063161 | 0.470588 | 0.647059 | 2.201060 | 315.655000 | 2.124030 | 327.102000 | 103.63% |
32 | 32 | CONSECUTIVE | CONSECUTIVE | 0.032538 | 0.032538 | 0.032538 | 0.484848 | 0.575758 | 1.669120 | 370.384000 | 1.985600 | 311.349000 | 84.06% |
64 | 64 | CONSECUTIVE | CONSECUTIVE | 0.016519 | 0.016519 | 0.016519 | 0.492308 | 0.538462 | 1.382300 | 418.264000 | 2.191580 | 263.813000 | 63.07% |
128 | 128 | CONSECUTIVE | CONSECUTIVE | 0.008324 | 0.008324 | 0.008324 | 0.496124 | 0.519380 | 1.289440 | 432.498000 | 3.027520 | 184.204000 | 42.59% |
256 | 256 | CONSECUTIVE | CONSECUTIVE | 0.004178 | 0.004178 | 0.004178 | 0.498054 | 0.509727 | 1.540160 | 355.363000 | 3.386910 | 161.597000 | 45.47% |
512 | 512 | CONSECUTIVE | CONSECUTIVE | 0.002093 | 0.002093 | 0.002093 | 0.499024 | 0.504872 | 1.684000 | 321.914000 | 3.484130 | 155.592000 | 48.33% |
1024 | 1024 | CONSECUTIVE | CONSECUTIVE | 0.001048 | 0.001048 | 0.001048 | 0.499512 | 0.502439 | 1.548160 | 348.471000 | 5.698080 | 94.679100 | 27.17% |
4096 | 4096 | CONSECUTIVE | CONSECUTIVE | 0.000262 | 0.000262 | 0.000262 | 0.499878 | 0.500610 | 3.697090 | 145.392000 | 5.926180 | 90.703700 | 62.39% |
2 | 2 | CONSECUTIVE | SHFL | 0.357914 | 0.357914 | 0.357914 | 0.333333 | 1.333333 | 49.238400 | 29.076000 | 56.410400 | 25.379300 | 87.29% |
4 | 4 | CONSECUTIVE | SHFL | 0.214748 | 0.214748 | 0.214748 | 0.400000 | 1.000000 | 29.612200 | 36.260200 | 34.172600 | 31.421100 | 86.65% |
8 | 8 | CONSECUTIVE | SHFL | 0.119305 | 0.119305 | 0.119305 | 0.444444 | 0.777778 | 17.402500 | 47.989100 | 20.556500 | 40.626100 | 84.66% |
16 | 16 | CONSECUTIVE | SHFL | 0.063161 | 0.063161 | 0.063161 | 0.470588 | 0.647059 | 9.428540 | 73.688400 | 10.772400 | 64.495900 | 87.53% |
32 | 32 | CONSECUTIVE | SHFL | 0.032538 | 0.032538 | 0.032538 | 0.484848 | 0.575758 | 3.217150 | 192.162000 | 8.831460 | 70.001500 | 36.43% |
64 | 64 | CONSECUTIVE | SHFL | 0.016519 | 0.016519 | 0.016519 | 0.492308 | 0.538462 | 1.791100 | 322.800000 | 8.575740 | 67.419100 | 20.89% |
128 | 128 | CONSECUTIVE | SHFL | 0.008324 | 0.008324 | 0.008324 | 0.496124 | 0.519380 | 1.414910 | 394.145000 | 8.726940 | 63.903200 | 16.21% |
256 | 256 | CONSECUTIVE | SHFL | 0.004178 | 0.004178 | 0.004178 | 0.498054 | 0.509727 | 1.517700 | 360.623000 | 8.655330 | 63.234500 | 17.53% |
512 | 512 | CONSECUTIVE | SHFL | 0.002093 | 0.002093 | 0.002093 | 0.499024 | 0.504872 | 1.620580 | 334.512000 | 8.639520 | 62.746800 | 18.76% |
1024 | 1024 | CONSECUTIVE | SHFL | 0.001048 | 0.001048 | 0.001048 | 0.499512 | 0.502439 | 1.554850 | 346.972000 | 8.712640 | 61.920300 | 17.85% |
4096 | 4096 | CONSECUTIVE | SHFL | 0.000262 | 0.000262 | 0.000262 | 0.499878 | 0.500610 | 3.682690 | 145.960000 | 8.768610 | 61.301200 | 42.00% |
2 | 2 | SHFL | CONSECUTIVE | 0.357914 | 0.357914 | 0.357914 | 0.333333 | 1.333333 | 18.644400 | 76.787500 | 26.246000 | 54.547600 | 71.04% |
4 | 4 | SHFL | CONSECUTIVE | 0.214748 | 0.214748 | 0.214748 | 0.400000 | 1.000000 | 12.255600 | 87.612000 | 16.030800 | 66.979900 | 76.45% |
8 | 8 | SHFL | CONSECUTIVE | 0.119305 | 0.119305 | 0.119305 | 0.444444 | 0.777778 | 7.418370 | 112.576000 | 9.342270 | 89.392900 | 79.41% |
16 | 16 | SHFL | CONSECUTIVE | 0.063161 | 0.063161 | 0.063161 | 0.470588 | 0.647059 | 4.461890 | 155.713000 | 5.186690 | 133.953000 | 86.03% |
32 | 32 | SHFL | CONSECUTIVE | 0.032538 | 0.032538 | 0.032538 | 0.484848 | 0.575758 | 2.757440 | 224.199000 | 3.560380 | 173.637000 | 77.45% |
64 | 64 | SHFL | CONSECUTIVE | 0.016519 | 0.016519 | 0.016519 | 0.492308 | 0.538462 | 1.771780 | 326.322000 | 3.488540 | 165.734000 | 50.79% |
128 | 128 | SHFL | CONSECUTIVE | 0.008324 | 0.008324 | 0.008324 | 0.496124 | 0.519380 | 1.511550 | 368.945000 | 4.146560 | 134.492000 | 36.45% |
256 | 256 | SHFL | CONSECUTIVE | 0.004178 | 0.004178 | 0.004178 | 0.498054 | 0.509727 | 1.510910 | 362.242000 | 4.460800 | 122.694000 | 33.87% |
512 | 512 | SHFL | CONSECUTIVE | 0.002093 | 0.002093 | 0.002093 | 0.499024 | 0.504872 | 1.647780 | 328.990000 | 4.812100 | 112.654000 | 34.24% |
1024 | 1024 | SHFL | CONSECUTIVE | 0.001048 | 0.001048 | 0.001048 | 0.499512 | 0.502439 | 1.550660 | 347.910000 | 6.096960 | 88.485000 | 25.43% |
4096 | 4096 | SHFL | CONSECUTIVE | 0.000262 | 0.000262 | 0.000262 | 0.499878 | 0.500610 | 3.689700 | 145.683000 | 5.451840 | 98.595400 | 67.68% |
2 | 2 | SHFL | SHFL | 0.357914 | 0.357914 | 0.357914 | 0.333333 | 1.333333 | 55.553100 | 25.770900 | 62.612900 | 22.865200 | 88.72% |
4 | 4 | SHFL | SHFL | 0.214748 | 0.214748 | 0.214748 | 0.400000 | 1.000000 | 33.896400 | 31.677200 | 41.284300 | 26.008500 | 82.10% |
8 | 8 | SHFL | SHFL | 0.119305 | 0.119305 | 0.119305 | 0.444444 | 0.777778 | 19.720600 | 42.348100 | 22.110000 | 37.771800 | 89.19% |
16 | 16 | SHFL | SHFL | 0.063161 | 0.063161 | 0.063161 | 0.470588 | 0.647059 | 10.819900 | 64.212600 | 12.307200 | 56.452800 | 87.92% |
32 | 32 | SHFL | SHFL | 0.032538 | 0.032538 | 0.032538 | 0.484848 | 0.575758 | 4.889980 | 126.425000 | 11.500200 | 53.757100 | 42.52% |
64 | 64 | SHFL | SHFL | 0.016519 | 0.016519 | 0.016519 | 0.492308 | 0.538462 | 2.364290 | 244.542000 | 11.406400 | 50.688100 | 20.73% |
128 | 128 | SHFL | SHFL | 0.008324 | 0.008324 | 0.008324 | 0.496124 | 0.519380 | 1.471460 | 378.999000 | 11.314400 | 49.289500 | 13.01% |
256 | 256 | SHFL | SHFL | 0.004178 | 0.004178 | 0.004178 | 0.498054 | 0.509727 | 1.523420 | 359.267000 | 11.319400 | 48.351900 | 13.46% |
512 | 512 | SHFL | SHFL | 0.002093 | 0.002093 | 0.002093 | 0.499024 | 0.504872 | 1.624220 | 333.761000 | 11.340800 | 47.801000 | 14.32% |
1024 | 1024 | SHFL | SHFL | 0.001048 | 0.001048 | 0.001048 | 0.499512 | 0.502439 | 1.547520 | 348.615000 | 11.365800 | 47.465900 | 13.62% |
4096 | 4096 | SHFL | SHFL | 0.000262 | 0.000262 | 0.000262 | 0.499878 | 0.500610 | 3.674820 | 146.273000 | 11.286400 | 47.625900 | 32.56% |
Scheduling: TWP vs. SPPS; No Aliased Loads, No Buffer Size Variance
using AtomicT=uint8_t; using TLEV_ALIAS_TYPE=uint8_t;
Here, the small buffer copying logic from TWP was moved into SPPS. Hence, we aim to limit the difference to be the scheduling (i.e., the partitioning into small, medium, and large buffers).
Data
Min. buffer size: | max. buffer size: | in_gen: | out_gen: | src size: | dst size: | sizes size: | data_size: | total: | duration (SPPS): | BW (SPPS): | duration (TWP): | BW (TWP): | relative performance |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | 2 | CONSECUTIVE | CONSECUTIVE | 0.357914 | 0.357914 | 0.357914 | 0.333333 | 1.333333 | 5.986530 | 239.146000 | 10.124700 | 141.403000 | 59.13% |
4 | 4 | CONSECUTIVE | CONSECUTIVE | 0.214748 | 0.214748 | 0.214748 | 0.400000 | 1.000000 | 4.014370 | 267.475000 | 6.655810 | 161.324000 | 60.31% |
8 | 8 | CONSECUTIVE | CONSECUTIVE | 0.119305 | 0.119305 | 0.119305 | 0.444444 | 0.777778 | 2.444420 | 341.649000 | 3.843840 | 217.265000 | 63.59% |
16 | 16 | CONSECUTIVE | CONSECUTIVE | 0.063161 | 0.063161 | 0.063161 | 0.470588 | 0.647059 | 1.522910 | 456.214000 | 2.134660 | 325.474000 | 71.34% |
32 | 32 | CONSECUTIVE | CONSECUTIVE | 0.032538 | 0.032538 | 0.032538 | 0.484848 | 0.575758 | 1.891680 | 326.807000 | 2.098430 | 294.608000 | 90.15% |
64 | 64 | CONSECUTIVE | CONSECUTIVE | 0.016519 | 0.016519 | 0.016519 | 0.492308 | 0.538462 | 2.169700 | 266.474000 | 2.306210 | 250.701000 | 94.08% |
128 | 128 | CONSECUTIVE | CONSECUTIVE | 0.008324 | 0.008324 | 0.008324 | 0.496124 | 0.519380 | 3.181600 | 175.283000 | 3.193500 | 174.629000 | 99.63% |
256 | 256 | CONSECUTIVE | CONSECUTIVE | 0.004178 | 0.004178 | 0.004178 | 0.498054 | 0.509727 | 3.356580 | 163.058000 | 3.381950 | 161.834000 | 99.25% |
512 | 512 | CONSECUTIVE | CONSECUTIVE | 0.002093 | 0.002093 | 0.002093 | 0.499024 | 0.504872 | 4.032830 | 134.422000 | 3.437120 | 157.720000 | 117.33% |
1024 | 1024 | CONSECUTIVE | CONSECUTIVE | 0.001048 | 0.001048 | 0.001048 | 0.499512 | 0.502439 | 4.960060 | 108.767000 | 5.690910 | 94.798400 | 87.16% |
4096 | 4096 | CONSECUTIVE | CONSECUTIVE | 0.000262 | 0.000262 | 0.000262 | 0.499878 | 0.500610 | 4.012350 | 133.968000 | 5.090080 | 105.603000 | 78.83% |
2 | 2 | CONSECUTIVE | SHFL | 0.357914 | 0.357914 | 0.357914 | 0.333333 | 1.333333 | 50.586300 | 28.301300 | 56.376300 | 25.394600 | 89.73% |
4 | 4 | CONSECUTIVE | SHFL | 0.214748 | 0.214748 | 0.214748 | 0.400000 | 1.000000 | 31.578500 | 34.002300 | 34.485900 | 31.135700 | 91.57% |
8 | 8 | CONSECUTIVE | SHFL | 0.119305 | 0.119305 | 0.119305 | 0.444444 | 0.777778 | 18.002200 | 46.390600 | 20.469100 | 40.799700 | 87.95% |
16 | 16 | CONSECUTIVE | SHFL | 0.063161 | 0.063161 | 0.063161 | 0.470588 | 0.647059 | 9.599460 | 72.376400 | 10.282200 | 67.570300 | 93.36% |
32 | 32 | CONSECUTIVE | SHFL | 0.032538 | 0.032538 | 0.032538 | 0.484848 | 0.575758 | 7.941060 | 77.850500 | 8.469500 | 72.993000 | 93.76% |
64 | 64 | CONSECUTIVE | SHFL | 0.016519 | 0.016519 | 0.016519 | 0.492308 | 0.538462 | 8.077950 | 71.573700 | 7.940700 | 72.810800 | 101.73% |
128 | 128 | CONSECUTIVE | SHFL | 0.008324 | 0.008324 | 0.008324 | 0.496124 | 0.519380 | 8.560700 | 65.144200 | 8.692510 | 64.156400 | 98.48% |
256 | 256 | CONSECUTIVE | SHFL | 0.004178 | 0.004178 | 0.004178 | 0.498054 | 0.509727 | 8.261570 | 66.248400 | 8.651580 | 63.261900 | 95.49% |
512 | 512 | CONSECUTIVE | SHFL | 0.002093 | 0.002093 | 0.002093 | 0.499024 | 0.504872 | 8.394620 | 64.577300 | 8.646400 | 62.696900 | 97.09% |
1024 | 1024 | CONSECUTIVE | SHFL | 0.001048 | 0.001048 | 0.001048 | 0.499512 | 0.502439 | 8.386400 | 64.329100 | 8.718530 | 61.878500 | 96.19% |
4096 | 4096 | CONSECUTIVE | SHFL | 0.000262 | 0.000262 | 0.000262 | 0.499878 | 0.500610 | 7.356190 | 73.071200 | 8.759580 | 61.364300 | 83.98% |
2 | 2 | SHFL | CONSECUTIVE | 0.357914 | 0.357914 | 0.357914 | 0.333333 | 1.333333 | 20.325500 | 70.436500 | 26.223800 | 54.593700 | 77.51% |
4 | 4 | SHFL | CONSECUTIVE | 0.214748 | 0.214748 | 0.214748 | 0.400000 | 1.000000 | 13.297300 | 80.749000 | 15.982300 | 67.183200 | 83.20% |
8 | 8 | SHFL | CONSECUTIVE | 0.119305 | 0.119305 | 0.119305 | 0.444444 | 0.777778 | 7.824000 | 106.740000 | 9.367580 | 89.151300 | 83.52% |
16 | 16 | SHFL | CONSECUTIVE | 0.063161 | 0.063161 | 0.063161 | 0.470588 | 0.647059 | 4.461790 | 155.716000 | 5.174910 | 134.258000 | 86.22% |
32 | 32 | SHFL | CONSECUTIVE | 0.032538 | 0.032538 | 0.032538 | 0.484848 | 0.575758 | 3.184740 | 194.118000 | 3.553890 | 173.955000 | 89.61% |
64 | 64 | SHFL | CONSECUTIVE | 0.016519 | 0.016519 | 0.016519 | 0.492308 | 0.538462 | 3.536380 | 163.491000 | 3.474500 | 166.404000 | 101.78% |
128 | 128 | SHFL | CONSECUTIVE | 0.008324 | 0.008324 | 0.008324 | 0.496124 | 0.519380 | 4.170270 | 133.727000 | 4.114560 | 135.538000 | 101.35% |
256 | 256 | SHFL | CONSECUTIVE | 0.004178 | 0.004178 | 0.004178 | 0.498054 | 0.509727 | 5.023330 | 108.955000 | 4.459870 | 122.720000 | 112.63% |
512 | 512 | SHFL | CONSECUTIVE | 0.002093 | 0.002093 | 0.002093 | 0.499024 | 0.504872 | 6.173220 | 87.815300 | 4.829820 | 112.241000 | 127.81% |
1024 | 1024 | SHFL | CONSECUTIVE | 0.001048 | 0.001048 | 0.001048 | 0.499512 | 0.502439 | 5.178560 | 104.177000 | 6.105310 | 88.363900 | 84.82% |
4096 | 4096 | SHFL | CONSECUTIVE | 0.000262 | 0.000262 | 0.000262 | 0.499878 | 0.500610 | 4.300130 | 125.002000 | 5.436320 | 98.876800 | 79.10% |
2 | 2 | SHFL | SHFL | 0.357914 | 0.357914 | 0.357914 | 0.333333 | 1.333333 | 55.162900 | 25.953200 | 61.979200 | 23.099000 | 89.00% |
4 | 4 | SHFL | SHFL | 0.214748 | 0.214748 | 0.214748 | 0.400000 | 1.000000 | 36.166200 | 29.689100 | 41.238000 | 26.037700 | 87.70% |
8 | 8 | SHFL | SHFL | 0.119305 | 0.119305 | 0.119305 | 0.444444 | 0.777778 | 20.368300 | 41.001600 | 23.170100 | 36.043500 | 87.91% |
16 | 16 | SHFL | SHFL | 0.063161 | 0.063161 | 0.063161 | 0.470588 | 0.647059 | 11.429500 | 60.787900 | 12.343000 | 56.289000 | 92.60% |
32 | 32 | SHFL | SHFL | 0.032538 | 0.032538 | 0.032538 | 0.484848 | 0.575758 | 11.688100 | 52.892800 | 11.503300 | 53.742600 | 101.61% |
64 | 64 | SHFL | SHFL | 0.016519 | 0.016519 | 0.016519 | 0.492308 | 0.538462 | 10.376100 | 55.721000 | 11.411700 | 50.664600 | 90.93% |
128 | 128 | SHFL | SHFL | 0.008324 | 0.008324 | 0.008324 | 0.496124 | 0.519380 | 10.765200 | 51.803900 | 11.311900 | 49.300400 | 95.17% |
256 | 256 | SHFL | SHFL | 0.004178 | 0.004178 | 0.004178 | 0.498054 | 0.509727 | 10.185700 | 53.733700 | 11.342400 | 48.253900 | 89.80% |
512 | 512 | SHFL | SHFL | 0.002093 | 0.002093 | 0.002093 | 0.499024 | 0.504872 | 10.719000 | 50.574100 | 11.362600 | 47.709300 | 94.34% |
1024 | 1024 | SHFL | SHFL | 0.001048 | 0.001048 | 0.001048 | 0.499512 | 0.502439 | 10.132100 | 53.245700 | 11.375600 | 47.425100 | 89.07% |
4096 | 4096 | SHFL | SHFL | 0.000262 | 0.000262 | 0.000262 | 0.499878 | 0.500610 | 8.641660 | 62.201700 | 11.305000 | 47.547500 | 76.44% |
No Aliased Loads, Varying Buffer Size
using AtomicT=uint8_t; using TLEV_ALIAS_TYPE=uint8_t;
We now look at varying buffer sizes, where buffer sizes are uniformly distributed in `[<Min. buffer size>, <max. buffer size>]`. This highlights how resilient a method is to load imbalance.
Data
Min. buffer size: | max. buffer size: | in_gen: | out_gen: | src size: | dst size: | sizes size: | data_size: | total: | duration (SPPS): | BW (SPPS): | duration (TWP): | BW (TWP): | relative performance |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 2 | CONSECUTIVE | CONSECUTIVE | 0.357914 | 0.357914 | 0.357914 | 0.249993 | 1.249993 | 4.596700 | 291.985000 | 8.607710 | 155.926000 | 53.40% |
1 | 4 | CONSECUTIVE | CONSECUTIVE | 0.214748 | 0.214748 | 0.214748 | 0.249994 | 0.849994 | 3.141120 | 290.557000 | 5.651420 | 161.495000 | 55.58% |
1 | 8 | CONSECUTIVE | CONSECUTIVE | 0.119305 | 0.119305 | 0.119305 | 0.250006 | 0.583340 | 2.129500 | 294.132000 | 3.647010 | 171.745000 | 58.39% |
1 | 16 | CONSECUTIVE | CONSECUTIVE | 0.063161 | 0.063161 | 0.063161 | 0.249998 | 0.426469 | 1.453890 | 314.961000 | 2.061310 | 222.149000 | 70.53% |
1 | 32 | CONSECUTIVE | CONSECUTIVE | 0.032538 | 0.032538 | 0.032538 | 0.250025 | 0.340934 | 1.083390 | 337.898000 | 1.477500 | 247.766000 | 73.33% |
1 | 64 | CONSECUTIVE | CONSECUTIVE | 0.016519 | 0.016519 | 0.016519 | 0.249996 | 0.296150 | 0.906016 | 350.974000 | 1.486850 | 213.867000 | 60.94% |
1 | 128 | CONSECUTIVE | CONSECUTIVE | 0.008324 | 0.008324 | 0.008324 | 0.249913 | 0.273169 | 0.862752 | 339.974000 | 1.688900 | 173.671000 | 51.08% |
1 | 256 | CONSECUTIVE | CONSECUTIVE | 0.004178 | 0.004178 | 0.004178 | 0.249916 | 0.261589 | 0.882688 | 318.209000 | 1.858720 | 151.115000 | 47.49% |
1 | 512 | CONSECUTIVE | CONSECUTIVE | 0.002093 | 0.002093 | 0.002093 | 0.249880 | 0.255728 | 1.059650 | 259.130000 | 2.133860 | 128.681000 | 49.66% |
1 | 1024 | CONSECUTIVE | CONSECUTIVE | 0.001048 | 0.001048 | 0.001048 | 0.249926 | 0.252852 | 0.846240 | 320.829000 | 2.534530 | 107.120000 | 33.39% |
1 | 4096 | CONSECUTIVE | CONSECUTIVE | 0.000262 | 0.000262 | 0.000262 | 0.249683 | 0.250416 | 1.787710 | 150.405000 | 3.003650 | 89.518400 | 59.52% |
1 | 2 | CONSECUTIVE | SHFL | 0.357914 | 0.357914 | 0.357914 | 0.249993 | 1.249993 | 48.814200 | 27.495400 | 56.818300 | 23.622100 | 85.91% |
1 | 4 | CONSECUTIVE | SHFL | 0.214748 | 0.214748 | 0.214748 | 0.249994 | 0.849994 | 28.978200 | 31.495200 | 32.700900 | 27.909800 | 88.62% |
1 | 8 | CONSECUTIVE | SHFL | 0.119305 | 0.119305 | 0.119305 | 0.250006 | 0.583340 | 16.414600 | 38.158600 | 19.218200 | 32.591700 | 85.41% |
1 | 16 | CONSECUTIVE | SHFL | 0.063161 | 0.063161 | 0.063161 | 0.249998 | 0.426469 | 8.966460 | 51.070000 | 9.698660 | 47.214500 | 92.45% |
1 | 32 | CONSECUTIVE | SHFL | 0.032538 | 0.032538 | 0.032538 | 0.250025 | 0.340934 | 4.959360 | 73.815100 | 6.123740 | 59.779700 | 80.99% |
1 | 64 | CONSECUTIVE | SHFL | 0.016519 | 0.016519 | 0.016519 | 0.249996 | 0.296150 | 3.322850 | 95.697500 | 5.237570 | 60.713000 | 63.44% |
1 | 128 | CONSECUTIVE | SHFL | 0.008324 | 0.008324 | 0.008324 | 0.249913 | 0.273169 | 2.408350 | 121.790000 | 4.298780 | 68.231600 | 56.02% |
1 | 256 | CONSECUTIVE | SHFL | 0.004178 | 0.004178 | 0.004178 | 0.249916 | 0.261589 | 1.752480 | 160.275000 | 3.876450 | 72.458000 | 45.21% |
1 | 512 | CONSECUTIVE | SHFL | 0.002093 | 0.002093 | 0.002093 | 0.249880 | 0.255728 | 1.450690 | 189.280000 | 3.607840 | 76.108200 | 40.21% |
1 | 1024 | CONSECUTIVE | SHFL | 0.001048 | 0.001048 | 0.001048 | 0.249926 | 0.252852 | 0.955200 | 284.232000 | 3.463330 | 78.392300 | 27.58% |
1 | 4096 | CONSECUTIVE | SHFL | 0.000262 | 0.000262 | 0.000262 | 0.249683 | 0.250416 | 1.826110 | 147.243000 | 3.522820 | 76.325800 | 51.84% |
1 | 2 | SHFL | CONSECUTIVE | 0.357914 | 0.357914 | 0.357914 | 0.249993 | 1.249993 | 19.359000 | 69.330400 | 26.145700 | 51.334200 | 74.04% |
1 | 4 | SHFL | CONSECUTIVE | 0.214748 | 0.214748 | 0.214748 | 0.249994 | 0.849994 | 12.373200 | 73.762500 | 16.047400 | 56.873800 | 77.10% |
1 | 8 | SHFL | CONSECUTIVE | 0.119305 | 0.119305 | 0.119305 | 0.250006 | 0.583340 | 7.264740 | 86.218700 | 9.101380 | 68.819900 | 79.82% |
1 | 16 | SHFL | CONSECUTIVE | 0.063161 | 0.063161 | 0.063161 | 0.249998 | 0.426469 | 4.251100 | 107.717000 | 4.975070 | 92.042400 | 85.45% |
1 | 32 | SHFL | CONSECUTIVE | 0.032538 | 0.032538 | 0.032538 | 0.250025 | 0.340934 | 2.534140 | 144.457000 | 3.146750 | 116.334000 | 80.53% |
1 | 64 | SHFL | CONSECUTIVE | 0.016519 | 0.016519 | 0.016519 | 0.249996 | 0.296150 | 1.587200 | 200.345000 | 2.541380 | 125.124000 | 62.45% |
1 | 128 | SHFL | CONSECUTIVE | 0.008324 | 0.008324 | 0.008324 | 0.249913 | 0.273169 | 1.161250 | 252.584000 | 2.278080 | 128.754000 | 50.97% |
1 | 256 | SHFL | CONSECUTIVE | 0.004178 | 0.004178 | 0.004178 | 0.249916 | 0.261589 | 1.019070 | 275.623000 | 2.418820 | 116.123000 | 42.13% |
1 | 512 | SHFL | CONSECUTIVE | 0.002093 | 0.002093 | 0.002093 | 0.249880 | 0.255728 | 1.100930 | 249.413000 | 2.571230 | 106.792000 | 42.82% |
1 | 1024 | SHFL | CONSECUTIVE | 0.001048 | 0.001048 | 0.001048 | 0.249926 | 0.252852 | 0.944928 | 287.322000 | 2.851710 | 95.205300 | 33.14% |
1 | 4096 | SHFL | CONSECUTIVE | 0.000262 | 0.000262 | 0.000262 | 0.249683 | 0.250416 | 2.053470 | 130.940000 | 3.184030 | 84.446900 | 64.49% |
1 | 2 | SHFL | SHFL | 0.357914 | 0.357914 | 0.357914 | 0.249993 | 1.249993 | 55.349800 | 24.248800 | 61.491900 | 21.826800 | 90.01% |
1 | 4 | SHFL | SHFL | 0.214748 | 0.214748 | 0.214748 | 0.249994 | 0.849994 | 33.883500 | 26.935600 | 37.047000 | 24.635500 | 91.46% |
1 | 8 | SHFL | SHFL | 0.119305 | 0.119305 | 0.119305 | 0.250006 | 0.583340 | 19.901200 | 31.473300 | 21.161800 | 29.598400 | 94.04% |
1 | 16 | SHFL | SHFL | 0.063161 | 0.063161 | 0.063161 | 0.249998 | 0.426469 | 10.758600 | 42.563100 | 11.525800 | 39.729700 | 93.34% |
1 | 32 | SHFL | SHFL | 0.032538 | 0.032538 | 0.032538 | 0.250025 | 0.340934 | 5.817920 | 62.922100 | 7.820100 | 46.812200 | 74.40% |
1 | 64 | SHFL | SHFL | 0.016519 | 0.016519 | 0.016519 | 0.249996 | 0.296150 | 3.873250 | 82.098600 | 6.578460 | 48.337800 | 58.88% |
1 | 128 | SHFL | SHFL | 0.008324 | 0.008324 | 0.008324 | 0.249913 | 0.273169 | 2.744930 | 106.856000 | 5.650620 | 51.908100 | 48.58% |
1 | 256 | SHFL | SHFL | 0.004178 | 0.004178 | 0.004178 | 0.249916 | 0.261589 | 1.927710 | 145.706000 | 5.295460 | 53.041600 | 36.40% |
1 | 512 | SHFL | SHFL | 0.002093 | 0.002093 | 0.002093 | 0.249880 | 0.255728 | 1.533060 | 179.110000 | 5.134270 | 53.481100 | 29.86% |
1 | 1024 | SHFL | SHFL | 0.001048 | 0.001048 | 0.001048 | 0.249926 | 0.252852 | 1.016000 | 267.223000 | 5.050750 | 53.754000 | 20.12% |
1 | 4096 | SHFL | SHFL | 0.000262 | 0.000262 | 0.000262 | 0.249683 | 0.250416 | 2.069410 | 129.932000 | 5.016030 | 53.604500 | 41.26% |
16B-aligned buffers, 4B-aliased copies, Varying Buffer Size
`using AtomicT = uint4; using TLEV_ALIAS_TYPE = uint32_t;`
This experiment analyses the benefit that aliased loads (i.e., `reinterpret_cast<TLEV_ALIAS_TYPE*>(..)`) could have. Note that the buffer sizes were bumped to be at least 16B here. We also tested the impact of vectorised loads.
Data
Min. buffer size | Max. buffer size | in_gen | out_gen | src size | dst size | sizes size | data_size | total | duration (SPPS) | BW (SPPS) | duration (TWP) | BW (TWP) | duration (TWP) uint4 VECT | BW (TWP) uint4 VECT |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
16 | 16 | CONSECUTIVE | CONSECUTIVE | 0.033554 | 0.033554 | 0.033554 | 0.250000 | 0.343750 | 0.787264 | 468.837000 | 1.163460 | 317.243000 | 1.162980 | 317.374000 |
16 | 16 | CONSECUTIVE | CONSECUTIVE | 0.033554 | 0.033554 | 0.033554 | 0.250000 | 0.343750 | 0.781184 | 472.486000 | 1.165380 | 316.721000 | 1.161570 | 317.759000 |
16 | 16 | CONSECUTIVE | CONSECUTIVE | 0.033554 | 0.033554 | 0.033554 | 0.250000 | 0.343750 | 0.792992 | 465.451000 | 1.168290 | 315.931000 | 1.166080 | 316.530000 |
16 | 16 | CONSECUTIVE | CONSECUTIVE | 0.033554 | 0.033554 | 0.033554 | 0.250000 | 0.343750 | 0.793984 | 464.869000 | 1.168860 | 315.776000 | 1.162880 | 317.401000 |
32 | 32 | CONSECUTIVE | CONSECUTIVE | 0.022370 | 0.022370 | 0.022370 | 0.333333 | 0.395833 | 0.769184 | 552.563000 | 1.028770 | 413.138000 | 0.831776 | 308.463000 |
64 | 64 | CONSECUTIVE | CONSECUTIVE | 0.013422 | 0.013422 | 0.013422 | 0.400000 | 0.437500 | 0.802336 | 585.493000 | 1.062530 | 442.117000 | 0.671552 | 386.203000 |
128 | 128 | CONSECUTIVE | CONSECUTIVE | 0.007457 | 0.007457 | 0.007457 | 0.444444 | 0.465278 | 0.867104 | 576.157000 | 1.267010 | 394.305000 | 0.763072 | 344.623000 |
256 | 256 | CONSECUTIVE | CONSECUTIVE | 0.003948 | 0.003948 | 0.003948 | 0.470588 | 0.481618 | 0.997184 | 518.593000 | 1.258050 | 411.060000 | 0.982496 | 270.170000 |
512 | 512 | CONSECUTIVE | CONSECUTIVE | 0.002034 | 0.002034 | 0.002034 | 0.484848 | 0.490530 | 1.165660 | 451.848000 | 1.420580 | 370.767000 | 1.217820 | 219.087000 |
1024 | 1024 | CONSECUTIVE | CONSECUTIVE | 0.001032 | 0.001032 | 0.001032 | 0.492308 | 0.495192 | 1.124000 | 473.050000 | 2.384060 | 223.026000 | 1.493440 | 179.165000 |
4096 | 4096 | CONSECUTIVE | CONSECUTIVE | 0.000261 | 0.000261 | 0.000261 | 0.498047 | 0.498776 | 2.643140 | 202.622000 | 2.457500 | 217.927000 | 1.751390 | 152.954000 |
16 | 16 | CONSECUTIVE | SHFL | 0.033554 | 0.033554 | 0.033554 | 0.250000 | 0.343750 | 4.878300 | 75.661300 | 5.574110 | 66.216600 | 5.578180 | 66.168400 |
16 | 16 | CONSECUTIVE | SHFL | 0.033554 | 0.033554 | 0.033554 | 0.250000 | 0.343750 | 4.876960 | 75.682100 | 5.567520 | 66.295000 | 5.589150 | 66.038400 |
16 | 16 | CONSECUTIVE | SHFL | 0.033554 | 0.033554 | 0.033554 | 0.250000 | 0.343750 | 4.876290 | 75.692600 | 5.567840 | 66.291200 | 5.579940 | 66.147500 |
16 | 16 | CONSECUTIVE | SHFL | 0.033554 | 0.033554 | 0.033554 | 0.250000 | 0.343750 | 4.863940 | 75.884800 | 5.579390 | 66.153900 | 5.601250 | 65.895800 |
32 | 32 | CONSECUTIVE | SHFL | 0.022370 | 0.022370 | 0.022370 | 0.333333 | 0.395833 | 2.298430 | 184.919000 | 4.205730 | 101.058000 | 3.670780 | 69.895800 |
64 | 64 | CONSECUTIVE | SHFL | 0.013422 | 0.013422 | 0.013422 | 0.400000 | 0.437500 | 1.488800 | 315.531000 | 2.546590 | 184.467000 | 2.840000 | 91.322400 |
128 | 128 | CONSECUTIVE | SHFL | 0.007457 | 0.007457 | 0.007457 | 0.444444 | 0.465278 | 0.930208 | 537.071000 | 2.278430 | 219.268000 | 2.357250 | 111.559000 |
256 | 256 | CONSECUTIVE | SHFL | 0.003948 | 0.003948 | 0.003948 | 0.470588 | 0.481618 | 1.015680 | 509.150000 | 2.528100 | 204.554000 | 2.127580 | 124.762000 |
512 | 512 | CONSECUTIVE | SHFL | 0.002034 | 0.002034 | 0.002034 | 0.484848 | 0.490530 | 1.155390 | 455.865000 | 2.660350 | 197.982000 | 1.935900 | 137.822000 |
1024 | 1024 | CONSECUTIVE | SHFL | 0.001032 | 0.001032 | 0.001032 | 0.492308 | 0.495192 | 1.134820 | 468.542000 | 2.987360 | 177.986000 | 1.912030 | 139.941000 |
4096 | 4096 | CONSECUTIVE | SHFL | 0.000261 | 0.000261 | 0.000261 | 0.498047 | 0.498776 | 2.639420 | 202.907000 | 2.936930 | 182.353000 | 1.874210 | 142.931000 |
16 | 16 | SHFL | CONSECUTIVE | 0.033554 | 0.033554 | 0.033554 | 0.250000 | 0.343750 | 2.227260 | 165.718000 | 2.585470 | 142.759000 | 2.585380 | 142.764000 |
16 | 16 | SHFL | CONSECUTIVE | 0.033554 | 0.033554 | 0.033554 | 0.250000 | 0.343750 | 2.230660 | 165.466000 | 2.594050 | 142.287000 | 2.594500 | 142.262000 |
16 | 16 | SHFL | CONSECUTIVE | 0.033554 | 0.033554 | 0.033554 | 0.250000 | 0.343750 | 2.224450 | 165.928000 | 2.583010 | 142.895000 | 2.594080 | 142.285000 |
16 | 16 | SHFL | CONSECUTIVE | 0.033554 | 0.033554 | 0.033554 | 0.250000 | 0.343750 | 2.224380 | 165.933000 | 2.587460 | 142.649000 | 2.600770 | 141.919000 |
32 | 32 | SHFL | CONSECUTIVE | 0.022370 | 0.022370 | 0.022370 | 0.333333 | 0.395833 | 1.671490 | 254.278000 | 1.898340 | 223.892000 | 1.747550 | 146.818000 |
64 | 64 | SHFL | CONSECUTIVE | 0.013422 | 0.013422 | 0.013422 | 0.400000 | 0.437500 | 1.261920 | 372.260000 | 1.294340 | 362.937000 | 1.144320 | 226.646000 |
128 | 128 | SHFL | CONSECUTIVE | 0.007457 | 0.007457 | 0.007457 | 0.444444 | 0.465278 | 0.943456 | 529.530000 | 1.437020 | 347.655000 | 0.928096 | 283.345000 |
256 | 256 | SHFL | CONSECUTIVE | 0.003948 | 0.003948 | 0.003948 | 0.470588 | 0.481618 | 0.990752 | 521.960000 | 1.576290 | 328.070000 | 1.074750 | 246.979000 |
512 | 512 | SHFL | CONSECUTIVE | 0.002034 | 0.002034 | 0.002034 | 0.484848 | 0.490530 | 1.180320 | 446.237000 | 1.808800 | 291.189000 | 1.290300 | 206.780000 |
1024 | 1024 | SHFL | CONSECUTIVE | 0.001032 | 0.001032 | 0.001032 | 0.492308 | 0.495192 | 1.127840 | 471.440000 | 2.414750 | 220.192000 | 1.533250 | 174.513000 |
4096 | 4096 | SHFL | CONSECUTIVE | 0.000261 | 0.000261 | 0.000261 | 0.498047 | 0.498776 | 2.654460 | 201.757000 | 2.420610 | 221.249000 | 1.748260 | 153.228000 |
16 | 16 | SHFL | SHFL | 0.033554 | 0.033554 | 0.033554 | 0.250000 | 0.343750 | 5.552960 | 66.468800 | 6.301150 | 58.576400 | 6.299970 | 58.587400 |
16 | 16 | SHFL | SHFL | 0.033554 | 0.033554 | 0.033554 | 0.250000 | 0.343750 | 5.555360 | 66.440100 | 6.294780 | 58.635600 | 6.299710 | 58.589800 |
16 | 16 | SHFL | SHFL | 0.033554 | 0.033554 | 0.033554 | 0.250000 | 0.343750 | 5.568160 | 66.287400 | 6.288480 | 58.694400 | 6.313380 | 58.463000 |
16 | 16 | SHFL | SHFL | 0.033554 | 0.033554 | 0.033554 | 0.250000 | 0.343750 | 5.553220 | 66.465800 | 6.325470 | 58.351200 | 6.300960 | 58.578200 |
32 | 32 | SHFL | SHFL | 0.022370 | 0.022370 | 0.022370 | 0.333333 | 0.395833 | 3.292060 | 129.105000 | 5.389600 | 78.859800 | 4.105500 | 62.494700 |
64 | 64 | SHFL | SHFL | 0.013422 | 0.013422 | 0.013422 | 0.400000 | 0.437500 | 1.858560 | 252.756000 | 3.066980 | 153.168000 | 3.226050 | 80.394200 |
128 | 128 | SHFL | SHFL | 0.007457 | 0.007457 | 0.007457 | 0.444444 | 0.465278 | 0.987744 | 505.787000 | 2.813630 | 177.560000 | 2.630940 | 99.953400 |
256 | 256 | SHFL | SHFL | 0.003948 | 0.003948 | 0.003948 | 0.470588 | 0.481618 | 0.987456 | 523.702000 | 3.006820 | 171.987000 | 2.349660 | 112.970000 |
512 | 512 | SHFL | SHFL | 0.002034 | 0.002034 | 0.002034 | 0.484848 | 0.490530 | 1.174460 | 448.462000 | 3.121380 | 168.740000 | 2.081310 | 128.193000 |
1024 | 1024 | SHFL | SHFL | 0.001032 | 0.001032 | 0.001032 | 0.492308 | 0.495192 | 1.128220 | 471.279000 | 3.200990 | 166.107000 | 1.975810 | 135.424000 |
4096 | 4096 | SHFL | SHFL | 0.000261 | 0.000261 | 0.000261 | 0.498047 | 0.498776 | 2.652130 | 201.935000 | 3.174370 | 168.713000 | 1.910080 | 140.246000 |
Sorry for the wait. I did another cleanup pass over the code of this PR.
I've long wanted a `cuda::memcpy` that would handle runtime-determined alignment as well as take a CG parameter to use multiple threads to perform the copy. That seems like the best place to put such a building block, as it could have widespread applicability.
Agreed, @jrhemstad. I believe there's a recurring need for it, especially when dealing with string data. I often find myself needing to load string data into shared memory for further processing, so I tried to stay agnostic to the destination data space, i.e., to support vectorised stores of 4B, 8B, and 16B. The 4B and 8B stores are friendly to the shared memory space, reducing bank conflicts, while the 16B stores are presumably more efficient for global memory.
I hope I've been able to take a first step in that direction. In the interest of getting this PR through, I haven't exposed it as a stand-alone CG/block-level algorithm yet and have hidden it under the `detail` namespace for now. I plan to expose it in the public API and add typed tests in a follow-up PR. This is currently the signature (as I believe we don't have CG in CUB yet(?)):
```cpp
VectorizedCopy(int32_t thread_rank, int32_t group_size, void *dest, ByteOffsetT num_bytes, const void *src)
```
We don't expose CG in the CUB APIs, this would require some more discussion before we added anything like that. That may be better suited to the senders/receivers based APIs that @senior-zero is working on. For now, let's try to find a way to pass the same info in without adding any dependencies.
> We don't expose CG in the CUB APIs, this would require some more discussion before we added anything like that. That may be better suited to the senders/receivers based APIs that @senior-zero is working on. For now, let's try to find a way to pass the same info in without adding any dependencies.
Thanks. Yup, thought so too. This would be the proposed alternative interface for the time being. Worth noting that the implementation does not require synchronisation (which makes it easier to avoid relying on CG).
```cpp
VectorizedCopy(int32_t thread_rank, int32_t group_size, void *dest, ByteOffsetT num_bytes, const void *src)
```