Davide Rossetti
cuda_runtime_api.h is included but not actually used by the test apps, so remove the include
from Intel manuals: ``` Unlike WC stores and stores with non-temporal hint, direct-stores are eligible for immediate eviction from the write-combining buffer, and thus not combined with younger stores (including...
- run copybw and copylat on Arm64 with a directly attached GPU
- if needed, add optimized copy functions, e.g. using NEON intrinsics
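A minimal sketch of what a NEON-based copy function could look like, for illustration only. The name `copy_neon_sketch` is hypothetical and not part of the existing code. It moves 16 bytes per iteration with `vld1q_u8`/`vst1q_u8` and falls back to `memcpy` for the tail, or for the whole buffer when not building for AArch64. Whether this actually beats the current copy path on a given Arm64 system is exactly what copybw and copylat would need to show.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: copy n bytes from src to dst, 16 bytes at a time
 * with NEON loads/stores on AArch64, plain memcpy elsewhere. */
static void copy_neon_sketch(void *dst, const void *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    uint8_t *d = (uint8_t *)dst;
    const uint8_t *s = (const uint8_t *)src;
    size_t i = 0;
#if defined(__aarch64__)
    /* Main loop: one 128-bit load and one 128-bit store per iteration. */
    for (; i + 16 <= n; i += 16)
        vst1q_u8(d + i, vld1q_u8(s + i));
#endif
    /* Tail (or everything, on non-AArch64 builds). */
    if (i < n)
        memcpy(d + i, s + i, n - i);
}
```

A real implementation would probably also care about destination alignment and about how the tail is written to a write-combined mapping; both are left out here.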
on POWER9, wc_store_fence() is defined as sync, which is a heavyweight fence that also orders accesses to MMIO mappings, while lwsync is enough for cached mappings.
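To illustrate the distinction, here is a sketch that picks the fence by mapping type. The names `fence_mmio`, `fence_cached` and `publish` are hypothetical and do not come from the existing sources. On non-POWER builds the sketch substitutes compiler builtins so that it still compiles; those stand-ins are an assumption of the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: full fence, also ordering stores to MMIO mappings. */
static inline void fence_mmio(void)
{
#if defined(__powerpc__) || defined(__powerpc64__)
    __asm__ __volatile__("sync" ::: "memory");
#else
    __sync_synchronize(); /* stand-in: full barrier on other targets */
#endif
}

/* Hypothetical: lighter fence, sufficient for cached mappings. */
static inline void fence_cached(void)
{
#if defined(__powerpc__) || defined(__powerpc64__)
    __asm__ __volatile__("lwsync" ::: "memory");
#else
    __atomic_thread_fence(__ATOMIC_RELEASE); /* stand-in: release fence */
#endif
}

/* Write a payload, fence, then raise a flag. The fence is chosen
 * according to whether the target is an MMIO or a cached mapping. */
static void publish(volatile uint32_t *payload, volatile uint32_t *flag,
                    uint32_t value, int is_mmio)
{
    assert(payload != NULL && flag != NULL);
    *payload = value;
    if (is_mmio)
        fence_mmio();
    else
        fence_cached();
    *flag = 1;
}
```

The point of the split is that callers working on cached mappings would no longer pay for the heavier instruction.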
for both RPM and DEB packages; also, update metadata so that packages with the new name supersede the old ones
- print estimated bw, useful for large buffer sizes
- add -d param
- add extra warmup iterations and -w param
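As a rough sketch of the bandwidth estimate with warmup iterations: the helper names below (`elapsed_sec`, `estimate_bw_mibps`, `timed_copy`) are hypothetical, the unit (MiB/s) is an assumption, and plain `memcpy` stands in for whatever copy routine the test really measures. Warmup iterations run before the timer starts, so they do not count toward the estimate.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Seconds elapsed between two CLOCK_MONOTONIC samples. */
static double elapsed_sec(const struct timespec *a, const struct timespec *b)
{
    return (double)(b->tv_sec - a->tv_sec) +
           (double)(b->tv_nsec - a->tv_nsec) * 1e-9;
}

/* Estimated bandwidth in MiB/s for `iters` copies of `bytes` bytes. */
static double estimate_bw_mibps(size_t bytes, int iters, double seconds)
{
    assert(iters > 0 && seconds > 0.0);
    return ((double)bytes * (double)iters) / seconds / (1024.0 * 1024.0);
}

/* Run `warmup` untimed copies, then time `iters` copies.
 * Returns the elapsed time of the timed part, in seconds. */
static double timed_copy(void *dst, const void *src, size_t bytes,
                         int warmup, int iters)
{
    struct timespec t0, t1;
    for (int i = 0; i < warmup; ++i)
        memcpy(dst, src, bytes);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < iters; ++i)
        memcpy(dst, src, bytes);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return elapsed_sec(&t0, &t1);
}
```

A caller would pass the result of `timed_copy` to `estimate_bw_mibps` and print it next to the buffer size.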
strawman design:
- allocate device memory buffer B
- launch CUDA kernel:
  - polling on B[0]
  - writing a zero-copy flag
- CPU:
  - wait for the kernel to really...