CUDA.jl
Explore NVPTX's sched4reg
static cl::opt<bool> sched4reg(
    "nvptx-sched4reg",
    cl::desc("NVPTX Specific: schedule for register pressue"), cl::init(false));
Also:
// LSV is still relatively new; this switch lets us turn it off in case we
// encounter (or suspect) a bug.
// TODO/NOTE: don't want this when under register pressure
static cl::opt<bool>
    DisableLoadStoreVectorizer("disable-nvptx-load-store-vectorizer",
                               cl::desc("Disable load/store vectorizer"),
                               cl::init(false), cl::Hidden);
@wsmoses and I had a good benchmark for this in https://github.com/wsmoses/Enzyme-GPU-Tests/tree/main/DG/cuda: the reverse code performed notably worse on CUDA.jl than on AMDGPU.jl.
You can try this easily with LLVM.clopts("--nvptx-sched4reg")
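For anyone who wants to check the effect on register usage, here is a minimal sketch of how that could look. It assumes a working CUDA.jl install with an NVIDIA GPU, and that LLVM.jl is available in the environment. The `axpy!` kernel is a hypothetical stand-in, not the benchmark above. The flag is process-global, so the baseline number probably has to come from a separate Julia session started without it.

```julia
using CUDA   # assumes a working CUDA.jl install and an NVIDIA GPU
using LLVM   # provides LLVM.clopts

# Set the flag before the first kernel is compiled in this session
# (assumption: kernels that were already compiled are not recompiled).
LLVM.clopts("--nvptx-sched4reg")
# To try the other switch quoted above instead:
# LLVM.clopts("--disable-nvptx-load-store-vectorizer")

# Hypothetical stand-in kernel; substitute the kernel you actually care about.
function axpy!(y, x, a)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(y)
        @inbounds y[i] += a * x[i]
    end
    return nothing
end

x = CUDA.rand(Float32, 1024)
y = CUDA.zeros(Float32, 1024)

# Compile without launching, then query the register count of the result.
k = @cuda launch=false axpy!(y, x, 2f0)
@show CUDA.registers(k)

# Launch the compiled kernel to confirm it still runs.
k(y, x, 2f0; threads=256, blocks=4)
```

Comparing the printed register count (and timings) between a session with the flag and one without should show whether sched4reg makes a difference for a given kernel.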