Is a 2d array treated internally as 1d array at assembly level in dpcpp for NVIDIA BACKEND?
Describe the bug
I am trying to run a program on the NVIDIA backend. The code is similar in the CUDA and DPC++ versions, but the DPC++ code takes considerably longer (about 10x) than the CUDA version.
Profiling for CUDA: (profiler screenshot attached)
Profiling for DPC++: (profiler screenshot attached)
Does this mean DPC++ treats 2D arrays internally as 1D arrays?
To Reproduce
The actual program is big, but here is a simple code sample that shows the behavior:
SYCL VERSION
#include <CL/sycl.hpp>
#include <chrono>
#include <cstdio>
#include <cstdlib>
#define N 150
void initializeInput(int *arr)
{
    for (int i = 0; i < N; ++i)
    {
        for (int j = 0; j < N; ++j)
        {
            arr[i * N + j] = 20;
        }
    }
}
void compute2DAvg(sycl::item<1> item_ct1, int *d_inputArr, float *avg)
{
    int arr[N][N];
    for (int i = 0; i < N; ++i)
    {
        for (int j = 0; j < N; ++j)
        {
            arr[i][j] = d_inputArr[i * N + j];
        }
    }
    float avg_temp = 0.0;
    int sum = 0;
    for (int i = 0; i < N; ++i)
    {
        for (int j = 0; j < N; ++j)
        {
            sum += arr[i][j];
        }
    }
    avg_temp = sum / (N * N);
    *avg = avg_temp;
}
int main()
{
    sycl::gpu_selector device_selector;
    sycl::queue q_ct(device_selector);
    float kernTime = 0.0;
    float *h_avg = (float *)malloc(sizeof(float));
    float *d_avg = (float *)sycl::malloc_device(sizeof(float), q_ct);
    int *h_inputArr = (int *)malloc(sizeof(int) * N * N);
    int *d_inputArr = (int *)sycl::malloc_device(sizeof(int) * N * N, q_ct);
    initializeInput(h_inputArr);
    // Wait for the host-to-device copy before launching the kernel
    // (the queue is out-of-order, so there is no implicit dependency).
    q_ct.memcpy(d_inputArr, h_inputArr, sizeof(int) * N * N).wait();
    for (size_t i = 0; i < 5; i++)
    {
        auto start_kernel = std::chrono::steady_clock::now();
        q_ct.submit([&](sycl::handler &cgh)
                    { cgh.parallel_for(
                          sycl::range<1>(1), [=](sycl::item<1> item_ct1)
                          { compute2DAvg(item_ct1, d_inputArr, d_avg); }); })
            .wait();
        auto stop_kernel = std::chrono::steady_clock::now();
        kernTime = std::chrono::duration<float, std::micro>(stop_kernel - start_kernel).count();
    }
    // Wait for the device-to-host copy before reading the result on the host.
    q_ct.memcpy(h_avg, d_avg, sizeof(float)).wait();
    printf("The computed 2d avg is %f \n", *h_avg);
    printf("Kernel execution time is %f \n", kernTime);
    free(h_avg);
    free(h_inputArr);
    sycl::free(d_avg, q_ct);
    sycl::free(d_inputArr, q_ct);
    return 0;
}
CUDA VERSION
#include <chrono>
#include <cstdio>
#include <cstdlib>
#define N 150
void initializeInput(int *arr)
{
    for (int i = 0; i < N; ++i)
    {
        for (int j = 0; j < N; ++j)
        {
            arr[i * N + j] = 17;
        }
    }
}
__global__ void compute2DAvg(int *d_inputArr, float *avg)
{
    int arr[N][N];
    for (int i = 0; i < N; ++i)
    {
        for (int j = 0; j < N; ++j)
        {
            arr[i][j] = d_inputArr[i * N + j];
        }
    }
    float avg_temp = 0.0;
    int sum = 0;
    for (int i = 0; i < N; ++i)
    {
        for (int j = 0; j < N; ++j)
        {
            sum += arr[i][j];
        }
    }
    avg_temp = sum / (N * N);
    *avg = avg_temp;
}
int main()
{
    float *d_avg;
    int *h_inputArr = (int *)malloc(sizeof(int) * N * N);
    int *d_inputArr;
    float *h_avg = (float *)malloc(sizeof(float));
    float kernTime = 0.0;
    initializeInput(h_inputArr);
    cudaMalloc(&d_avg, sizeof(float));
    cudaMalloc(&d_inputArr, sizeof(int) * N * N);
    cudaMemcpy(d_inputArr, h_inputArr, sizeof(int) * N * N, cudaMemcpyHostToDevice);
    for (size_t i = 0; i < 5; i++)
    {
        auto start_kernel = std::chrono::steady_clock::now();
        compute2DAvg<<<1, 1>>>(d_inputArr, d_avg);
        cudaDeviceSynchronize();
        auto stop_kernel = std::chrono::steady_clock::now();
        kernTime = std::chrono::duration<float, std::micro>(stop_kernel - start_kernel).count();
    }
    cudaMemcpy(h_avg, d_avg, sizeof(float), cudaMemcpyDeviceToHost);
    printf("The computed 2d avg is %f \n", *h_avg);
    printf("Kernel execution time is %f \n", kernTime);
    free(h_avg);
    free(h_inputArr);
    cudaFree(d_avg);
    cudaFree(d_inputArr);
    return 0;
}
Below is the actual project. All the kernels in the CUDA version are roughly 10x faster than the corresponding DPC++ kernels on the NVIDIA backend; I tried many things but nothing worked out. https://github.com/intel-innersource/applications.benchmarking.oneapi.onebench/tree/anudeep_dev_branch/cudaSift
Environment:
- OS: [Linux]
- Target device and vendor: [NVIDIA GPU]
- DPC++ version: [clang version 14.0.0]
- Dependencies version: [opencv]
(I'm assuming you are comparing DPC++ against NVCC)
I think you are just noticing a general LLVM/NVPTX vs NVCC optimization difference.
In your sample, looking at the output of the cuda code from clang/llvm https://godbolt.org/z/z1v1jjnTv
and from nvcc https://godbolt.org/z/x9v35MaWq
you can see that NVCC is fully unrolling the inner loops while llvm only does a small partial unrolling. As you are running 1 thread, ILP is playing a big role here.
Looking at the profiler output, it seems something similar is happening, as the nvcc version is using way more registers. I don't know SASS well enough to tell if the difference in ld instructions is significant.
On a side note, your profiler output shows no fma instructions are used with dpcpp, so you are probably missing some fast math flags (enabled by default by nvcc).
Hi @Naghasan,
I tried using the flags "-ffast-math" and "-mfma" but I don't see any effect. This issue has been blocking our workload development for many weeks. Could you please suggest any other options I might need to try? It would be very helpful.
Hi @AerialMantis, any update on this? Any suggestions for default flags that I might have missed during dpcpp compilation?
Hello @Naghasan @AerialMantis,
Is there any triage being performed on this? Should we keep waiting on it?
Thanks
@skambapugithub apologies for the delay in replying, we've picked this ticket up, so someone will be looking into it shortly.
@skambapugithub I've looked into the sample you've provided. It seems to me that clang correctly generates floating point fused multiply add instructions, so I would not worry about those flags.
One significant difference to nvcc is the fact that clang fails to fully unroll the innermost loops. This is due to the heuristic used by clang, which is based on the size of the unrolled loop (see: https://github.com/intel/llvm/blob/sycl/llvm/include/llvm/Analysis/TargetTransformInfo.h#L439), 300 by default. You could override it by setting -unroll-threshold-aggressive; in the synthetic example given by @Naghasan the inner loops come out at 1202 and 752 respectively. Be careful though, as setting that value too high would result in unrolling the outer loop too, something you probably want to avoid. Alternatively, I found those loops fully unrolled when using #pragma unroll, as in the sketch below.
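For illustration only, here is a minimal sketch (reusing the names from the sample above, not the actual project code) of what annotating the inner loops of the sample kernel with #pragma unroll could look like; the pragma asks the compiler to unroll the annotated loop regardless of the size-based threshold:

void compute2DAvg(sycl::item<1> item_ct1, int *d_inputArr, float *avg)
{
    int arr[N][N];
    for (int i = 0; i < N; ++i)
    {
        // Ask the compiler to fully unroll the inner copy loop.
#pragma unroll
        for (int j = 0; j < N; ++j)
        {
            arr[i][j] = d_inputArr[i * N + j];
        }
    }
    int sum = 0;
    for (int i = 0; i < N; ++i)
    {
        // Ask the compiler to fully unroll the inner reduction loop.
#pragma unroll
        for (int j = 0; j < N; ++j)
        {
            sum += arr[i][j];
        }
    }
    *avg = sum / (N * N);
}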
Finally, nvcc makes use (in some cases at least) of vector loads and stores. I appreciate that this is only a workaround, but if in your profiling the vector operations prove to bring a significant performance boost, you might want to consider modifying your kernels to explicitly use vector types. Using the same example, a change to int2 results in clang being able to select vector variants of all ld and st instructions (notice either the 64-bit or v2 variants), see the link below and the sketch after it:
https://godbolt.org/z/15K7GeEeP
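For reference, a SYCL-flavoured sketch of that idea (hypothetical code, not the exact example behind the godbolt link; the kernel name compute2DAvgVec is made up, and it assumes the USM device allocation is at least 8-byte aligned, which device allocations normally are, and that N is even):

void compute2DAvgVec(sycl::item<1> item_ct1, int *d_inputArr, float *avg)
{
    // Reinterpret the input as pairs of ints so the backend can emit
    // 64-bit / v2 vector loads instead of scalar 32-bit ones.
    const sycl::int2 *in = reinterpret_cast<const sycl::int2 *>(d_inputArr);
    sycl::int2 arr[N][N / 2];
    for (int i = 0; i < N; ++i)
    {
        for (int j = 0; j < N / 2; ++j)
        {
            arr[i][j] = in[i * (N / 2) + j];
        }
    }
    int sum = 0;
    for (int i = 0; i < N; ++i)
    {
        for (int j = 0; j < N / 2; ++j)
        {
            // Single-element swizzles convert implicitly to the element type.
            const int a = arr[i][j].x();
            const int b = arr[i][j].y();
            sum += a + b;
        }
    }
    *avg = sum / (N * N);
}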
@skambapugithub I've looked a bit further into the vector loads/stores; I think clang is right not to generate them here. From the PTX ISA:
By default, vector variables are aligned to a multiple of their overall size (vector length times base-type size), to enable vector load and store instructions which require addresses aligned to a multiple of the access size.
which is enforced in clang's load/store vectorizer pass: https://github.com/intel/llvm/blob/sycl/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp#L1081 It feels like using the language's built-in vector types, and hence ensuring the correct alignment, might be the best way forward here.
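A tiny sketch of the alignment point (assuming the implementation follows the SYCL requirement that a vec is aligned to its overall size, which is exactly what the PTX vector ld/st instructions need):

#include <CL/sycl.hpp>

// A plain int* only guarantees 4-byte alignment, so the vectorizer has to back
// off; sycl::int2 is 8 bytes and aligned to 8 bytes, so 64-bit / v2 accesses
// through it are legal.
static_assert(sizeof(sycl::int2) == 2 * sizeof(int), "two packed ints");
static_assert(alignof(sycl::int2) == sizeof(sycl::int2), "aligned to its size");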
Hi @jchlanda ,
Thanks for the suggestions. It would be clearer if we could discuss this directly. I tried reaching out on Teams. Is there a way we could have a meeting?
Hi @jchlanda, I have tried the suggestions we previously discussed. Unfortunately I didn't see any improvements on my end. It would be great if we could schedule one more meeting to discuss the actual issue I have been facing in the project and to provide you with the project sources. Is there a way I can share the project source with you?
Hi @skambapugithub
I've revisited this issue with a recent build and was able to run both the CUDA and SYCL versions on a Tesla V100-SXM2-16GB node. I do not see the performance degradation you've mentioned; for the application in question (cudaSift) the numbers are almost identical:
- SYCL: Total Time = 170.305 ms
- CUDA: Total Time = 170.22 ms
It is true that the time needed for individual stages varies, but it's very much comparable.
I've also looked at the generated PTX and again it's comparable; the same features are used in both kernels: fma instructions, loads/stores to global/shared memory, 64-bit wide addition.
Unless you have any further info on this, I suggest we close this ticket.
Thank you.
I'm going to close this ticket, please feel free to re-open it or create a new one if there is any new information.
Hi Jakub,
Sorry for the late reply. As long as the performance is comparable, we are good on this. Thanks for closing the ticket.
Thanks Aundeep