flang
Question and errors compiling for -fopenmp-targets=nvptx64-nvidia-cuda
I built flang following this guide: https://github.com/flang-compiler/flang/wiki/Building-Flang, except that I used release_90, added the NVPTX target, set LIBOMPTARGET_NVPTX_COMPUTE_CAPABILITIES="35,61", and compiled with GCC 9.2.
When I use the following program:
program hello
  use omp_lib
  implicit none
  integer, parameter :: N = 1024
  integer :: i
  real, dimension(N) :: x
  real, dimension(N) :: sum
  integer, dimension(N) :: thn
  integer, dimension(N) :: ten
  do i = 1, N
    x(i) = 1
    sum(i) = 1
  end do
  print *, "omp_get_num_devices = ", omp_get_num_devices()
  !$omp parallel do
  do i = 1, N
    sum(i) = sum(i) + x(i) * x(i)
    thn(i) = omp_get_thread_num()
    ten(i) = omp_get_team_num()
  end do
  !$omp end parallel do
  do i = 1, N
    print *, "team num = ", ten(i), ", thread num = ", thn(i), ", result: ", sum(i)
  end do
end program hello
Compiling it with
flang -fopenmp test0.f90 -o test0_cpu
flang -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda test0.f90 -Xopenmp-target -march=sm_61 -o test0_gpu
On the CPU it works as expected with 8 threads. With the GPU build I also see only 8 threads. Is this correct?
I modified the example to use target:
  print *, "omp_get_num_devices = ", omp_get_num_devices()
  !$omp target teams distribute parallel do
  do i = 1, N
    sum(i) = sum(i) + x(i) * x(i)
    thn(i) = omp_get_thread_num()
    ten(i) = omp_get_team_num()
  end do
  !$omp end target teams distribute parallel do
Then I get the following error:
/pathto/flang/tools/flang2/flang2exe/verify.cpp:80: DEBUG_ASSERT 0 < ilix failed
F90-F-0000-Internal compiler error. internal error in verifier itself 0 (test1.f90: 33)
/pathto/flang/tools/flang2/flang2exe/verify.cpp:80: DEBUG_ASSERT 0 < ilix failed
F90-F-0000-Internal compiler error. internal error in verifier itself 0 (test1.f90: 33)
When I comment out the print statement that calls omp_get_num_devices():
  ! print *, "omp_get_num_devices = ", omp_get_num_devices()
  !$omp target teams distribute parallel do
  do i = 1, N
    sum(i) = sum(i) + x(i) * x(i)
    thn(i) = omp_get_thread_num()
    ten(i) = omp_get_team_num()
  end do
  !$omp end target teams distribute parallel do
I get:
/tmp/test1a-7ce91a.ll:30:82: error: initializer with struct type has wrong # elements
@.openmp.offload.entry.__nv_MAIN__F1L21_1_ = weak global %struct.__tgt_bin_desc { i8* getelementptr(i8, i8* @.openmp.offload.region.__nv_MAIN__F1L21_1_, i32 0), i8* getelementptr(i8, i8* bitcast([19 x i8]* @.C421_MAIN_ to i8*), i32 0) ,i64 0, i32 0, i32 0 }, section ".omp_offloading.entries", align 1
I saved the programs as test{0,1,1a}.f90. What am I doing wrong?
Equivalent programs written in C and compiled with Clang 9.0.1 work as expected.
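For comparison, a C counterpart of the offloaded loop might look like the sketch below. The C source was not posted in this thread, so the function name `add_squares` and the explicit `map` clauses are assumptions made for illustration only.

```c
/* Hypothetical C counterpart of the Fortran loop above (not the poster's code).
   Computes sum[i] = sum[i] + x[i] * x[i] for i in [0, n). */
void add_squares(const float *x, float *sum, int n)
{
    /* With -fopenmp and an offload target, this loop is a candidate for the GPU.
       Without -fopenmp the pragma is ignored and the loop runs serially on the host. */
    #pragma omp target teams distribute parallel for \
        map(to: x[0:n]) map(tofrom: sum[0:n])
    for (int i = 0; i < n; ++i) {
        sum[i] = sum[i] + x[i] * x[i];
    }
}
```

Called from a `main` that initializes `x` and `sum` to 1, this should leave 2.0 in every element of `sum`, whether the loop is offloaded or falls back to the host.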
@grypp Güray, is this something you have any information about?
Hello @justxi. For the first example, how did you observe 8 threads on the GPU? I would not expect that code to run on the GPU, since it contains no target region. For the second example, unfortunately, the OpenMP API functions are not implemented for the GPU device. Some of them might work, but none of them have been tested. Does your code work properly if you remove the API calls?
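One way to check where a region actually executes is to query `omp_is_initial_device()` from inside it. The C sketch below is only an illustration: both helper names are hypothetical, and the `_OPENMP` guards are an assumption added so the code also builds without `-fopenmp`.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: returns 1 if a target region ran on the host
   (initial device), 0 if it ran on an offload device. */
int target_region_on_host(void)
{
    int on_host = 1; /* value kept when OpenMP is disabled */
#ifdef _OPENMP
    #pragma omp target map(from: on_host)
    {
        on_host = omp_is_initial_device();
    }
#endif
    return on_host;
}

/* Hypothetical helper: a plain parallel region has no target construct,
   so it runs on the host even when -fopenmp-targets is given. */
int plain_parallel_on_host(void)
{
    int on_host = 1;
#ifdef _OPENMP
    #pragma omp parallel
    {
        /* One thread records the answer; the others wait at the implicit barrier. */
        #pragma omp single
        on_host = omp_is_initial_device();
    }
#endif
    return on_host;
}
```

`plain_parallel_on_host()` returns 1 in every configuration, which matches the observation that a `parallel do` without `target` uses only the CPU threads.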
Hi @grypp
> Hello @justxi. For the first example, how did you observe 8 threads on the GPU? I would not expect that code to run on the GPU, since it contains no target region.
OK, that would explain why I see the same number of threads as CPU cores.
> For the second example, unfortunately, the OpenMP API functions are not implemented for the GPU device. Some of them might work, but none of them have been tested. Does your code work properly if you remove the API calls?
I modified the program:
program hello
  implicit none
  integer, parameter :: N = 1024
  integer :: i
  real, dimension(N) :: x
  real, dimension(N) :: sum
  do i = 1, N
    x(i) = 1
    sum(i) = 1
  end do
  !$omp target teams distribute parallel do
  do i = 1, N
    sum(i) = sum(i) + x(i) * x(i)
  end do
  !$omp end target teams distribute parallel do
  do i = 1, N
    print *, "result: ", sum(i)
  end do
end program hello
Compiling with:
flang -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -Xopenmp-target -march=sm_61 test1b.f90 -o test1b_gpu
But I get this error again:
/tmp/test1b-e4125d.ll:30:86: error: initializer with struct type has wrong # elements
@.openmp.offload.entry.__nv_MAIN__F1L15_1_ = weak global %struct.__tgt_device_image { i8* getelementptr(i8, i8* @.openmp.offload.region.__nv_MAIN__F1L15_1_, i32 0), i8* getelementptr(i8, i8* bitcast([19 x i8]* @.C406_MAIN_ to i8*), i32 0) ,i64 0, i32 0, i32 0 }, section ".omp_offloading.entries", align 1
@grypp Is there a Fortran example that is known to work when offloading to an NVIDIA GPU?
@justxi Can you try the following simpler case and check whether it shows any issue?
PROGRAM OFFLOADINF_DEMO
  USE OMP_LIB
  INTEGER :: isHost = -1
  character*16 :: name
  !$OMP TARGET MAP (from: isHost)
  isHost = OMP_IS_INITIAL_DEVICE()
  !$OMP END TARGET
  if (isHost < 0) then
    PRINT *, "Runtime error, isHost = I3", isHost
  end if
  ! CHECK: Target region executed on the device
  if (isHost) then
    name = "host"
  else
    name = "device"
  endif
  PRINT *,"Target region executed on the ", name
END