aomp
Build "fat binary" with multiple GPU archs
Is it possible to build a "fat binary" that is compatible with multiple GPU architectures? I'm able to build a binary with OpenMP target offload for a single GPU architecture (such as AOMP_GPU=gfx900), but such a binary fails to run on gfx906 hardware (as expected). Is there any way to specify multiple GPU architectures to be included in the binary?
This is unimplemented but possible. It'll involve generating code for N architectures and embedding them all in the host binary.
It would be more difficult to support running on several different GPU architectures at the same time, e.g. a machine with some gfx803 and some gfx906 GPUs. Everything is more difficult across vendors too, so nvptx64 + amdgcn in one binary would be more challenging to implement.
Is this the simpler "build once, deploy to various homogeneous machines" use case?
I think "build once, deploy everywhere" is the priority (at least for client applications). However, I hit the issue on my development machine, which has 2 GPUs (RX Vega - gfx900 and Radeon VII - gfx906). In this case OpenMP reports two devices as available, but only one can be used because there is no binary for the other. Here it would be better to report only those devices which can actually be used.
There are three enhancements here, in order from least to most difficult to implement:
1. A fat binary for the same architecture but different GPUs and features.
2. Two architectures, such as amdgcn and nvptx64, with only one active for execution.
3. Two concurrent devices with different architectures or different GPUs during execution.
The last would require device type management in the OpenMP standard, which is not expected until at least OpenMP 6.0. I would like to see a strong use case for the third to take to the OpenMP Language committee.
This is targeted for the next release, AOMP 13.0-3.
AOMP 13.0-3 will only support multiple archs in the binary, not multiple archs concurrently.
In aomp 13.0-4 you can build a multi-arch binary with multiple --offload-arch flags.
How does mapping between devices and images work? For example, I can now compile my code as
aompcc --offload-arch gfx900 --offload-arch gfx906 -O3 main.cpp -o test
However, the resulting binary seems to run only on "gfx906", as I can run it successfully with
ROCR_VISIBLE_DEVICES=1 ./test
but with 'ROCR_VISIBLE_DEVICES=0', which is "gfx900" I get
Possible gpu arch mismatch: device:gfx900, image:gfx906 please check compiler flag: -march=
Libomptarget error: Unable to generate entries table for device id 0.
Libomptarget error: Failed to init globals on device 0
Libomptarget error: Run with LIBOMPTARGET_INFO=4 to dump host-target pointer mappings.
Libomptarget error: Source location information not present. Compile with -g or -gline-tables-only.
Libomptarget fatal error 1: failure of target construct while offloading is mandatory
Now, if I build only with --offload-arch gfx900, I can't run the binary regardless of ROCR_VISIBLE_DEVICES, and I get
WARNING: Runtime capabilities do NOT meet any image requirements.
         So device offloading is now disabled.
         Runtime capabilities : gfx906 xnack-
         Image 0 requirements : gfx900
or
WARNING: Runtime capabilities do NOT meet any image requirements.
         So device offloading is now disabled.
         Runtime capabilities : gfx906 sramecc- xnack-
         Image 0 requirements : gfx900
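One way to read the warning above is that an image is usable only if everything it requires shows up in the runtime capabilities. The following is a minimal sketch of that reading, not the actual libomptarget logic; the function name image_matches is hypothetical and the matching rule is an assumption.

```shell
#!/bin/sh
# Hypothetical helper (not part of AOMP): succeeds when every token the image
# requires (e.g. "gfx900" or "gfx906 xnack-") appears in the runtime capabilities.
# Assumption: the real runtime check may differ; this only models the warning text.
image_matches() {
    caps=" $1 "   # runtime capabilities, padded so whole tokens can be matched
    reqs=$2       # image requirements
    for req in $reqs; do
        case "$caps" in
            *" $req "*) ;;      # this requirement is satisfied
            *) return 1 ;;      # this requirement is missing
        esac
    done
    return 0
}

# Values copied from the first warning above
if image_matches "gfx906 xnack-" "gfx900"; then
    echo "image usable"
else
    echo "image rejected"
fi
```

With the values from the warning, the gfx900 image is rejected because the runtime reports gfx906, which is what makes the reported capabilities look suspicious for a device that should be gfx900.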
Thank you for your patience. We just reviewed your issue in our weekly meeting. First, as I believe you are aware, you cannot offload to two different device types in the same application instance. That said, you should be able to build for multiple architectures and isolate the GPUs using ROCR_VISIBLE_DEVICES, as you have tried.
We do not have a machine with two different cards for this type of testing, and we appreciate your help in getting this working.
Can you run these four commands and show us the output?
$AOMP/bin/offload-arch -c
env ROCR_VISIBLE_DEVICES=0 $AOMP/bin/offload-arch -c
env ROCR_VISIBLE_DEVICES=1 $AOMP/bin/offload-arch -c
$AOMP/bin/offload-arch -v
Can you switch to using clang++ (instead of aompcc)? Both clang and clang++ now support the --offload-arch flag. I believe we have a bug in the aompcc script for multiple architectures. Compile with this command:
$AOMP/bin/clang++ --offload-arch gfx900 --offload-arch gfx906 -O3 main.cpp -o test
Thanks again for your help.
Here are the tests of the offload-arch script:
Login@Machine:~> $AOMP/bin/aompcc --version
13.0-4
Login@Machine:~> $AOMP/bin/offload-arch -c
gfx906 sramecc- xnack-
Login@Machine:~> env ROCR_VISIBLE_DEVICES=0 $AOMP/bin/offload-arch -c
gfx906 xnack-
Login@Machine:~> env ROCR_VISIBLE_DEVICES=1 $AOMP/bin/offload-arch -c
gfx906 sramecc- xnack-
Login@Machine:~> $AOMP/bin/offload-arch -v
gfx906 VEGA20 1002:66AF amdgcn-amd-amdhsa
gfx900 VEGA10 1002:687F amdgcn-amd-amdhsa
Login@Machine:~> env ROCR_VISIBLE_DEVICES= $AOMP/bin/offload-arch -c # Also for any other non-existent device index
Segmentation fault (core dumped)
And here is the clang++ test you suggested. I'm probably (with AOMP 13.0-4) behind on the clang version.
Login@Machine:~> $AOMP/bin/clang++ --offload-arch gfx900 --offload-arch gfx906 -O3 main.cpp -o test
clang-13: error: unsupported option '--offload-arch'
clang-13: error: unsupported option '--offload-arch'
clang-13: error: no such file or directory: 'gfx900'
clang-13: error: no such file or directory: 'gfx906'
No problem, I'm happy to help. I wish the guys from HIP and OpenCL on ROCm were as responsive. :) (But I understand, it's a huge project and it's still quite early.)
EDIT: Just for completeness here is rocminfo_log.txt to confirm machine configuration.
Hey @FilipVaverka,
The "=" is missing between the "--offload-arch" flag and its value (e.g. "gfx906"). Was that a copy-paste mistake when posting the commands here?
Login@Machine:~> $AOMP/bin/clang++ --offload-arch gfx900 --offload-arch gfx906 -O3 main.cpp -o test
Otherwise, please use the following command and let us know the output:
$AOMP/bin/clang -fopenmp --offload-arch=gfx900 --offload-arch=gfx906 -O3 main.cpp -o test
AOMP 13.0-4 does support this option.
Also, please let us know the output of "offload-arch -c -v". It is supposed to print all details (including target features) of all active GPUs in the system.
Sorry, that was it. I can compile the binary with
Login@Machine:~> $AOMP/bin/clang++ -fopenmp --offload-arch=gfx900 --offload-arch=gfx906 -O3 main.cpp -o test
However, the behavior of the "test" binary is the same: it runs with ROCR_VISIBLE_DEVICES=1
and fails with the other GPU:
Login@Machine:~> ROCR_VISIBLE_DEVICES=0 ./test
Possible gpu arch mismatch: device:gfx900, image:gfx906 please check compiler flag: -march=<gpu>
Libomptarget error: Unable to generate entries table for device id 0.
Libomptarget error: Failed to init globals on device 0
Libomptarget error: Run with LIBOMPTARGET_INFO=4 to dump host-target pointer mappings.
Libomptarget error: Source location information not present. Compile with -g or -gline-tables-only.
Libomptarget fatal error 1: failure of target construct while offloading is mandatory
Aborted (core dumped)
Interesting, "offload-arch -c -v" seems to indicate some issue with gfx900 GPU:
Login@Machine:~> $AOMP/bin/offload-arch -c -v
gfx906 VEGA20 1002:66AF amdgcn-amd-amdhsa sramecc- xnack-
gfx900 VEGA10 1002:687F amdgcn-amd-amdhsa HSAERROR-INITIALIZATION
Although I don't seem to have any issues with it otherwise (OpenCL ROCm stack works for example).
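To spot this kind of failure quickly, the verbose output can be filtered for lines that carry an HSA error. This is only a sketch: the helper name failed_gpus is hypothetical, and it assumes errors always appear as a field beginning with "HSAERROR", which is inferred from the single log above.

```shell
#!/bin/sh
# Hypothetical helper (not part of AOMP): reads "offload-arch -c -v" style
# output on stdin and prints the arch name (first field) of every line that
# reports an HSA error.
# Assumption: errors show up as text starting with "HSAERROR", as in the log above.
failed_gpus() {
    awk '/HSAERROR/ { print $1 }'
}

# Sample input copied from the output above. On a real machine one might pipe
# "$AOMP/bin/offload-arch -c -v" into the function instead.
failed_gpus <<'EOF'
gfx906 VEGA20 1002:66AF amdgcn-amd-amdhsa sramecc- xnack-
gfx900 VEGA10 1002:687F amdgcn-amd-amdhsa HSAERROR-INITIALIZATION
EOF
```

For the sample input this prints gfx900, the device whose HSA initialization failed.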
Thanks, this helps a lot. It appears we may have two problems. The first is the HSA error on Vega 10 with -c. The -c option is the only option in offload-arch that uses HSA, but HSA is also needed by the OpenMP runtime, which is why the application fails to run. The second is that when you mask off the gfx906 with ROCR_VISIBLE_DEVICES=1 you are seeing the strange output "gfx906 sramecc- xnack-". It should just say gfx900. We need to test on native gfx900. It appears that HSA init is getting called twice and failing.
The openmp runtime is calling "offload-arch -c" and returning bad information. If it returned the correct information, the runtime would be able to choose the correct image.
I hope we can get this fixed in 13.0-5 which will be out by end of the month (July 2021). We have another bug wherein the use of rocm profiler is failing because the runtime is trapping stdout for offload-arch. So we need to move offload-arch to a library call which is a pretty big fix.
Thanks for your patience. For the time being just compile one image and use the mask to select the correct one.
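A minimal sketch of that workaround, using only the flags already shown in this thread. It prints the commands rather than running them. The device-index-to-arch mapping (0 is gfx900, 1 is gfx906) is taken from the machine described above, the output names test_gfx900 and test_gfx906 are made up, and the default AOMP prefix is an assumption.

```shell
#!/bin/sh
# Sketch of the one-image-per-binary workaround (dry run: commands are printed).
# Assumptions: /usr/lib/aomp is the install prefix unless AOMP is already set;
# device index 0 is gfx900 and index 1 is gfx906, as on the machine in this thread.
AOMP=${AOMP:-/usr/lib/aomp}

build_cmd() {
    # $1 = GPU arch; builds a binary containing a single device image
    echo "$AOMP/bin/clang++ -fopenmp --offload-arch=$1 -O3 main.cpp -o test_$1"
}

run_cmd() {
    # $1 = ROCR device index to expose, $2 = arch of the binary to run
    echo "ROCR_VISIBLE_DEVICES=$1 ./test_$2"
}

build_cmd gfx900; run_cmd 0 gfx900
build_cmd gfx906; run_cmd 1 gfx906
```

Piping the printed lines to sh would execute them. Keeping it as a dry run makes it easy to check the mask and arch pairing first.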