
Build "fat binary" with multiple GPU archs

Open FilipVaverka opened this issue 4 years ago • 13 comments

Is it possible to build a "fat binary" that is compatible with multiple GPU architectures? I'm able to build a binary with OpenMP target offload for a single GPU architecture (such as AOMP_GPU=gfx900), but such a binary fails to run on gfx906 hardware (as expected). Is there any way to specify multiple GPU architectures to be included in the binary?

FilipVaverka avatar Aug 31 '20 17:08 FilipVaverka

This is unimplemented but possible. It'll involve generating code for N architectures and embedding them all in the host binary.

It would be more difficult to support running on various different GPU architectures at the same time, e.g. a machine with some gfx803 and some gfx906. Everything is more difficult with cross vendor too, so nvptx64 + amdgcn in one binary would be more challenging to implement.

Is this the simpler "build once, deploy to various homogeneous machines" use case?

JonChesterfield avatar Sep 01 '20 10:09 JonChesterfield

I think "build once, deploy everywhere" is the priority (at least for client applications). However, I hit the issue on my development machine, which has 2 GPUs (RX Vega - gfx900 and Radeon VII - gfx906). In this case OpenMP reports two devices as available, but only one can be used because there is no binary for the other. Here it would be better to report only those devices which can actually be used.

FilipVaverka avatar Sep 01 '20 10:09 FilipVaverka

There are three enhancements here, in order from least to most difficult to implement: 1) a fat binary for the same architecture but different GPUs and features; 2) two architectures, such as amdgcn and nvptx64, with only one active for execution; and 3) two concurrent devices with different architectures or different GPUs during execution. The last would require device type management in the OpenMP standard, which is not expected until at least OpenMP 6.0. I would like to see a strong use case for the third to take to the OpenMP Language committee.

gregrodgers avatar Oct 26 '20 13:10 gregrodgers

This is targeted for the next release, AOMP 13.0-3.

gregrodgers avatar Apr 20 '21 12:04 gregrodgers

AOMP 13.0-3 will only support multiple archs in the binary, not multiple archs concurrently.

gregrodgers avatar Apr 20 '21 12:04 gregrodgers

In aomp 13.0-4 you can build a multi-arch binary with multiple --offload-arch flags.

gregrodgers avatar Jul 06 '21 12:07 gregrodgers

How does mapping between devices and images work? For example, I can now compile my code as

aompcc --offload-arch gfx900 --offload-arch gfx906 -O3 main.cpp -o test

However, the resulting binary seems to be able to run only on "gfx906", as I can run it successfully with

ROCR_VISIBLE_DEVICES=1 ./test

but with 'ROCR_VISIBLE_DEVICES=0', which is "gfx900" I get

Possible gpu arch mismatch: device:gfx900, image:gfx906 please check compiler flag: -march=<gpu>
Libomptarget error: Unable to generate entries table for device id 0.
Libomptarget error: Failed to init globals on device 0
Libomptarget error: Run with LIBOMPTARGET_INFO=4 to dump host-target pointer mappings.
Libomptarget error: Source location information not present. Compile with -g or -gline-tables-only.
Libomptarget fatal error 1: failure of target construct while offloading is mandatory

Now, if I build only with --offload-arch gfx900, I can't run the binary regardless of ROCR_VISIBLE_DEVICES, and I get

WARNING: Runtime capabilities do NOT meet any image requirements.
So device offloading is now disabled.
Runtime capabilities : gfx906 xnack-
Image 0 requirements : gfx900

or

WARNING: Runtime capabilities do NOT meet any image requirements.
So device offloading is now disabled.
Runtime capabilities : gfx906 sramecc- xnack-
Image 0 requirements : gfx900

FilipVaverka avatar Jul 10 '21 14:07 FilipVaverka

Thank you for your patience. We just reviewed your issue in our weekly meeting. The first thing to mention, which I believe you are already aware of, is that you cannot offload to two different device types in the same application instance. That said, you should be able to build for multiple architectures and isolate the GPUs using ROCR_VISIBLE_DEVICES, as you have tried.

We do not have a machine with two different cards for this type of testing, and we appreciate you helping us get this working.

Can you run these four commands and show us the output?

$AOMP/bin/offload-arch -c
env ROCR_VISIBLE_DEVICES=0 $AOMP/bin/offload-arch -c
env ROCR_VISIBLE_DEVICES=1 $AOMP/bin/offload-arch -c
$AOMP/bin/offload-arch -v

Can you switch to using clang++ (instead of aompcc)? Both clang and clang++ now support the --offload-arch flag. I believe we have a bug in the aompcc script for multiple architectures. Compile with this command:

$AOMP/bin/clang++ --offload-arch gfx900 --offload-arch gfx906 -O3 main.cpp -o test

Thanks again for your help.

gregrodgers avatar Jul 12 '21 14:07 gregrodgers

Here are the tests of the offload-arch script:

Login@Machine:~> $AOMP/bin/aompcc --version
13.0-4
Login@Machine:~> $AOMP/bin/offload-arch -c
gfx906   sramecc- xnack-
Login@Machine:~> env ROCR_VISIBLE_DEVICES=0 $AOMP/bin/offload-arch -c
gfx906   xnack-
Login@Machine:~> env ROCR_VISIBLE_DEVICES=1 $AOMP/bin/offload-arch -c
gfx906   sramecc- xnack-
Login@Machine:~> $AOMP/bin/offload-arch -v
gfx906 VEGA20 1002:66AF amdgcn-amd-amdhsa
gfx900 VEGA10 1002:687F amdgcn-amd-amdhsa
Login@Machine:~> env ROCR_VISIBLE_DEVICES= $AOMP/bin/offload-arch -c # Also for any other non-existent device index
Segmentation fault (core dumped)

And here is the clang++ test you suggested. I'm probably (with AOMP 13.0-4) behind on the clang version.

Login@Machine:~> $AOMP/bin/clang++ --offload-arch gfx900 --offload-arch gfx906 -O3 main.cpp -o test
clang-13: error: unsupported option '--offload-arch'
clang-13: error: unsupported option '--offload-arch'
clang-13: error: no such file or directory: 'gfx900'
clang-13: error: no such file or directory: 'gfx906'

No problem, I'm happy to help. I wish the guys from HIP and OpenCL on ROCm were as responsive. :) (But I understand, it's a huge project and it's still quite early.)

EDIT: Just for completeness here is rocminfo_log.txt to confirm machine configuration.

FilipVaverka avatar Jul 12 '21 16:07 FilipVaverka
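
The per-device checks in the comment above can be scripted so they are easy to repeat after each AOMP update. This is only a minimal sketch: `probe_devices` is a hypothetical helper name, not part of AOMP, and the only things it relies on from this thread are the `offload-arch -c` option and the `ROCR_VISIBLE_DEVICES` variable. A failing probe (such as the segmentation fault seen above) is reported instead of stopping the loop.

```shell
# Hypothetical helper (not part of AOMP): run "<tool> -c" once per
# device index, each time with only that index visible to ROCr.
# $1 is the path to offload-arch, remaining arguments are device indices.
probe_devices() {
  tool="$1"; shift
  for idx in "$@"; do
    # Capture stdout and stderr; if the tool exits non-zero (crash,
    # missing binary), keep whatever it printed and mark it as failed.
    out=$(ROCR_VISIBLE_DEVICES="$idx" "$tool" -c 2>&1) || out="(failed: $out)"
    printf 'device %s: %s\n' "$idx" "$out"
  done
}

# Only run against the real tool when it is actually installed.
if [ -x "${AOMP:-}/bin/offload-arch" ]; then
  probe_devices "$AOMP/bin/offload-arch" 0 1
fi
```

On a machine like the one in this thread, the interesting result is whether each index reports its own architecture or, as above, both report gfx906.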

Hey @FilipVaverka,

"=" is missing between "offload-arch" flag and its value "gfx906". Is it a copying mistake while pasting the commands here?

Login@Machine:~> $AOMP/bin/clang++ --offload-arch gfx900 --offload-arch gfx906 -O3 main.cpp -o test

Otherwise, please use the following command and let us know the output:

$AOMP/bin/clang -fopenmp --offload-arch=gfx900 --offload-arch=gfx906 -O3 main.cpp -o test

AOMP 13.0-4 does support this option.

Also, please let us know the output of "offload-arch -c -v". It is supposed to print all details (including target features) of all active GPUs in the system.

saiislam avatar Jul 13 '21 11:07 saiislam
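
Since the space-separated form was rejected by clang-13 earlier in the thread and the `=` form is the one recommended here, it can help to generate the flags rather than type them. The following is a small sketch under that assumption; `offload_arch_flags` is a hypothetical helper name, and the script only prints the compile command so that it also runs on a machine without AOMP installed.

```shell
# Hypothetical helper: turn a list of GPU arch names into clang flags,
# always using the --offload-arch=<arch> spelling.
offload_arch_flags() {
  flags=""
  for arch in "$@"; do
    # Append with a separating space only when flags is non-empty.
    flags="${flags:+$flags }--offload-arch=${arch}"
  done
  printf '%s\n' "$flags"
}

# Print (do not run) the compile line for the two GPUs in this thread.
echo "\$AOMP/bin/clang++ -fopenmp $(offload_arch_flags gfx900 gfx906) -O3 main.cpp -o test"
```

Note that `-fopenmp` still has to be passed explicitly; the helper only covers the architecture flags.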

Sorry, that was it. I can compile the binary with

Login@Machine:~> $AOMP/bin/clang++ -fopenmp --offload-arch=gfx900 --offload-arch=gfx906 -O3 main.cpp -o test

However, the behavior of the "test" binary is the same: it runs with ROCR_VISIBLE_DEVICES=1 and fails on the other GPU with

Login@Machine:~> ROCR_VISIBLE_DEVICES=0 ./test 
Possible gpu arch mismatch: device:gfx900, image:gfx906 please check compiler flag: -march=<gpu>
Libomptarget error: Unable to generate entries table for device id 0.
Libomptarget error: Failed to init globals on device 0
Libomptarget error: Run with LIBOMPTARGET_INFO=4 to dump host-target pointer mappings.
Libomptarget error: Source location information not present. Compile with -g or -gline-tables-only.
Libomptarget fatal error 1: failure of target construct while offloading is mandatory
Aborted (core dumped)

Interestingly, "offload-arch -c -v" seems to indicate some issue with the gfx900 GPU:

Login@Machine:~> $AOMP/bin/offload-arch -c -v
gfx906 VEGA20 1002:66AF amdgcn-amd-amdhsa   sramecc- xnack-
gfx900 VEGA10 1002:687F amdgcn-amd-amdhsa   HSAERROR-INITIALIZATION

Although I don't seem to have any issues with it otherwise (OpenCL ROCm stack works for example).

FilipVaverka avatar Jul 13 '21 11:07 FilipVaverka

Thanks, this helps a lot. It appears we may have two problems. The first is the HSA error on Vega 10 with -c. The -c option is the only option in offload-arch that uses HSA, but HSA is also needed by the OpenMP runtime, which is why the application fails to run. The second is that when you mask off the gfx906 with ROCR_VISIBLE_DEVICES=1 you are seeing the strange output "gfx906 sramecc- xnack-". It should just say gfx900. We need to test on native gfx900. It appears that HSA init is getting called twice and failing.

The openmp runtime is calling "offload-arch -c" and returning bad information. If it returned the correct information, the runtime would be able to choose the correct image.

I hope we can get this fixed in 13.0-5, which will be out by the end of the month (July 2021). We have another bug wherein the use of the ROCm profiler fails because the runtime is trapping stdout for offload-arch. So we need to move offload-arch to a library call, which is a pretty big fix.

Thanks for your patience. For the time being just compile one image and use the mask to select the correct one.

gregrodgers avatar Jul 13 '21 13:07 gregrodgers
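
To illustrate why bad output from "offload-arch -c" leads to the wrong image being chosen, here is a deliberately simplified sketch of the matching step. It is not the runtime's actual logic: `select_image` is a hypothetical function, and it compares only the architecture name (the first word of the capability string), whereas the real runtime also has to consider target features such as xnack and sramecc.

```shell
# Hypothetical sketch of image selection (not the libomptarget code).
# $1 is a capability string such as "gfx906 sramecc- xnack-";
# remaining arguments are the archs the binary contains images for.
select_image() {
  caps="$1"; shift
  device_arch="${caps%% *}"   # keep only the first word
  for image_arch in "$@"; do
    if [ "$image_arch" = "$device_arch" ]; then
      printf '%s\n' "$image_arch"
      return 0
    fi
  done
  return 1                    # no image matches this device
}

# Correct capabilities for a gfx900 device would select the gfx900 image.
select_image "gfx900" gfx900 gfx906
# The misreported string from this thread selects gfx906 instead.
select_image "gfx906 sramecc- xnack-" gfx900 gfx906
# A single-image build for the wrong arch matches nothing.
select_image "gfx906 xnack-" gfx900 || echo "no compatible image"
```

With the misreported capabilities, the gfx906 image is picked even when the gfx900 card is the only visible device, which is consistent with the "Possible gpu arch mismatch" error earlier in the thread.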