
MIGraphX execution provider for Triton Inference Server

Open bpickrel opened this issue 1 year ago • 65 comments

Can this be done by leveraging the onnxruntime work we already have as a back end?

As a preliminary step, learn to add a Cuda back end, then change it to MIGraphX/ROCm

See https://github.com/triton-inference-server/onnxruntime_backend and https://github.com/triton-inference-server/onnxruntime_backend#onnx-runtime-with-tensorrt-optimization

Documentation for building the back end is in the server docs, under "Development Build of Backend or Repository Agent".

bpickrel avatar Nov 06 '23 23:11 bpickrel

Use the following target models for testing:

  • resnet50
  • Bert
  • distilgpt2

Baby-step the process for this and see what we need or can leverage from existing backends/execution providers:

  • [x] Get triton installed on a system
  • [x] Get triton running with an Onnxruntime Backend + CUDA/TensorRT Execution Provider
  • [x] Determine what's needed for MIGraphX EP support in Triton using the Onnxruntime backend
  • [ ] Add MIGraphX EP changes to get Triton running Onnxruntime as a backend

TedThemistokleous avatar Nov 07 '23 03:11 TedThemistokleous

Latest update:

  • I tried pulling and building Triton-inference-server using the Python script mentioned at https://github.com/triton-inference-server/server/blob/main/docs/customization_guide/build.md#building-with-docker No luck; it failed with a not-authorized error, which I've asked about.
  • Cloned git@github.com:triton-inference-server/onnxruntime_backend.git, which uses a CMake build; no success yet, working on it.
  • Installed cuda-toolkit per instructions at https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=20.04&target_type=deb_local Success. I assume you can't do much with CUDA on a machine without an NVIDIA GPU, but it's a dependency for onnxruntime_backend.

bpickrel avatar Nov 10 '23 17:11 bpickrel

Note to myself on how I ran an example. This doesn't introduce the execution provider, yet.

  1. Get the triton-inference-server repo: git clone git@github.com:triton-inference-server/server.git
  2. Go to the examples directory and fetch the example models
cd ~/Triton-server/server/docs/examples
./fetch_models.sh

Note: the model_repository directory != model-repository. We want the one with the underscore.

  3. Set the backend choice in the config file for our model (a sketch of the resulting layout follows this list):
nano model_repository/densenet_onnx/config.pbtxt
Add the line backend: "onnxruntime"
  4. In a different console, same directory, run a prebuilt Docker image of the server:
docker run --rm --net=host -v ${PWD}/model_repository:/models nvcr.io/nvidia/tritonserver:23.04-py3 tritonserver --model-repository=/models

  • this takes a long time to pull the first time
  • with no --gpus=1 argument, it can run without having Cuda up. It does inference with CPU.
  • Question--What do I use to specify the Navi GPU?
  • Question--what do the port numbers do? I can use those numbers for the server, but still have to use --net=host for the client. How can I tell the client to connect to a port or a URL?
  5. In the original console, run a different Docker image for the example client:
docker run -it --rm --net=host nvcr.io/nvidia/tritonserver:22.07-py3-sdk /workspace/install/bin/image_client -m densenet_onnx -c 3 -s INCEPTION /workspace/images/mug.jpg
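For reference, here's a minimal sketch of the model repository after step 3. The exact file set comes from fetch_models.sh; the layout shown is the standard Triton convention and may differ slightly.

# Expected layout (produced by fetch_models.sh; sketch only):
#   model_repository/
#     densenet_onnx/
#       config.pbtxt      <- must contain the line: backend: "onnxruntime"
#       1/
#         model.onnx
# Append the backend line without opening an editor, then verify:
echo 'backend: "onnxruntime"' >> model_repository/densenet_onnx/config.pbtxt
cat model_repository/densenet_onnx/config.pbtxt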

Next steps: how to call the Migraphx execution provider? How to use Cuda? When I get my AWS setup working right, I can run this on an EC2 instance with various GPU configs.

bpickrel avatar Nov 18 '23 01:11 bpickrel

The above doesn't go in the order of Ted's earlier note. I'm running a prebuilt Docker image of a server before having built my own server.

bpickrel avatar Nov 18 '23 01:11 bpickrel

Here's what the onnxruntime shared libraries look like, as installed in that server Docker I used above:

root@home-tower:/opt/tritonserver/backends# ll onnxruntime/
total 507608
drwxrwxrwx  3 triton-server triton-server      4096 Apr 18  2023 ./
drwxrwxrwx 13 triton-server triton-server      4096 Apr 18  2023 ../
-rw-rw-rw-  1 triton-server triton-server      1073 Apr 18  2023 LICENSE
drwxrwxrwx  2 triton-server triton-server      4096 Apr 18  2023 LICENSE.openvino/
-rw-rw-rw-  1 triton-server triton-server  21015528 Apr 18  2023 libonnxruntime.so
-rw-rw-rw-  1 triton-server triton-server 420751256 Apr 18  2023 libonnxruntime_providers_cuda.so
-rw-rw-rw-  1 triton-server triton-server    559152 Apr 18  2023 libonnxruntime_providers_openvino.so
-rw-rw-rw-  1 triton-server triton-server     15960 Apr 18  2023 libonnxruntime_providers_shared.so
-rw-rw-rw-  1 triton-server triton-server    548472 Apr 18  2023 libonnxruntime_providers_tensorrt.so
-rw-rw-rw-  1 triton-server triton-server  12953944 Apr 18  2023 libopenvino.so
-rw-rw-rw-  1 triton-server triton-server    288352 Apr 18  2023 libopenvino_c.so
-rw-rw-rw-  1 triton-server triton-server  32005816 Apr 18  2023 libopenvino_intel_cpu_plugin.so
-rw-rw-rw-  1 triton-server triton-server    332096 Apr 18  2023 libopenvino_ir_frontend.so
-rw-rw-rw-  1 triton-server triton-server   3684352 Apr 18  2023 libopenvino_onnx_frontend.so
lrwxrwxrwx  1 triton-server triton-server        11 Apr 18  2023 libtbb.so -> libtbb.so.2
-rw-rw-rw-  1 triton-server triton-server    438832 Apr 18  2023 libtbb.so.2
-rw-rw-rw-  1 triton-server triton-server    689616 Apr 18  2023 libtriton_onnxruntime.so
-rwxrwxrwx  1 triton-server triton-server  22923312 Apr 18  2023 onnx_test_runner*
-rwxrwxrwx  1 triton-server triton-server   3529192 Apr 18  2023 onnxruntime_perf_test*
-rw-rw-rw-  1 triton-server triton-server         7 Apr 18  2023 ort_onnx_version.txt
-rw-rw-rw-  1 triton-server triton-server      1056 Apr 18  2023 plugins.xml

bpickrel avatar Nov 20 '23 22:11 bpickrel

I'll need to take this over. It looks like what Brian's done works: it pulls in Onnxruntime as a backend and then adds the TensorRT/CUDA side. I'll poke around this week/next to get this running with our flavor of Onnxruntime with the ROCm/MIGraphX EPs.

TedThemistokleous avatar Nov 20 '23 22:11 TedThemistokleous

Yahtzee

https://github.com/triton-inference-server/onnxruntime_backend/blob/main/tools/gen_ort_dockerfile.py

TedThemistokleous avatar Nov 20 '23 22:11 TedThemistokleous

Got changes generating and building a dockerfile, @bpickrel. Once I've got this building I'll pass it back to you to generate the Docker image and run it before pushing another inference through.

Changes are up on my fork: https://github.com/TedThemistokleous/onnxruntime_backend/blob/add_migraphx_rocm_onnxrt_eps/tools/gen_ort_dockerfile.py

I found that we have upstream triton images for ROCm we can leverage which I'm currently building.

https://hub.docker.com/r/rocm/oai-triton/tags

So to run the script, you just need to run the following off the ROCm 5.7 dockerfile with all the Triton Inference Server pieces:

python3 tools/gen_ort_dockerfile.py --migraphx-home=/opt/rocm --ort-migraphx --rocm-home=/opt/rocm/ --rocm-version=5.7 --enable-rocm --migraphx-version=rocm-5.7.1 --ort-version=1.7.0  --output=migx_rocm_triton_inf.dockerfile --triton-container=rocm/oai-triton:preview_2023-11-29_182

then

docker build -t migx_rocm_tritron -f migx_rocm_triton_inf.dockerfile .

That should create the Docker image you need to run an inference similar to the previous example.

TedThemistokleous avatar Dec 06 '23 19:12 TedThemistokleous

Hitting a failure when attempting to build onnxruntime this way. Currently looking into this.

TedThemistokleous avatar Dec 06 '23 19:12 TedThemistokleous

Finally some good news after debugging. @bpickrel @causten

I'm able to build a container off the generated dockerfile and get the proper hooks/link pieces to work for an MIGraphX + ROCm build.

  1. Needed to rework some of the automation used to build this onnxruntime backend, which uses a different repo than what we use to perform onnxruntime builds. Had to change the rel-XXXX to just building Onnxruntime main.

  2. Needed to add additional pieces from DLM/CI dockerfiles to get Onnxruntime to build

It looks like the built container contains two binaries, one for perf and one for test.

All the library shared objects are there too, after popping into the container to take a look (a quick sanity check is sketched below the listing):

libonnxruntime.so  libonnxruntime.so.main  libonnxruntime_providers_migraphx.so  libonnxruntime_providers_rocm.so  libonnxruntime_providers_shared.so
root@aus-navi3x-02:/opt/onnxruntime/lib# 
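A quick way to confirm the EPs are actually registered, assuming the ONNX Runtime Python wheel was also installed in the image (if only the shared libraries were copied in, this check won't work):

python3 -c "import onnxruntime as ort; print(ort.get_available_providers())"
# Expect something like ['MIGraphXExecutionProvider', 'ROCMExecutionProvider', 'CPUExecutionProvider']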

There's a bin folder that seems to contain the binaries we'd use to do the perf runs. The output of these seems interesting; maybe I should also add hooks/pieces like TensorRT here (an example invocation is sketched after the help output).

root@aus-navi3x-02:/opt/onnxruntime/bin# ./onnxruntime_perf_test 
perf_test [options...] model_path [result_file]
Options:
        -m [test_mode]: Specifies the test mode. Value could be 'duration' or 'times'.
                Provide 'duration' to run the test for a fix duration, and 'times' to repeated for a certain times. 
        -M: Disable memory pattern.
        -A: Disable memory arena
        -I: Generate tensor input binding (Free dimensions are treated as 1.)
        -c [parallel runs]: Specifies the (max) number of runs to invoke simultaneously. Default:1.
        -e [cpu|cuda|dnnl|tensorrt|openvino|dml|acl|nnapi|coreml|qnn|snpe|rocm|migraphx|xnnpack|vitisai]: Specifies the provider 'cpu','cuda','dnnl','tensorrt', 'openvino', 'dml', 'acl', 'nnapi', 'coreml', 'qnn', 'snpe', 'rocm', 'migraphx', 'xnnpack' or 'vitisai'. Default:'cpu'.
        -b [tf|ort]: backend to use. Default:ort
        -r [repeated_times]: Specifies the repeated times if running in 'times' test mode.Default:1000.
        -t [seconds_to_run]: Specifies the seconds to run for 'duration' mode. Default:600.
        -p [profile_file]: Specifies the profile name to enable profiling and dump the profile data to the file.
        -s: Show statistics result, like P75, P90. If no result_file provided this defaults to on.
        -S: Given random seed, to produce the same input data. This defaults to -1(no initialize).
        -v: Show verbose information.
        -x [intra_op_num_threads]: Sets the number of threads used to parallelize the execution within nodes, A value of 0 means ORT will pick a default. Must >=0.
        -y [inter_op_num_threads]: Sets the number of threads used to parallelize the execution of the graph (across nodes), A value of 0 means ORT will pick a default. Must >=0.
        -f [free_dimension_override]: Specifies a free dimension by name to override to a specific value for performance optimization. Syntax is [dimension_name:override_value]. override_value must > 0
        -F [free_dimension_override]: Specifies a free dimension by denotation to override to a specific value for performance optimization. Syntax is [dimension_denotation:override_value]. override_value must > 0
        -P: Use parallel executor instead of sequential executor.
        -o [optimization level]: Default is 99 (all). Valid values are 0 (disable), 1 (basic), 2 (extended), 99 (all).
                Please see onnxruntime_c_api.h (enum GraphOptimizationLevel) for the full list of all optimization levels.
        -u [optimized_model_path]: Specify the optimized model path for saving.
        -d [CUDA only][cudnn_conv_algorithm]: Specify CUDNN convolution algorithms: 0(benchmark), 1(heuristic), 2(default). 
        -q [CUDA only] use separate stream for copy. 
        -z: Set denormal as zero. When turning on this option reduces latency dramatically, a model may have denormals.
        -i: Specify EP specific runtime options as key value pairs. Different runtime options available are: 
            [OpenVINO only] [device_type]: Overrides the accelerator hardware type and precision with these values at runtime.
            [OpenVINO only] [device_id]: Selects a particular hardware device for inference.
            [OpenVINO only] [enable_npu_fast_compile]: Optionally enabled to speeds up the model's compilation on NPU device targets.
            [OpenVINO only] [num_of_threads]: Overrides the accelerator hardware type and precision with these values at runtime.
            [OpenVINO only] [cache_dir]: Explicitly specify the path to dump and load the blobs(Model caching) or cl_cache (Kernel Caching) files feature. If blob files are already present, it will be directly loaded.
            [OpenVINO only] [enable_opencl_throttling]: Enables OpenCL queue throttling for GPU device(Reduces the CPU Utilization while using GPU) 
            [QNN only] [backend_path]: QNN backend path. e.g '/folderpath/libQnnHtp.so', '/folderpath/libQnnCpu.so'.
            [QNN only] [profiling_level]: QNN profiling level, options: 'basic', 'detailed', default 'off'.
            [QNN only] [rpc_control_latency]: QNN rpc control latency. default to 10.
            [QNN only] [vtcm_mb]: QNN VTCM size in MB. default to 0(not set).
            [QNN only] [htp_performance_mode]: QNN performance mode, options: 'burst', 'balanced', 'default', 'high_performance', 
            'high_power_saver', 'low_balanced', 'low_power_saver', 'power_saver', 'sustained_high_performance'. Default to 'default'. 
            [QNN only] [qnn_context_priority]: QNN context priority, options: 'low', 'normal', 'normal_high', 'high'. Default to 'normal'. 
            [QNN only] [qnn_saver_path]: QNN Saver backend path. e.g '/folderpath/libQnnSaver.so'.
            [QNN only] [htp_graph_finalization_optimization_mode]: QNN graph finalization optimization mode, options: 
            '0', '1', '2', '3', default is '0'.
         [Usage]: -e <provider_name> -i '<key1>|<value1> <key2>|<value2>'

         [Example] [For OpenVINO EP] -e openvino -i "device_type|CPU_FP32 enable_npu_fast_compile|true num_of_threads|5 enable_opencl_throttling|true cache_dir|"<path>""
         [Example] [For QNN EP] -e qnn -i "backend_path|/folderpath/libQnnCpu.so" 

            [TensorRT only] [trt_max_partition_iterations]: Maximum iterations for TensorRT parser to get capability.
            [TensorRT only] [trt_min_subgraph_size]: Minimum size of TensorRT subgraphs.
            [TensorRT only] [trt_max_workspace_size]: Set TensorRT maximum workspace size in byte.
            [TensorRT only] [trt_fp16_enable]: Enable TensorRT FP16 precision.
            [TensorRT only] [trt_int8_enable]: Enable TensorRT INT8 precision.
            [TensorRT only] [trt_int8_calibration_table_name]: Specify INT8 calibration table name.
            [TensorRT only] [trt_int8_use_native_calibration_table]: Use Native TensorRT calibration table.
            [TensorRT only] [trt_dla_enable]: Enable DLA in Jetson device.
            [TensorRT only] [trt_dla_core]: DLA core number.
            [TensorRT only] [trt_dump_subgraphs]: Dump TRT subgraph to onnx model.
            [TensorRT only] [trt_engine_cache_enable]: Enable engine caching.
            [TensorRT only] [trt_engine_cache_path]: Specify engine cache path.
            [TensorRT only] [trt_force_sequential_engine_build]: Force TensorRT engines to be built sequentially.
            [TensorRT only] [trt_context_memory_sharing_enable]: Enable TensorRT context memory sharing between subgraphs.
            [TensorRT only] [trt_layer_norm_fp32_fallback]: Force Pow + Reduce ops in layer norm to run in FP32 to avoid overflow.
         [Usage]: -e <provider_name> -i '<key1>|<value1> <key2>|<value2>'

         [Example] [For TensorRT EP] -e tensorrt -i 'trt_fp16_enable|true trt_int8_enable|true trt_int8_calibration_table_name|calibration.flatbuffers trt_int8_use_native_calibration_table|false trt_force_sequential_engine_build|false'
            [NNAPI only] [NNAPI_FLAG_USE_FP16]: Use fp16 relaxation in NNAPI EP..
            [NNAPI only] [NNAPI_FLAG_USE_NCHW]: Use the NCHW layout in NNAPI EP.
            [NNAPI only] [NNAPI_FLAG_CPU_DISABLED]: Prevent NNAPI from using CPU devices.
            [NNAPI only] [NNAPI_FLAG_CPU_ONLY]: Using CPU only in NNAPI EP.
         [Usage]: -e <provider_name> -i '<key1> <key2>'

         [Example] [For NNAPI EP] -e nnapi -i " NNAPI_FLAG_USE_FP16 NNAPI_FLAG_USE_NCHW NNAPI_FLAG_CPU_DISABLED "
            [SNPE only] [runtime]: SNPE runtime, options: 'CPU', 'GPU', 'GPU_FLOAT16', 'DSP', 'AIP_FIXED_TF'. 
            [SNPE only] [priority]: execution priority, options: 'low', 'normal'. 
            [SNPE only] [buffer_type]: options: 'TF8', 'TF16', 'UINT8', 'FLOAT', 'ITENSOR'. default: ITENSOR'. 
            [SNPE only] [enable_init_cache]: enable SNPE init caching feature, set to 1 to enabled it. Disabled by default. 
         [Usage]: -e <provider_name> -i '<key1>|<value1> <key2>|<value2>' 

         [Example] [For SNPE EP] -e snpe -i "runtime|CPU priority|low" 

        -T [Set intra op thread affinities]: Specify intra op thread affinity string
         [Example]: -T 1,2;3,4;5,6 or -T 1-2;3-4;5-6 
                 Use semicolon to separate configuration between threads.
                 E.g. 1,2;3,4;5,6 specifies affinities for three threads, the first thread will be attached to the first and second logical processor.
                 The number of affinities must be equal to intra_op_num_threads - 1

        -D [Disable thread spinning]: disable spinning entirely for thread owned by onnxruntime intra-op thread pool.
        -Z [Force thread to stop spinning between runs]: disallow thread from spinning during runs to reduce cpu usage.
        -h: help
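Based on the help text above, a hedged example of pointing this binary at the MIGraphX or ROCm EP. The model path is hypothetical; any ONNX file, e.g. the resnet50 target model, would do.

./onnxruntime_perf_test -e migraphx -m times -r 100 /models/resnet50/1/model.onnx
./onnxruntime_perf_test -e rocm -m times -r 100 /models/resnet50/1/model.onnx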

I've built a container on aus-navi3x-02.amd.com, named migx_rocm_tritron

You should be able to just docker into it with

docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add=video --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined migx_rocm_tritron

If you want to use another system, check out my branch and then run the following to generate a new dockerfile:

python3 tools/gen_ort_dockerfile.py --ort-migraphx --rocm-home=/opt/rocm/ --rocm-version=6.0 --enable-rocm --migraphx-version=develop --ort-version=main  --output=migx_rocm_triton_inf.dockerfile --triton-container=compute-artifactory.amd.com:5000/rocm-plus-docker/framework/compute-rocm-rel-6.0:88_ubuntu22.04_py3.10_pytorch_release-2.1_011de5c  

This uses build 88 of ROCm 6.0, and builds MIGraphX from Develop and Onnxruntime off main using the upstream Microsoft repo.

Upstream changes are here: https://github.com/TedThemistokleous/onnxruntime_backend/blob/add_migraphx_rocm_onnxrt_eps/tools/gen_ort_dockerfile.py

Branch name is add_migraphx_rocm_onnxrt_eps

TedThemistokleous avatar Dec 22 '23 20:12 TedThemistokleous

Does your Docker contain a Triton server? I thought this was a replacement for the example docker image that had the server included.

bpickrel avatar Jan 08 '24 18:01 bpickrel

I thought it did, but it looks like I'm wrong here. I did see shared binaries for MIGraphX, Onnxrt and the MIGraphX & ROCm EPs along with some other scripts. From an initial read of the front-end part of Triton, it looks like we need to enable (or add) a hook, based on their repo. It sounds like we were using the Nvidia front end in the example instead of just using Onnxruntime + building the back-end support.

"By default, build.py does not enable any of Triton's optional features but you can enable all features, backends, and repository agents with the --enable-all flag. The -v flag turns on verbose output."

D'oh!

Looks like wishful thinking on my part hoping we could just invoke the one container. I sense the generated Docker backend build script is then leveraged by the front end so the missing components get built in. I'll have to dig more on the front end unless you see different hooks for onnxruntime in the main repo.

TedThemistokleous avatar Jan 09 '24 04:01 TedThemistokleous

Looks like we need to hook the container build done in the backend into the front-end server by selecting the right options. The server builds the front end and then the other components from the initial repo. It all seems to be done through CMake, which, through a series of flags, leverages the build script build.py from the server repo.

I've pushed up the changes to the onnxruntime_backend (PR 231 in that repo), and from the triton-inference-server repo I'm invoking the following after adding changes to their CMake script to incorporate ROCm/MIGraphX:

python build.py --no-container-pull --enable-logging --enable-stats --enable-tracing --enable-rocm --endpoint=grpc --endpoint=http --backend=onnxruntime:pull/231/head

Previous attempts got me to where it looks like we were using the gen_ort_dockerfile.py script but failing on TensorRT-related build items, and that's what got me looking. Brian and I have been going back and forth to confirm whether we're seeing the same failures, and pair debugging.

I've now forked the server repo and run a custom build based on the requirements in their documentation found here: https://github.com/TedThemistokleous/server/tree/add_migraphx_rocm_hooks/docs/customization_guide

I've also added the server-repo-side changes, which I'm testing in the add_migraphx_rocm_hooks branch off my fork.

TedThemistokleous avatar Jan 10 '24 02:01 TedThemistokleous

Don't we need --enable-gpu ?

bpickrel avatar Jan 10 '24 17:01 bpickrel

Here's a tidbit from the issues page:

"By default, if GPU support is enabled, the base image is set to the Triton NGC min container, otherwise the ubuntu:22.04 image is used for a CPU-only build.

If you'd like to create a Triton container that is based on another image, you can set the flag --image base,<image_name> when running build.py."

bpickrel avatar Jan 10 '24 21:01 bpickrel

Hitting errors with CUDA libraries now. I think I need to start adding hipify links into the API if I'm not including --enable-gpu as a flag.

erver/_deps/repo-backend-src/src -I/tmp/tritonbuild/tritonserver/build/triton-server/_deps/repo-core-src/include -I/tmp/tritonbuild/tritonserver/build/triton-server/_deps/repo-common-src/src/../include -I/tmp/tritonbuild/tritonserver/build/triton-server/_deps/repo-common-src/include -O3 -DNDEBUG -fPIC -Wall -Wextra -Wno-unused-parameter -Werror -MD -MT _deps/repo-backend-build/CMakeFiles/triton-backend-utils.dir/src/backend_output_responder.cc.o -MF CMakeFiles/triton-backend-utils.dir/src/backend_output_responder.cc.o.d -o CMakeFiles/triton-backend-utils.dir/src/backend_output_responder.cc.o -c /tmp/tritonbuild/tritonserver/build/triton-server/_deps/repo-backend-src/src/backend_output_responder.cc
/workspace/src/memory_alloc.cc:27:10: fatal error: cuda_runtime_api.h: No such file or directory
   27 | #include <cuda_runtime_api.h>
      |          ^~~~~~~~~~~~~~~~~~~~
compilation terminated.

Running things with
python build.py --no-container-pull --enable-logging --enable-stats --enable-tracing --enable-rocm --enable-gpu --endpoint=grpc --endpoint=http --backend=onnxruntime:pull/231/head

TedThemistokleous avatar Jan 12 '24 14:01 TedThemistokleous

Just found this article on hipify-clang and hipify-perl. There's a real difference: hipify-perl does string substitutions, while hipify-clang uses a semantic parser (i.e. smarter for complex programs)

https://www.admin-magazine.com/HPC/Articles/Porting-CUDA-to-HIP
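As a rough sketch of usage (the file name is just an example): hipify-perl reads a CUDA source file and writes the translated text to stdout, while hipify-clang needs the file to actually compile, so it has to be given the same include paths and defines as the real build.

# Textual translation with hipify-perl; redirect to a new file, then diff against the original
hipify-perl src/memory_alloc.cc > src/memory_alloc.hip.cc
diff -u src/memory_alloc.cc src/memory_alloc.hip.cc | less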

bpickrel avatar Jan 30 '24 00:01 bpickrel

I re-ran the tutorial examples as described in my Nov. 17 comment on AWS, both to check out requirements for an AWS instance and to make sure that nothing has happened to break the examples. The tutorial still works with the same instructions, with some extra steps required for provisioning the AWS instance.

BUT I'm still getting a message that the Cuda GPU wasn't found, even though I specified instances that have a GPU. Need to investigate why.

Here are notes to myself to help streamline provisioning:

  1. I used instance type g3.8xlarge to run the server container. A smaller instance of type g4ad.4xlarge ran out of disk space for Docker. I haven't tried running the server and client on separate instances.
  2. Couldn't get a new SSH key to work with git, so I copied the working key files from my other computer. Must set permissions for the private key file to 600.
  3. May need to install docker and/or git. What's already installed seems to vary with the AWS instance type and whether it runs on Amazon Linux or Ubuntu.
  4. May need to use yum instead of apt to install programs. This seems to vary with the AWS instance type too.
  5. May need to start the Docker daemon with service docker start
  6. May need to run git and docker with sudo. Need to investigate the permissions requirement, but sudo works. (A rough sketch of the Amazon Linux variant follows this list.)
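For the Amazon Linux case, a rough sketch of the equivalent setup steps (package names are the usual ones but may vary by AMI):

sudo yum install -y docker git
sudo service docker start
# Optional: allow the current user to run docker without sudo (takes effect after re-login)
sudo usermod -aG docker $USER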

bpickrel avatar Jan 31 '24 18:01 bpickrel

Still trying to install the NVIDIA drivers in the AWS instance so that the Triton server will actually use the GPU. There are instructions at grid-driver but I'm currently trying to set credentials so that the aws command will work. Need to find out what GRID is and if it's what I want. Update: this page says that Tesla drivers, not Grid drivers, are for ML and other computational tasks. Update #2: this page gives a list that shows the g3.8xlarge does come with Tesla drivers installed. Back to working out why Triton says it can't find it.

Update #3: driver solved. They lied--the instance did not have a driver, but it can be installed with sudo apt install nvidia-driver-535; nvidia-smi But now it runs out of disk space when I run inference.

bpickrel avatar Jan 31 '24 23:01 bpickrel

Struggled a bit with trying to port over what onnxruntime does with hipify, as they perform a custom replace after the initial hipify step so that pieces of onnxruntime compile.

Applying the same approach to the Triton inference server leads into a non-obvious rabbit hole and takes away from the task at hand. I've asked Jeff Daily for help here since he's more familiar with how best to get things hipified/integrated into CMake.

In the meantime, I've gone over multiple files in the Triton server repo and run hipify-perl over them, as well as manually renamed every item with TRITON_ENABLE_GPU in the code for now. The intent here is to get a working compile before cleaning things up.

I've now hit a point, with a few more CMake changes, where I am compiling and just failing on the link step:

[ 53%] Linking CXX executable memory_alloc
/opt/conda/envs/py_3.9/lib/python3.9/site-packages/cmake/data/bin/cmake -E cmake_link_script CMakeFiles/memory_alloc.dir/link.txt --verbose=0
/usr/bin/ld: CMakeFiles/memory_alloc.dir/memory_alloc.cc.o: in function `(anonymous namespace)::ResponseRelease(TRITONSERVER_ResponseAllocator*, void*, void*, unsigned long, TRITONSERVER_memorytype_enum, long)':
memory_alloc.cc:(.text+0xa09): undefined reference to `hipSetDevice'
/usr/bin/ld: memory_alloc.cc:(.text+0xa18): undefined reference to `hipGetErrorString'
/usr/bin/ld: memory_alloc.cc:(.text+0xb14): undefined reference to `hipFree'
/usr/bin/ld: CMakeFiles/memory_alloc.dir/memory_alloc.cc.o: in function `(anonymous namespace)::ResponseAlloc(TRITONSERVER_ResponseAllocator*, char const*, unsigned long, TRITONSERVER_memorytype_enum, long, void*, void**, void**, TRITONSERVER_memorytype_enum*, long*)':
memory_alloc.cc:(.text+0x11df): undefined reference to `hipSetDevice'
/usr/bin/ld: memory_alloc.cc:(.text+0x11fe): undefined reference to `hipGetErrorString'
/usr/bin/ld: memory_alloc.cc:(.text+0x1479): undefined reference to `hipMalloc'
/usr/bin/ld: CMakeFiles/memory_alloc.dir/memory_alloc.cc.o: in function `(anonymous namespace)::gpu_data_deleter::{lambda(void*)#1}::operator()((anonymous namespace)::gpu_data_deleter) const [clone .constprop.0]':
memory_alloc.cc:(.text+0x1848): undefined reference to `hipSetDevice'
/usr/bin/ld: memory_alloc.cc:(.text+0x185b): undefined reference to `hipFree'
/usr/bin/ld: memory_alloc.cc:(.text+0x18ca): undefined reference to `hipGetErrorString'
/usr/bin/ld: memory_alloc.cc:(.text+0x1959): undefined reference to `hipGetErrorString'
/usr/bin/ld: CMakeFiles/memory_alloc.dir/memory_alloc.cc.o: in function `main':
memory_alloc.cc:(.text.startup+0x363c): undefined reference to `hipMemcpy'
/usr/bin/ld: memory_alloc.cc:(.text.startup+0x3690): undefined reference to `hipGetErrorString'
/usr/bin/ld: memory_alloc.cc:(.text.startup+0x36f0): undefined reference to `hipMemcpy'
/usr/bin/ld: memory_alloc.cc:(.text.startup+0x4313): undefined reference to `hipSetDevice'
/usr/bin/ld: memory_alloc.cc:(.text.startup+0x433e): undefined reference to `hipMalloc'
/usr/bin/ld: memory_alloc.cc:(.text.startup+0x4371): undefined reference to `hipMemcpy'
/usr/bin/ld: memory_alloc.cc:(.text.startup+0x4392): undefined reference to `hipMalloc'
/usr/bin/ld: memory_alloc.cc:(.text.startup+0x43c8): undefined reference to `hipMemcpy'
/usr/bin/ld: memory_alloc.cc:(.text.startup+0x4439): undefined reference to `hipGetErrorString'
/usr/bin/ld: memory_alloc.cc:(.text.startup+0x4a45): undefined reference to `hipGetErrorString'
/usr/bin/ld: memory_alloc.cc:(.text.startup+0x4b41): undefined reference to `hipGetErrorString'
collect2: error: ld returned 1 exit status

Need to sort out whether I'm missing some sort of dependency or directory setting here, as this server image builds both a multi and a simple version.
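The undefined hipSetDevice/hipMalloc/hipMemcpy references at link time suggest the HIP runtime library itself isn't on the link line; linking against hip::host (or -lamdhip64 directly) should pull it in. A quick check that the symbols are exported by the ROCm runtime (path assumes a default /opt/rocm install):

nm -D /opt/rocm/lib/libamdhip64.so | grep -E 'hipSetDevice|hipMalloc|hipMemcpy'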

Another thing I've noticed, which wasn't obvious when building the Onnxruntime_backend portion, is that they explicitly add onnxruntime EP hooks / setup code for the target EP in onnxruntime_backend/src/onnxruntime.cc:

 // Add execution providers if they are requested.
  // Don't need to ensure uniqueness of the providers, ONNX Runtime
  // will check it.

  // GPU execution providers
#ifdef TRITON_ENABLE_GPU
  if ((instance_group_kind == TRITONSERVER_INSTANCEGROUPKIND_GPU) ||
      (instance_group_kind == TRITONSERVER_INSTANCEGROUPKIND_AUTO)) {
    triton::common::TritonJson::Value optimization;
    if (model_config_.Find("optimization", &optimization)) {
      triton::common::TritonJson::Value eas;
      if (optimization.Find("execution_accelerators", &eas)) {
        triton::common::TritonJson::Value gpu_eas;
        if (eas.Find("gpu_execution_accelerator", &gpu_eas)) {
          for (size_t ea_idx = 0; ea_idx < gpu_eas.ArraySize(); ea_idx++) {
            triton::common::TritonJson::Value ea;
            RETURN_IF_ERROR(gpu_eas.IndexAsObject(ea_idx, &ea));
            std::string name;
            RETURN_IF_ERROR(ea.MemberAsString("name", &name));
#ifdef TRITON_ENABLE_ONNXRUNTIME_TENSORRT
            if (name == kTensorRTExecutionAccelerator) {

Adding MIGraphX and ROCm EP calls here shouldn't be a difficult task, as they should map to the same options used in the standard ONNX Runtime API.

This came up when I was searching for where the TRITON_ENABLE_GPU compile-time define is set, since it was originally gating functionality on the server. Kind of a lucky find, I suppose, since I think we would eventually have gotten the server and the backend compiled but probably would have gotten no output or errors at inference time.

TedThemistokleous avatar Feb 02 '24 05:02 TedThemistokleous

Got further along in the process with some suggestions from Paul. Removed the hiprtc hooks and am using just hip::host now for linkage. Getting farther in the compile.

Running up against some issues with the tests being compiled, as well as warnings from unused returns (nodiscard) on a few HIP function calls.

e.g. one in particular:

/workspace/src/shared_memory_manager.cc: In function ‘TRITONSERVER_Error* triton::server::{anonymous}::OpenCudaIPCRegion(const hipIpcMemHandle_t*, void**, int)’:
/workspace/src/shared_memory_manager.cc:205:8: error: unused variable ‘e’ [-Werror=unused-variable]
  205 |   auto e = hipSetDevice(device_id);
      |        ^

This seems to be the only warning, but I will probably try with the --warnings-not-errors flag.

I've commented out the unit tests right now since there seems to be a similar failure with one of the test units (sequence.cc), but I'm hoping the compile flag can resolve that too.

TedThemistokleous avatar Feb 06 '24 04:02 TedThemistokleous

Seems to be an issue now with libevent.

[100%] Linking CXX executable tritonserver
/opt/conda/envs/py_3.9/lib/python3.9/site-packages/cmake/data/bin/cmake -E cmake_link_script CMakeFiles/main.dir/link.txt --verbose=0
/usr/bin/ld: cannot find -levent_extra?: No such file or directory
collect2: error: ld returned 1 exit status

Libevent seems to be installed but not sure why this isn't being linked correctly still. More digging required.
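A couple of quick checks that might narrow this down, i.e. whether a library named libevent_extra exists on the build image at all and what the dynamic linker can see (diagnostics only, not a fix):

find / -name 'libevent_extra*' 2>/dev/null
ldconfig -p | grep -i event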

TedThemistokleous avatar Feb 06 '24 04:02 TedThemistokleous

Running the example on Amazon Web Services

The following still needs to be streamlined, but it works. This is the same example as before, but running on an AWS instance instead of one of our host machines. The biggest difference is that the AWS instance has an Nvidia GPU and runs a Cuda driver--before, Triton defaulted to using a CPU.

Note that this process requires both server and client Docker containers to be run on the same AWS instance and network with each other using --network=host . I have yet to work out how to open up the AWS instance and the server to Internet requests.

Also, I haven't explained how to create and connect to an AWS instance. Create an instance with the following attributes

  • Instance name: your choice
  • AMI Image: Ubuntu (AWS Linux will work, but the details will be different. For one thing, it uses yum instead of apt package manager.)
  • Type: g3.8xlarge Any other type that includes a GPU should work, but this is the one that's proven.
    • The listing in the AWS store says this instance type comes with Nvidia drivers installed, but it's not true. We will have to install them.
  • Key pair: bpickrel
  • Configure storage: 128 GiB. This is important since the default storage size doesn't leave enough disk space for the two Docker containers

Start a console

Add an ssh key pair, not shown. I did it by cutting and pasting existing keys.

     sudo apt-get install -y docker  docker.io  gcc make linux-headers-$(uname -r)  awscli
     echo that installed everything except nvidia-container-toolkit and  cuda-drivers which have to be fetched from Nvidia distributions.  A dialog appears to restart drivers \(accept defaults\)
     echo Fetch the Triton repository, go into it and fetch the models and backend config
     git clone [email protected]:triton-inference-server/server.git
     cd server/docs/examples
     ./fetch_models.sh
     echo "backend: \"onnxruntime\"" | tee -a model_repository/densenet_onnx/config.pbtxt

   echo   Go back to home directory \(optional\) to install nvidia-container-toolkit and CUDA drivers.  We will have to reboot afterwards.
   
   
     cd ~
     curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg   && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list |     sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' |     sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
     sudo apt update
     echo Install container toolkit. A dialog appears to restart drivers \(accept defaults\)
     sudo apt  install -y nvidia-container-toolkit
     sudo nvidia-ctk runtime configure --runtime=docker
     sudo apt-get upgrade -y linux-aws
     cat << EOF | sudo tee --append /etc/modprobe.d/blacklist.conf
     blacklist vga16fb
     blacklist nouveau
     blacklist rivafb
     blacklist nvidiafb
     blacklist rivatv
     EOF
     echo for grub add the following line    GRUB_CMDLINE_LINUX=\"rdblacklist=nouveau\"
     # sudo nano /etc/default/grub
     echo "GRUB_CMDLINE_LINUX=\"rdblacklist=nouveau\"" | sudo tee -a /etc/default/grub
     echo
     echo   Installation of the CUDA drivers for the server.
     echo see https://docs.nvidia.com/datacenter/tesla/tesla-installation-notes/index.html~ for the following
     echo we have already installed sudo apt-get install linux-headers-$\(uname -r\)
     distribution=$(. /etc/os-release;echo $ID$VERSION_ID | sed -e 's/\.//g')
     wget https://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64/cuda-keyring_1.0-1_all.deb
     sudo dpkg -i cuda-keyring_1.0-1_all.deb
     echo a warning tells us to do the following
     sudo apt-key del 7fa2af80
     sudo apt-get update
     echo Install CUDA drivers.  A dialog appears to restart drivers \(accept defaults\) but another message tells us we should reboot.
     sudo apt-get -y install cuda-drivers
     sudo shutdown now

In a new console, after rebooting the instance, start the server:

     cd ~/server/docs/examples
     sudo docker run --rm --net=host --gpus=1 -v ${PWD}/model_repository:/models nvcr.io/nvidia/tritonserver:23.04-py3 tritonserver --model-repository=/models

In a second console, run the client from any directory location

     echo this is second console
     sudo docker run -it --rm  --net=host  nvcr.io/nvidia/tritonserver:22.07-py3-sdk  /workspace/install/bin/image_client -m densenet_onnx -c 3 -s INCEPTION /workspace/images/mug.jpg

Update: removed the flags --runtime=nvidia --gpus all from the last (sudo docker) command line here. The client should have no need of a GPU and be able to run from the default runtime.
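One extra sanity check once the server container is up: Triton's HTTP endpoint defaults to port 8000, and the readiness probe below should return HTTP 200 once the models are loaded (assuming the default ports weren't changed).

curl -v localhost:8000/v2/health/ready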

bpickrel avatar Feb 08 '24 21:02 bpickrel

Good news finally. Got the server built now using ROCm and the hip::host libs.

Finally getting to the backend build. Need to sort out additional script pieces tomorrow for the onnxruntime backend and how the server build scripts interface with it

Cloning into 'onnxruntime'...
remote: Enumerating objects: 40, done.
remote: Counting objects: 100% (40/40), done.
remote: Compressing objects: 100% (33/33), done.
remote: Total 40 (delta 11), reused 20 (delta 2), pack-reused 0
Receiving objects: 100% (40/40), 56.45 KiB | 1.53 MiB/s, done.
Resolving deltas: 100% (11/11), done.
remote: Enumerating objects: 536, done.
remote: Counting objects: 100% (536/536), done.
remote: Compressing objects: 100% (248/248), done.
remote: Total 516 (delta 332), reused 394 (delta 222), pack-reused 0
Receiving objects: 100% (516/516), 126.90 KiB | 1.57 MiB/s, done.
Resolving deltas: 100% (332/332), completed with 15 local objects.
From https://github.com/triton-inference-server/onnxruntime_backend
 * [new ref]         refs/pull/231/head -> tritonbuildref
Switched to branch 'tritonbuildref'
CMake Error at CMakeLists.txt:369:
  Parse error.  Expected a newline, got identifier with text
  "TRITON_ENABLE_ROCM".


-- Configuring incomplete, errors occurred!
error: build failed

TedThemistokleous avatar Feb 14 '24 03:02 TedThemistokleous

Hello, I am really curious to know if Triton Inference Server can work on AMD GPUS ?

I always thought it would only work on Nvidia GPUs.

MatthieuToulemont avatar Feb 15 '24 14:02 MatthieuToulemont

@MatthieuToulemont We thought so too.

TedThemistokleous avatar Feb 15 '24 15:02 TedThemistokleous

Hello, I am really curious to know if Triton Inference Server can work on AMD GPUS ?

I always thought it would only work on Nvidia GPUs.

The existence of this issue should give you your answer. Triton is designed to allow it, but it has not yet been proven in practice. Triton allows the user to specify a backend at server startup time, and a backend can be built with a specified execution provider. We're trying to build and demonstrate this configuration with MIGraphX as the execution provider. MIGraphX is the inference engine for AMD GPUs.

bpickrel avatar Feb 15 '24 17:02 bpickrel

Just did a sanity check on this as I was still having issues with the backend piece.

Looks like the base container of the server builds okay with ROCm. Now I need to add in the Onnxruntime piece and figure out why we're still pulling the Nvidia container instead of using the container specified here as BASE_IMAGE (a possible fix is sketched after the build log below).

Sending build context to Docker daemon  211.6MB
Step 1/10 : ARG TRITON_VERSION=2.39.0
Step 2/10 : ARG TRITON_CONTAINER_VERSION=23.10
Step 3/10 : ARG BASE_IMAGE=rocm/pytorch:rocm6.0_ubuntu22.04_py3.9_pytorch_2.0.1
Step 4/10 : FROM ${BASE_IMAGE}
 ---> 08497136e834
Step 5/10 : ARG TRITON_VERSION
 ---> Using cache
 ---> 2adf67eb6205
Step 6/10 : ARG TRITON_CONTAINER_VERSION
 ---> Using cache
 ---> c01b8b62ced1
Step 7/10 : COPY build/ci /workspace
 ---> 8db6b80fa205
Step 8/10 : WORKDIR /workspace
 ---> Running in f69a8e5bc47d
Removing intermediate container f69a8e5bc47d
 ---> de375e972281
Step 9/10 : ENV TRITON_SERVER_VERSION ${TRITON_VERSION}
 ---> Running in 057ba8e50258
Removing intermediate container 057ba8e50258
 ---> 2f43f5ad165e
Step 10/10 : ENV NVIDIA_TRITON_SERVER_VERSION ${TRITON_CONTAINER_VERSION}
 ---> Running in 2a71ecb47c02
Removing intermediate container 2a71ecb47c02
 ---> 54f780d1e798
Successfully built 54f780d1e798
Successfully tagged tritonserver_cibase:latest
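If the NGC min container is still being pulled despite the BASE_IMAGE above, the --image flag quoted earlier in this thread may be the missing piece. A hedged guess at the invocation, reusing the flags from the previous build.py runs and the ROCm PyTorch tag from this log:

python build.py --no-container-pull --enable-logging --enable-stats --enable-tracing --enable-rocm --image base,rocm/pytorch:rocm6.0_ubuntu22.04_py3.9_pytorch_2.0.1 --endpoint=grpc --endpoint=http --backend=onnxruntime:pull/231/head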

TedThemistokleous avatar Feb 17 '24 05:02 TedThemistokleous

Seeing an odd error now with the Onnxruntime build. Resolved a few issues when starting the ORT build.

The initial one was that the dockerfile wasn't being generated due to a slew of backed-up changes. Once that was resolved, I'm now seeing this when building off 6.0.2 using a released container with torch:

  include could not find requested file:

    ROCMHeaderWrapper

TedThemistokleous avatar Feb 23 '24 21:02 TedThemistokleous

Getting an ORT build now (step 27). The tail-end placement of libs seems to have changed. Sorting this out before the backend build completes (a quick diagnostic is sketched after the log).

 => [26/41] WORKDIR /workspace/onnxruntime                                                                                                                                                                                                                                                                                           0.0s
 => [27/41] RUN ./build.sh --config Release --skip_submodule_sync --parallel --build_shared_lib         --build_dir /workspace/build --cmake_extra_defines CMAKE_HIP_COMPILER=/opt/rocm/llvm/bin/clang++  --update --build --use_rocm --allow_running_as_root --rocm_version "6.0.2" --rocm_home "/opt/rocm/" --use_migraphx --m  3265.4s
 => [28/41] WORKDIR /opt/onnxruntime                                                                                                                                                                                                                                                                                                 0.0s
 => [29/41] RUN mkdir -p /opt/onnxruntime &&         cp /workspace/onnxruntime/LICENSE /opt/onnxruntime &&         cat /workspace/onnxruntime/cmake/external/onnx/VERSION_NUMBER > /opt/onnxruntime/ort_onnx_version.txt                                                                                                             0.4s
 => [30/41] RUN mkdir -p /opt/onnxruntime/include &&         cp /workspace/onnxruntime/include/onnxruntime/core/session/onnxruntime_c_api.h         /opt/onnxruntime/include &&         cp /workspace/onnxruntime/include/onnxruntime/core/session/onnxruntime_session_options_config_keys.h         /opt/onnxruntime/include &&     0.4s
 => [31/41] RUN mkdir -p /opt/onnxruntime/lib &&         cp /workspace/build/Release/libonnxruntime_providers_shared.so         /opt/onnxruntime/lib &&         cp /workspace/build/Release/libonnxruntime.so         /opt/onnxruntime/lib                                                                                           0.4s
 => [32/41] RUN mkdir -p /opt/onnxruntime/bin &&         cp /workspace/build/Release/onnxruntime_perf_test         /opt/onnxruntime/bin &&         cp /workspace/build/Release/onnx_test_runner         /opt/onnxruntime/bin &&         (cd /opt/onnxruntime/bin && chmod a+x *)                                                     0.4s
 => [33/41] RUN cp /workspace/build/Release/libonnxruntime_providers_rocm.so         /opt/onnxruntime/lib                                                                                                                                                                                                                            1.1s
 => ERROR [34/41] RUN cp /workspace/onnxruntime/include/onnxruntime/core/providers/migraphx/migraphx_provider_factory.h         /opt/onnxruntime/include &&         cp /workspace/build/Release/libonnxruntime_providers_migraphx.so         /opt/onnxruntime/lib                                                                    0.4s
------
 > importing cache manifest from tritonserver_onnxruntime:
------
------
 > importing cache manifest from tritonserver_onnxruntime_cache0:
------
------
 > importing cache manifest from tritonserver_onnxruntime_cache1:
------
------
 > [34/41] RUN cp /workspace/onnxruntime/include/onnxruntime/core/providers/migraphx/migraphx_provider_factory.h         /opt/onnxruntime/include &&         cp /workspace/build/Release/libonnxruntime_providers_migraphx.so         /opt/onnxruntime/lib:
0.362 cp: cannot stat '/workspace/onnxruntime/include/onnxruntime/core/providers/migraphx/migraphx_provider_factory.h': No such file or directory
------
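A quick check of where (or whether) that header and the MIGraphX provider library actually landed in the build tree before step 34 runs:

find /workspace -name migraphx_provider_factory.h 2>/dev/null
find /workspace/build -name 'libonnxruntime_providers_migraphx.so' 2>/dev/null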

TedThemistokleous avatar Feb 27 '24 03:02 TedThemistokleous