AMDMIGraphX
Add weight streaming
Add weight streaming to allow running of large models on GPUs with low memory.
Closes #3156.
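For context, weight streaming here means the weights stay in host memory and are copied to the GPU as they are needed during evaluation, rather than being resident on the device up front. The sketch below is a minimal, hypothetical illustration of that pattern in plain HIP (two streams, double-buffered copies overlapped with compute); it is not the MIGraphX implementation, and the `layer_weights`/`run_layer` names are made up for illustration only.

```cpp
#include <hip/hip_runtime.h>
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical per-layer weight description, used only for this illustration.
struct layer_weights
{
    const float* host_ptr; // pinned host buffer holding this layer's weights
    std::size_t bytes;
};

// run_layer stands in for "execute layer i using weights at this device address".
void run_with_weight_streaming(const std::vector<layer_weights>& layers,
                               void (*run_layer)(std::size_t, const float*, hipStream_t))
{
    hipStream_t compute, copy;
    hipStreamCreate(&compute);
    hipStreamCreate(&copy);

    // Only two device buffers are kept alive and reused round-robin, instead of
    // holding every layer's weights resident on the GPU at once.
    std::size_t max_bytes = 0;
    for(const auto& l : layers)
        max_bytes = std::max(max_bytes, l.bytes);
    float* staging[2] = {nullptr, nullptr};
    hipMalloc(reinterpret_cast<void**>(&staging[0]), max_bytes);
    hipMalloc(reinterpret_cast<void**>(&staging[1]), max_bytes);

    hipEvent_t copied[2], consumed[2];
    for(int b = 0; b < 2; ++b)
    {
        hipEventCreate(&copied[b]);
        hipEventCreate(&consumed[b]);
    }

    for(std::size_t i = 0; i < layers.size(); ++i)
    {
        float* dst = staging[i % 2];
        // Do not overwrite this buffer until the layer that used it two
        // iterations ago has finished on the compute stream.
        if(i >= 2)
            hipStreamWaitEvent(copy, consumed[i % 2], 0);
        // Stream this layer's weights on the copy stream; the transfer overlaps
        // with whatever the compute stream is still executing.
        hipMemcpyAsync(dst, layers[i].host_ptr, layers[i].bytes,
                       hipMemcpyHostToDevice, copy);
        hipEventRecord(copied[i % 2], copy);
        // The compute stream waits only for this layer's copy, not for all copies.
        hipStreamWaitEvent(compute, copied[i % 2], 0);
        run_layer(i, dst, compute);
        hipEventRecord(consumed[i % 2], compute);
    }
    hipStreamSynchronize(compute);

    for(int b = 0; b < 2; ++b)
    {
        hipEventDestroy(copied[b]);
        hipEventDestroy(consumed[b]);
        hipFree(staging[b]);
    }
    hipStreamDestroy(compute);
    hipStreamDestroy(copy);
}
```

The trade-off is that the per-layer copies now show up in the per-inference profile (the hip::copy_to_gpu entries in the driver output further down), which is why keeping per-copy overhead low matters.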
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 92.17%. Comparing base (ddc4c0c) to head (0fc282e).
| Coverage Diff | develop | #3222 | +/- |
|---|---|---|---|
| Coverage | 92.17% | 92.17% | |
| Files | 512 | 512 | |
| Lines | 21387 | 21387 | |
| Hits | 19714 | 19714 | |
| Misses | 1673 | 1673 | |
17 of the 18 test failures right now in test/schedule_test are due to the check `EXPECT(not t.has_stream(one));`. It fails because I have added both literals and their associated copy_to_gpu instructions to a stream. Should literals therefore not be placed on a stream?
From some very basic perf testing, adding the literals onto the stream along with their copies seems to improve performance very marginally. I'm still not sure whether these test failures can be ignored, since this is a new feature.
CI seems to be stuck on the Windows build for some reason. I'm looking into why, but I would appreciate any pointers.
Removed call to make_shared_array to avoid slowdown due to std::copy.
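To illustrate the kind of slowdown meant here (a simplified sketch, not the actual make_shared_array code): allocating a fresh shared buffer and std::copy-ing every literal into it pays an allocation plus a full byte copy per literal, whereas handing out a pointer that aliases the literal's existing storage costs nothing.

```cpp
#include <algorithm>
#include <cstddef>
#include <memory>
#include <vector>

// Slow path: each call allocates a new buffer and copies all of the bytes.
std::shared_ptr<char[]> copy_into_shared_array(const char* data, std::size_t bytes)
{
    std::shared_ptr<char[]> buffer(new char[bytes]);
    std::copy(data, data + bytes, buffer.get());
    return buffer;
}

// Cheap path: the aliasing shared_ptr constructor shares ownership of the
// literal's existing buffer without touching the data at all.
std::shared_ptr<const char> alias_existing(const std::shared_ptr<const std::vector<char>>& owner)
{
    return {owner, owner->data()};
}
```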
Running ./build/bin/driver perf models/resnet50-v2-7.onnx --enable-weight-streaming:
Before:
Summary:
hip::copy_to_gpu: 55.82ms / 117 = 0.477094ms, 61%
@literal: 31.1419ms / 117 = 0.26617ms, 34%
gpu::miopen_fusion: 1.33101ms / 12 = 0.110917ms, 2%
gpu::code_object::mlir_convolution_add_relu: 0.784531ms / 17 = 0.0461489ms, 1%
gpu::convolution: 0.573863ms / 4 = 0.143466ms, 1%
load: 0.317283ms / 191 = 0.00166117ms, 1%
gpu::code_object::mlir_convolution_add: 0.278956ms / 8 = 0.0348696ms, 1%
gpu::code_object::mul_add_relu_kernel: 0.244727ms / 8 = 0.0305909ms, 1%
gpu::code_object::mlir_convolution: 0.236977ms / 5 = 0.0473954ms, 1%
gpu::code_object::mlir_convolution_mul_add_relu: 0.232219ms / 4 = 0.0580549ms, 1%
multibroadcast: 0.206982ms / 67 = 0.00308928ms, 1%
gpu::gemm: 0.167417ms / 1 = 0.167417ms, 1%
gpu::code_object::mlir_convolution_mul_add_add_relu: 0.124884ms / 3 = 0.0416281ms, 1%
gpu::pooling: 0.105871ms / 1 = 0.105871ms, 1%
gpu::code_object::add_relu_noop_concat_noop_kernel: 0.085841ms / 3 = 0.0286137ms, 1%
gpu::code_object::mul_kernel: 0.0637746ms / 2 = 0.0318873ms, 1%
gpu::code_object::concat_add_relu_kernel: 0.0290406ms / 1 = 0.0290406ms, 1%
gpu::code_object::mul_add_add_relu_mul_reduce_sum_kernel: 0.02897ms / 1 = 0.02897ms, 1%
gpu::code_object::add_relu_kernel: 0.0275658ms / 1 = 0.0275658ms, 1%
step: 0.0107792ms / 3 = 0.00359307ms, 1%
hip::hip_copy_literal: 0.009445ms / 2 = 0.0047225ms, 1%
@param: 0.0049534ms / 2 = 0.0024767ms, 1%
hip::hip_allocate_memory: 0.0043382ms / 1 = 0.0043382ms, 1%
check_context::migraphx::gpu::context: 0.0036888ms / 1 = 0.0036888ms, 1%
reshape_lazy: 0.003683ms / 1 = 0.003683ms, 1%
Batch size: 1
Rate: 19.0196 inferences/sec
Total time: 52.5772ms
Total instructions time: 91.8387ms
Overhead time: 0.135608ms, -39.2615ms
Overhead: 0%, -75%
After:
Summary:
hip::copy_to_gpu: 6.37803ms / 117 = 0.0545131ms, 60%
gpu::miopen_fusion: 1.12564ms / 12 = 0.0938037ms, 11%
gpu::code_object::mlir_convolution_add_relu: 0.747901ms / 17 = 0.0439942ms, 7%
load: 0.378693ms / 191 = 0.00198269ms, 4%
gpu::convolution: 0.349078ms / 4 = 0.0872696ms, 4%
gpu::code_object::mlir_convolution_add: 0.267458ms / 8 = 0.0334322ms, 3%
multibroadcast: 0.257707ms / 67 = 0.00384638ms, 3%
gpu::code_object::mlir_convolution: 0.225739ms / 5 = 0.0451478ms, 3%
gpu::code_object::mul_add_relu_kernel: 0.220406ms / 8 = 0.0275508ms, 3%
gpu::code_object::mlir_convolution_mul_add_relu: 0.212062ms / 4 = 0.0530154ms, 2%
gpu::code_object::mlir_convolution_mul_add_add_relu: 0.122514ms / 3 = 0.040838ms, 2%
gpu::code_object::add_relu_noop_concat_noop_kernel: 0.0843656ms / 3 = 0.0281219ms, 1%
gpu::gemm: 0.0727626ms / 1 = 0.0727626ms, 1%
gpu::pooling: 0.067244ms / 1 = 0.067244ms, 1%
gpu::code_object::mul_kernel: 0.051746ms / 2 = 0.025873ms, 1%
@literal: 0.0353714ms / 117 = 0.00030232ms, 1%
gpu::code_object::concat_add_relu_kernel: 0.0287566ms / 1 = 0.0287566ms, 1%
gpu::code_object::mul_add_add_relu_mul_reduce_sum_kernel: 0.0274078ms / 1 = 0.0274078ms, 1%
gpu::code_object::add_relu_kernel: 0.0251514ms / 1 = 0.0251514ms, 1%
step: 0.0067122ms / 3 = 0.0022374ms, 1%
hip::hip_copy_literal: 0.0028988ms / 2 = 0.0014494ms, 1%
reshape_lazy: 0.0017368ms / 1 = 0.0017368ms, 1%
@param: 0.0017358ms / 2 = 0.0008679ms, 1%
hip::hip_allocate_memory: 0.0012642ms / 1 = 0.0012642ms, 1%
check_context::migraphx::gpu::context: 0.0010434ms / 1 = 0.0010434ms, 1%
Batch size: 1
Rate: 162.816 inferences/sec
Total time: 6.1419ms
Total instructions time: 10.6934ms
Overhead time: 0.132798ms, -4.55153ms
Overhead: 2%, -74%
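Putting the two runs side by side: removing the extra copy takes the measured rate from 19.0196 to 162.816 inferences/sec (roughly 8.6x). The gain comes almost entirely from hip::copy_to_gpu dropping from 55.82ms to 6.38ms total across the same 117 calls, and @literal dropping from 31.14ms to 0.035ms.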
The Windows build is failing right now; it seems to be a problem with fuse_mlir from develop.
All good. That'll get resolved on its own once develop changes.
Make sure to add some more test coverage as well. Codecov/patch says you have about 62.50% coverage on your change set. We try to maintain about 92-93% coverage, so add a few more unit tests that exercise the code paths you've added.
I think the changes in src/program.cpp are mostly due to the addition of a bool parameter to generic_eval(), which wasn't originally covered. I can add some tests to cover that, although it would require setting an env var to change the trace level. Is there a way to do this from within the tests?
Discussed with Umang and Alan; we'll ignore the codecov failure for now, as the uncovered code path is purely for debugging. I can add coverage later if needed, provided it's possible to set env vars inside a test.
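If it becomes worth covering later, one possible approach is a small RAII guard around POSIX setenv/unsetenv so the variable is only set for the duration of the test. This is a sketch under assumptions: the variable name MIGRAPHX_TRACE_EVAL and the surrounding harness are illustrative, not taken from this PR.

```cpp
#include <cstdlib>
#include <string>

// RAII guard that sets an environment variable for the duration of a scope.
// setenv/unsetenv are POSIX; a Windows build would need _putenv instead.
struct scoped_env_var
{
    std::string name;
    scoped_env_var(const std::string& n, const std::string& value) : name(n)
    {
        setenv(name.c_str(), value.c_str(), /*overwrite=*/1);
    }
    ~scoped_env_var() { unsetenv(name.c_str()); }
};

void trace_eval_coverage_test()
{
    // Hypothetical variable controlling the eval trace level.
    scoped_env_var trace{"MIGRAPHX_TRACE_EVAL", "2"};
    // ... build a small migraphx::program and call eval() here so the
    // trace branch added in src/program.cpp gets exercised.
}
```

This only helps if the trace level is read at eval time rather than cached on first use; otherwise setting the variable per-test would not take effect reliably.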
Based on this trace, it appears the copies are streaming properly on a separate stream, although there is quite a large delay.
| Test | Batch | Rate new 0fc282 | Rate old ddc4c0 | Diff | Compare |
|---|---|---|---|---|---|
| torchvision-resnet50 | 64 | 3,261.48 | 3,261.96 | -0.01% | :white_check_mark: |
| torchvision-resnet50_fp16 | 64 | 6,988.20 | 6,986.50 | 0.02% | :white_check_mark: |
| torchvision-densenet121 | 32 | 2,436.70 | 2,435.05 | 0.07% | :white_check_mark: |
| torchvision-densenet121_fp16 | 32 | 4,076.48 | 4,098.64 | -0.54% | :white_check_mark: |
| torchvision-inceptionv3 | 32 | 1,638.92 | 1,637.72 | 0.07% | :white_check_mark: |
| torchvision-inceptionv3_fp16 | 32 | 2,762.55 | 2,760.21 | 0.08% | :white_check_mark: |
| cadene-inceptionv4 | 16 | 775.80 | 776.83 | -0.13% | :white_check_mark: |
| cadene-resnext64x4 | 16 | 810.72 | 812.01 | -0.16% | :white_check_mark: |
| slim-mobilenet | 64 | 7,532.36 | 7,539.31 | -0.09% | :white_check_mark: |
| slim-nasnetalarge | 64 | 211.49 | 211.58 | -0.04% | :white_check_mark: |
| slim-resnet50v2 | 64 | 3,505.22 | 3,507.45 | -0.06% | :white_check_mark: |
| bert-mrpc-onnx | 8 | 1,147.43 | 1,148.38 | -0.08% | :white_check_mark: |
| bert-mrpc-tf | 1 | 493.92 | 462.39 | 6.82% | :high_brightness: |
| pytorch-examples-wlang-gru | 1 | 434.13 | 416.01 | 4.36% | :high_brightness: |
| pytorch-examples-wlang-lstm | 1 | 405.95 | 387.25 | 4.83% | :high_brightness: |
| torchvision-resnet50_1 | 1 | 797.47 | 799.30 | -0.23% | :white_check_mark: |
| cadene-dpn92_1 | 1 | 398.08 | 413.07 | -3.63% | :red_circle: |
| cadene-resnext101_1 | 1 | 384.99 | 384.54 | 0.12% | :white_check_mark: |
| onnx-taau-downsample | 1 | 343.18 | 343.19 | -0.00% | :white_check_mark: |
| dlrm-criteoterabyte | 1 | 33.33 | 33.33 | 0.02% | :white_check_mark: |
| dlrm-criteoterabyte_fp16 | 1 | 52.74 | 52.75 | -0.02% | :white_check_mark: |
| agentmodel | 1 | 8,118.46 | 8,254.64 | -1.65% | :white_check_mark: |
| unet_fp16 | 2 | 58.84 | 58.74 | 0.17% | :white_check_mark: |
| resnet50v1_fp16 | 1 | 940.57 | 942.04 | -0.16% | :white_check_mark: |
| resnet50v1_int8 | 1 | 996.35 | 993.52 | 0.28% | :white_check_mark: |
| bert_base_cased_fp16 | 64 | 1,170.71 | 1,171.05 | -0.03% | :white_check_mark: |
| bert_large_uncased_fp16 | 32 | 363.57 | 363.48 | 0.02% | :white_check_mark: |
| bert_large_fp16 | 1 | 199.12 | 200.70 | -0.79% | :white_check_mark: |
| distilgpt2_fp16 | 16 | 2,202.47 | 2,201.26 | 0.05% | :white_check_mark: |
| yolov5s | 1 | 533.05 | 537.45 | -0.82% | :white_check_mark: |
| tinyllama | 1 | 43.41 | 43.68 | -0.62% | :white_check_mark: |
| vicuna-fastchat | 1 | 179.88 | 175.99 | 2.21% | :white_check_mark: |
| whisper-tiny-encoder | 1 | 416.38 | 418.69 | -0.55% | :white_check_mark: |
| whisper-tiny-decoder | 1 | 428.82 | 428.55 | 0.06% | :white_check_mark: |
This build is not recommended to merge :red_circle:
:red_circle: bert_large_uncased_fp16: FAILED: MIGraphX is not within tolerance - check verbose output