AMDMIGraphX
Add weight streaming
Add weight streaming to allow running of large models on GPUs with low memory.
Closes #3156.
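For context, weight streaming here means the weights stay in host memory and are copied to the GPU as they are needed during evaluation, rather than being resident on the device up front. The sketch below is a minimal, hypothetical illustration of that pattern in plain HIP (two streams, double-buffered copies overlapped with compute); it is not the MIGraphX implementation, and the `layer_weights`/`run_layer` names are made up for illustration only.

```cpp
#include <hip/hip_runtime.h>
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical per-layer weight description, used only for this illustration.
struct layer_weights
{
    const float* host_ptr; // pinned host buffer holding this layer's weights
    std::size_t bytes;
};

// run_layer stands in for "execute layer i using weights at this device address".
void run_with_weight_streaming(const std::vector<layer_weights>& layers,
                               void (*run_layer)(std::size_t, const float*, hipStream_t))
{
    hipStream_t compute, copy;
    hipStreamCreate(&compute);
    hipStreamCreate(&copy);

    // Only two device buffers are kept alive and reused round-robin, instead of
    // holding every layer's weights resident on the GPU at once.
    std::size_t max_bytes = 0;
    for(const auto& l : layers)
        max_bytes = std::max(max_bytes, l.bytes);
    float* staging[2] = {nullptr, nullptr};
    hipMalloc(reinterpret_cast<void**>(&staging[0]), max_bytes);
    hipMalloc(reinterpret_cast<void**>(&staging[1]), max_bytes);

    hipEvent_t copied[2], consumed[2];
    for(int b = 0; b < 2; ++b)
    {
        hipEventCreate(&copied[b]);
        hipEventCreate(&consumed[b]);
    }

    for(std::size_t i = 0; i < layers.size(); ++i)
    {
        float* dst = staging[i % 2];
        // Do not overwrite this buffer until the layer that used it two
        // iterations ago has finished on the compute stream.
        if(i >= 2)
            hipStreamWaitEvent(copy, consumed[i % 2], 0);
        // Stream this layer's weights on the copy stream; the transfer overlaps
        // with whatever the compute stream is still executing.
        hipMemcpyAsync(dst, layers[i].host_ptr, layers[i].bytes,
                       hipMemcpyHostToDevice, copy);
        hipEventRecord(copied[i % 2], copy);
        // The compute stream waits only for this layer's copy, not for all copies.
        hipStreamWaitEvent(compute, copied[i % 2], 0);
        run_layer(i, dst, compute);
        hipEventRecord(consumed[i % 2], compute);
    }
    hipStreamSynchronize(compute);

    for(int b = 0; b < 2; ++b)
    {
        hipEventDestroy(copied[b]);
        hipEventDestroy(consumed[b]);
        hipFree(staging[b]);
    }
    hipStreamDestroy(compute);
    hipStreamDestroy(copy);
}
```

The trade-off is that the per-layer copies now show up in the per-inference profile (the hip::copy_to_gpu entries in the driver output further down), which is why keeping per-copy overhead low matters.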
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 92.17%. Comparing base (ddc4c0c) to head (0fc282e).
| Coverage Diff | develop | #3222 | +/- |
|---|---|---|---|
| Coverage | 92.17% | 92.17% | |
| Files | 512 | 512 | |
| Lines | 21387 | 21387 | |
| Hits | 19714 | 19714 | |
| Misses | 1673 | 1673 | |
17 of the 18 test failures right now in test/schedule_test are due to the check `EXPECT(not t.has_stream(one));`. It fails because I have added both literals and their associated copy_to_gpu instructions to a stream. Should literals therefore not be placed on a stream?
From some very basic perf testing, adding the literals onto the stream along with their copies seems to improve performance very marginally. I'm still not sure whether these test failures can be ignored, since this is a new feature.
CI seems to be stuck on the Windows build for some reason. I'm looking into why, but I would appreciate any pointers.
Removed call to make_shared_array to avoid slowdown due to std::copy.
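To illustrate the kind of slowdown meant here (a simplified sketch, not the actual make_shared_array code): allocating a fresh shared buffer and std::copy-ing every literal into it pays an allocation plus a full byte copy per literal, whereas handing out a pointer that aliases the literal's existing storage costs nothing.

```cpp
#include <algorithm>
#include <cstddef>
#include <memory>
#include <vector>

// Slow path: each call allocates a new buffer and copies all of the bytes.
std::shared_ptr<char[]> copy_into_shared_array(const char* data, std::size_t bytes)
{
    std::shared_ptr<char[]> buffer(new char[bytes]);
    std::copy(data, data + bytes, buffer.get());
    return buffer;
}

// Cheap path: the aliasing shared_ptr constructor shares ownership of the
// literal's existing buffer without touching the data at all.
std::shared_ptr<const char> alias_existing(const std::shared_ptr<const std::vector<char>>& owner)
{
    return {owner, owner->data()};
}
```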
Running ./build/bin/driver perf models/resnet50-v2-7.onnx --enable-weight-streaming:
Before:
Summary:
hip::copy_to_gpu: 55.82ms / 117 = 0.477094ms, 61%
@literal: 31.1419ms / 117 = 0.26617ms, 34%
gpu::miopen_fusion: 1.33101ms / 12 = 0.110917ms, 2%
gpu::code_object::mlir_convolution_add_relu: 0.784531ms / 17 = 0.0461489ms, 1%
gpu::convolution: 0.573863ms / 4 = 0.143466ms, 1%
load: 0.317283ms / 191 = 0.00166117ms, 1%
gpu::code_object::mlir_convolution_add: 0.278956ms / 8 = 0.0348696ms, 1%
gpu::code_object::mul_add_relu_kernel: 0.244727ms / 8 = 0.0305909ms, 1%
gpu::code_object::mlir_convolution: 0.236977ms / 5 = 0.0473954ms, 1%
gpu::code_object::mlir_convolution_mul_add_relu: 0.232219ms / 4 = 0.0580549ms, 1%
multibroadcast: 0.206982ms / 67 = 0.00308928ms, 1%
gpu::gemm: 0.167417ms / 1 = 0.167417ms, 1%
gpu::code_object::mlir_convolution_mul_add_add_relu: 0.124884ms / 3 = 0.0416281ms, 1%
gpu::pooling: 0.105871ms / 1 = 0.105871ms, 1%
gpu::code_object::add_relu_noop_concat_noop_kernel: 0.085841ms / 3 = 0.0286137ms, 1%
gpu::code_object::mul_kernel: 0.0637746ms / 2 = 0.0318873ms, 1%
gpu::code_object::concat_add_relu_kernel: 0.0290406ms / 1 = 0.0290406ms, 1%
gpu::code_object::mul_add_add_relu_mul_reduce_sum_kernel: 0.02897ms / 1 = 0.02897ms, 1%
gpu::code_object::add_relu_kernel: 0.0275658ms / 1 = 0.0275658ms, 1%
step: 0.0107792ms / 3 = 0.00359307ms, 1%
hip::hip_copy_literal: 0.009445ms / 2 = 0.0047225ms, 1%
@param: 0.0049534ms / 2 = 0.0024767ms, 1%
hip::hip_allocate_memory: 0.0043382ms / 1 = 0.0043382ms, 1%
check_context::migraphx::gpu::context: 0.0036888ms / 1 = 0.0036888ms, 1%
reshape_lazy: 0.003683ms / 1 = 0.003683ms, 1%
Batch size: 1
Rate: 19.0196 inferences/sec
Total time: 52.5772ms
Total instructions time: 91.8387ms
Overhead time: 0.135608ms, -39.2615ms
Overhead: 0%, -75%
After:
Summary:
hip::copy_to_gpu: 6.37803ms / 117 = 0.0545131ms, 60%
gpu::miopen_fusion: 1.12564ms / 12 = 0.0938037ms, 11%
gpu::code_object::mlir_convolution_add_relu: 0.747901ms / 17 = 0.0439942ms, 7%
load: 0.378693ms / 191 = 0.00198269ms, 4%
gpu::convolution: 0.349078ms / 4 = 0.0872696ms, 4%
gpu::code_object::mlir_convolution_add: 0.267458ms / 8 = 0.0334322ms, 3%
multibroadcast: 0.257707ms / 67 = 0.00384638ms, 3%
gpu::code_object::mlir_convolution: 0.225739ms / 5 = 0.0451478ms, 3%
gpu::code_object::mul_add_relu_kernel: 0.220406ms / 8 = 0.0275508ms, 3%
gpu::code_object::mlir_convolution_mul_add_relu: 0.212062ms / 4 = 0.0530154ms, 2%
gpu::code_object::mlir_convolution_mul_add_add_relu: 0.122514ms / 3 = 0.040838ms, 2%
gpu::code_object::add_relu_noop_concat_noop_kernel: 0.0843656ms / 3 = 0.0281219ms, 1%
gpu::gemm: 0.0727626ms / 1 = 0.0727626ms, 1%
gpu::pooling: 0.067244ms / 1 = 0.067244ms, 1%
gpu::code_object::mul_kernel: 0.051746ms / 2 = 0.025873ms, 1%
@literal: 0.0353714ms / 117 = 0.00030232ms, 1%
gpu::code_object::concat_add_relu_kernel: 0.0287566ms / 1 = 0.0287566ms, 1%
gpu::code_object::mul_add_add_relu_mul_reduce_sum_kernel: 0.0274078ms / 1 = 0.0274078ms, 1%
gpu::code_object::add_relu_kernel: 0.0251514ms / 1 = 0.0251514ms, 1%
step: 0.0067122ms / 3 = 0.0022374ms, 1%
hip::hip_copy_literal: 0.0028988ms / 2 = 0.0014494ms, 1%
reshape_lazy: 0.0017368ms / 1 = 0.0017368ms, 1%
@param: 0.0017358ms / 2 = 0.0008679ms, 1%
hip::hip_allocate_memory: 0.0012642ms / 1 = 0.0012642ms, 1%
check_context::migraphx::gpu::context: 0.0010434ms / 1 = 0.0010434ms, 1%
Batch size: 1
Rate: 162.816 inferences/sec
Total time: 6.1419ms
Total instructions time: 10.6934ms
Overhead time: 0.132798ms, -4.55153ms
Overhead: 2%, -74%
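Putting the two runs side by side: removing the extra copy takes the measured rate from 19.0196 to 162.816 inferences/sec (roughly 8.6x). The gain comes almost entirely from hip::copy_to_gpu dropping from 55.82ms to 6.38ms total across the same 117 calls, and @literal dropping from 31.14ms to 0.035ms.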
The Windows build is failing right now; it seems to be a problem with fuse_mlir from develop.
All good. That'll get resolved on its own once develop changes.
Make sure to add some more test coverage as well. Codecov/patch says you have about 62.50% coverage on your change set. We try to maintain about 92-93% coverage, so add a few more unit tests that exercise the code paths you've added.
I think the changes in src/program.cpp are mostly due to the addition of a bool parameter to generic_eval(), which wasn't originally covered. I can add some tests to cover that, although it would require setting an env var to change the trace level. Is there a way to do this from within the tests?
Discussed with Umang and Alan; we'll ignore the codecov failure for now, as the uncovered code path is purely for debugging. I can add coverage later if needed, provided it's possible to set env vars inside a test.
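If it becomes worth covering later, one possible approach is a small RAII guard around POSIX setenv/unsetenv so the variable is only set for the duration of the test. This is a sketch under assumptions: the variable name MIGRAPHX_TRACE_EVAL and the surrounding harness are illustrative, not taken from this PR.

```cpp
#include <cstdlib>
#include <string>

// RAII guard that sets an environment variable for the duration of a scope.
// setenv/unsetenv are POSIX; a Windows build would need _putenv instead.
struct scoped_env_var
{
    std::string name;
    scoped_env_var(const std::string& n, const std::string& value) : name(n)
    {
        setenv(name.c_str(), value.c_str(), /*overwrite=*/1);
    }
    ~scoped_env_var() { unsetenv(name.c_str()); }
};

void trace_eval_coverage_test()
{
    // Hypothetical variable controlling the eval trace level.
    scoped_env_var trace{"MIGRAPHX_TRACE_EVAL", "2"};
    // ... build a small migraphx::program and call eval() here so the
    // trace branch added in src/program.cpp gets exercised.
}
```

This only helps if the trace level is read at eval time rather than cached on first use; otherwise setting the variable per-test would not take effect reliably.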
Based on this trace, it appears the copies are streaming properly on a separate stream, although there is quite a large delay.
| Test | Batch | Rate new 0fc282 | Rate old ddc4c0 | Diff | Compare |
|---|---|---|---|---|---|
| torchvision-resnet50 | 64 | 3,261.48 | 3,261.96 | -0.01% | :white_check_mark: |
| torchvision-resnet50_fp16 | 64 | 6,988.20 | 6,986.50 | 0.02% | :white_check_mark: |
| torchvision-densenet121 | 32 | 2,436.70 | 2,435.05 | 0.07% | :white_check_mark: |
| torchvision-densenet121_fp16 | 32 | 4,076.48 | 4,098.64 | -0.54% | :white_check_mark: |
| torchvision-inceptionv3 | 32 | 1,638.92 | 1,637.72 | 0.07% | :white_check_mark: |
| torchvision-inceptionv3_fp16 | 32 | 2,762.55 | 2,760.21 | 0.08% | :white_check_mark: |
| cadene-inceptionv4 | 16 | 775.80 | 776.83 | -0.13% | :white_check_mark: |
| cadene-resnext64x4 | 16 | 810.72 | 812.01 | -0.16% | :white_check_mark: |
| slim-mobilenet | 64 | 7,532.36 | 7,539.31 | -0.09% | :white_check_mark: |
| slim-nasnetalarge | 64 | 211.49 | 211.58 | -0.04% | :white_check_mark: |
| slim-resnet50v2 | 64 | 3,505.22 | 3,507.45 | -0.06% | :white_check_mark: |
| bert-mrpc-onnx | 8 | 1,147.43 | 1,148.38 | -0.08% | :white_check_mark: |
| bert-mrpc-tf | 1 | 493.92 | 462.39 | 6.82% | :high_brightness: |
| pytorch-examples-wlang-gru | 1 | 434.13 | 416.01 | 4.36% | :high_brightness: |
| pytorch-examples-wlang-lstm | 1 | 405.95 | 387.25 | 4.83% | :high_brightness: |
| torchvision-resnet50_1 | 1 | 797.47 | 799.30 | -0.23% | :white_check_mark: |
| cadene-dpn92_1 | 1 | 398.08 | 413.07 | -3.63% | :red_circle: |
| cadene-resnext101_1 | 1 | 384.99 | 384.54 | 0.12% | :white_check_mark: |
| onnx-taau-downsample | 1 | 343.18 | 343.19 | -0.00% | :white_check_mark: |
| dlrm-criteoterabyte | 1 | 33.33 | 33.33 | 0.02% | :white_check_mark: |
| dlrm-criteoterabyte_fp16 | 1 | 52.74 | 52.75 | -0.02% | :white_check_mark: |
| agentmodel | 1 | 8,118.46 | 8,254.64 | -1.65% | :white_check_mark: |
| unet_fp16 | 2 | 58.84 | 58.74 | 0.17% | :white_check_mark: |
| resnet50v1_fp16 | 1 | 940.57 | 942.04 | -0.16% | :white_check_mark: |
| resnet50v1_int8 | 1 | 996.35 | 993.52 | 0.28% | :white_check_mark: |
| bert_base_cased_fp16 | 64 | 1,170.71 | 1,171.05 | -0.03% | :white_check_mark: |
| bert_large_uncased_fp16 | 32 | 363.57 | 363.48 | 0.02% | :white_check_mark: |
| bert_large_fp16 | 1 | 199.12 | 200.70 | -0.79% | :white_check_mark: |
| distilgpt2_fp16 | 16 | 2,202.47 | 2,201.26 | 0.05% | :white_check_mark: |
| yolov5s | 1 | 533.05 | 537.45 | -0.82% | :white_check_mark: |
| tinyllama | 1 | 43.41 | 43.68 | -0.62% | :white_check_mark: |
| vicuna-fastchat | 1 | 179.88 | 175.99 | 2.21% | :white_check_mark: |
| whisper-tiny-encoder | 1 | 416.38 | 418.69 | -0.55% | :white_check_mark: |
| whisper-tiny-decoder | 1 | 428.82 | 428.55 | 0.06% | :white_check_mark: |
This build is not recommended to merge :red_circle:
:red_circle: bert_large_uncased_fp16: FAILED: MIGraphX is not within tolerance - check verbose output