AMDMIGraphX

Add weight streaming

Open eddieliao opened this issue 1 year ago • 11 comments

Add weight streaming to allow running of large models on GPUs with low memory.

Closes #3156.
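Roughly, the idea in a minimal CPU-side sketch (a hypothetical simulation for illustration only — not the actual MIGraphX/HIP implementation): instead of keeping every layer's weights resident on the device, each layer's weights are copied into a small reusable device-side buffer just before that layer runs, so peak device memory is bounded by the largest single layer rather than the sum of all layers.

```cpp
// Hypothetical CPU-side simulation of weight streaming: layer weights live in
// "host" memory and are copied into a small reusable "device" buffer one layer
// at a time, so peak device usage is one layer's worth instead of the sum.
#include <algorithm>
#include <cstddef>
#include <vector>

struct layer
{
    std::vector<float> weights; // stays in host memory
};

// Stand-in for device memory: only ever holds one layer's weights.
struct device_buffer
{
    std::vector<float> data;
    std::size_t peak_usage = 0;

    void stream_in(const std::vector<float>& w)
    {
        data.assign(w.begin(), w.end()); // models a per-layer host-to-device copy
        peak_usage = std::max(peak_usage, data.size());
    }
};

float run(const std::vector<layer>& model, float x)
{
    device_buffer buf;
    for(const auto& l : model)
    {
        buf.stream_in(l.weights); // weights arrive just before this layer runs
        for(float w : buf.data)
            x = x * w; // trivial stand-in for the layer's computation
    }
    return x;
}
```

The trade-off is one host-to-device copy per layer, which is why `hip::copy_to_gpu` dominates the perf summaries further down.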

eddieliao avatar Jun 26 '24 20:06 eddieliao

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 92.17%. Comparing base (ddc4c0c) to head (0fc282e).

Additional details and impacted files
@@           Coverage Diff            @@
##           develop    #3222   +/-   ##
========================================
  Coverage    92.17%   92.17%           
========================================
  Files          512      512           
  Lines        21387    21387           
========================================
  Hits         19714    19714           
  Misses        1673     1673           

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov[bot] avatar Jul 01 '24 17:07 codecov[bot]

17 of the 18 current test failures in test/schedule_test are due to the check EXPECT(not t.has_stream(one));. This fails because I have assigned both literals and their associated copy_to_gpu instructions to a stream. Should literals therefore be excluded from stream assignment?
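For context, here is a heavily simplified model of the stream bookkeeping involved (hypothetical — the names follow the test's usage, but the real MIGraphX scheduler types differ):

```cpp
// Hypothetical, heavily simplified model of the stream bookkeeping that
// test/schedule_test checks via EXPECT(not t.has_stream(...)). The real
// MIGraphX scheduler and tracer types differ.
#include <map>
#include <string>

struct stream_tracer
{
    std::map<std::string, int> streams; // instruction name -> assigned stream

    void set_stream(const std::string& ins, int s) { streams[ins] = s; }
    bool has_stream(const std::string& ins) const { return streams.count(ins) > 0; }
};

// Old scheduling: only the copy goes on a stream; the literal carries none,
// so EXPECT(not t.has_stream(literal)) holds.
inline stream_tracer schedule_old()
{
    stream_tracer t;
    t.set_stream("copy_to_gpu", 1);
    return t;
}

// This PR: the literal is placed on the same stream as its copy, which is
// exactly what flips the test's expectation.
inline stream_tracer schedule_new()
{
    stream_tracer t;
    t.set_stream("copy_to_gpu", 1);
    t.set_stream("@literal", 1);
    return t;
}
```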

eddieliao avatar Jul 02 '24 16:07 eddieliao

Doing some very basic perf testing, adding the literals onto the stream along with the copies seems to improve performance very marginally. I'm still not sure whether the failing tests can be ignored, given that this is a new feature.

eddieliao avatar Jul 03 '24 20:07 eddieliao

The Windows build seems to be stuck for some reason. I'm looking into why, but would appreciate any pointers if anybody has ideas.

eddieliao avatar Jul 09 '24 14:07 eddieliao

Removed call to make_shared_array to avoid slowdown due to std::copy.

Running ./build/bin/driver perf models/resnet50-v2-7.onnx --enable-weight-streaming:

Before:

Summary:
hip::copy_to_gpu: 55.82ms / 117 = 0.477094ms, 61%
@literal: 31.1419ms / 117 = 0.26617ms, 34%
gpu::miopen_fusion: 1.33101ms / 12 = 0.110917ms, 2%
gpu::code_object::mlir_convolution_add_relu: 0.784531ms / 17 = 0.0461489ms, 1%
gpu::convolution: 0.573863ms / 4 = 0.143466ms, 1%
load: 0.317283ms / 191 = 0.00166117ms, 1%
gpu::code_object::mlir_convolution_add: 0.278956ms / 8 = 0.0348696ms, 1%
gpu::code_object::mul_add_relu_kernel: 0.244727ms / 8 = 0.0305909ms, 1%
gpu::code_object::mlir_convolution: 0.236977ms / 5 = 0.0473954ms, 1%
gpu::code_object::mlir_convolution_mul_add_relu: 0.232219ms / 4 = 0.0580549ms, 1%
multibroadcast: 0.206982ms / 67 = 0.00308928ms, 1%
gpu::gemm: 0.167417ms / 1 = 0.167417ms, 1%
gpu::code_object::mlir_convolution_mul_add_add_relu: 0.124884ms / 3 = 0.0416281ms, 1%
gpu::pooling: 0.105871ms / 1 = 0.105871ms, 1%
gpu::code_object::add_relu_noop_concat_noop_kernel: 0.085841ms / 3 = 0.0286137ms, 1%
gpu::code_object::mul_kernel: 0.0637746ms / 2 = 0.0318873ms, 1%
gpu::code_object::concat_add_relu_kernel: 0.0290406ms / 1 = 0.0290406ms, 1%
gpu::code_object::mul_add_add_relu_mul_reduce_sum_kernel: 0.02897ms / 1 = 0.02897ms, 1%
gpu::code_object::add_relu_kernel: 0.0275658ms / 1 = 0.0275658ms, 1%
step: 0.0107792ms / 3 = 0.00359307ms, 1%
hip::hip_copy_literal: 0.009445ms / 2 = 0.0047225ms, 1%
@param: 0.0049534ms / 2 = 0.0024767ms, 1%
hip::hip_allocate_memory: 0.0043382ms / 1 = 0.0043382ms, 1%
check_context::migraphx::gpu::context: 0.0036888ms / 1 = 0.0036888ms, 1%
reshape_lazy: 0.003683ms / 1 = 0.003683ms, 1%

Batch size: 1
Rate: 19.0196 inferences/sec
Total time: 52.5772ms
Total instructions time: 91.8387ms
Overhead time: 0.135608ms, -39.2615ms
Overhead: 0%, -75%

After:

Summary:
hip::copy_to_gpu: 6.37803ms / 117 = 0.0545131ms, 60%
gpu::miopen_fusion: 1.12564ms / 12 = 0.0938037ms, 11%
gpu::code_object::mlir_convolution_add_relu: 0.747901ms / 17 = 0.0439942ms, 7%
load: 0.378693ms / 191 = 0.00198269ms, 4%
gpu::convolution: 0.349078ms / 4 = 0.0872696ms, 4%
gpu::code_object::mlir_convolution_add: 0.267458ms / 8 = 0.0334322ms, 3%
multibroadcast: 0.257707ms / 67 = 0.00384638ms, 3%
gpu::code_object::mlir_convolution: 0.225739ms / 5 = 0.0451478ms, 3%
gpu::code_object::mul_add_relu_kernel: 0.220406ms / 8 = 0.0275508ms, 3%
gpu::code_object::mlir_convolution_mul_add_relu: 0.212062ms / 4 = 0.0530154ms, 2%
gpu::code_object::mlir_convolution_mul_add_add_relu: 0.122514ms / 3 = 0.040838ms, 2%
gpu::code_object::add_relu_noop_concat_noop_kernel: 0.0843656ms / 3 = 0.0281219ms, 1%
gpu::gemm: 0.0727626ms / 1 = 0.0727626ms, 1%
gpu::pooling: 0.067244ms / 1 = 0.067244ms, 1%
gpu::code_object::mul_kernel: 0.051746ms / 2 = 0.025873ms, 1%
@literal: 0.0353714ms / 117 = 0.00030232ms, 1%
gpu::code_object::concat_add_relu_kernel: 0.0287566ms / 1 = 0.0287566ms, 1%
gpu::code_object::mul_add_add_relu_mul_reduce_sum_kernel: 0.0274078ms / 1 = 0.0274078ms, 1%
gpu::code_object::add_relu_kernel: 0.0251514ms / 1 = 0.0251514ms, 1%
step: 0.0067122ms / 3 = 0.0022374ms, 1%
hip::hip_copy_literal: 0.0028988ms / 2 = 0.0014494ms, 1%
reshape_lazy: 0.0017368ms / 1 = 0.0017368ms, 1%
@param: 0.0017358ms / 2 = 0.0008679ms, 1%
hip::hip_allocate_memory: 0.0012642ms / 1 = 0.0012642ms, 1%
check_context::migraphx::gpu::context: 0.0010434ms / 1 = 0.0010434ms, 1%

Batch size: 1
Rate: 162.816 inferences/sec
Total time: 6.1419ms
Total instructions time: 10.6934ms
Overhead time: 0.132798ms, -4.55153ms
Overhead: 2%, -74%
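As a plain-arithmetic sanity check on the numbers above: total time drops from 52.58 ms to 6.14 ms, roughly an 8.56x speedup, and the rates (19.02 → 162.8 inferences/sec) imply the same ratio.

```cpp
// Sanity-check the before/after perf summaries with plain arithmetic:
// the total-time speedup and the rate speedup should agree.

// Figures copied from the perf runs above (ms and inferences/sec).
constexpr double total_before_ms = 52.5772;
constexpr double total_after_ms  = 6.1419;
constexpr double rate_before     = 19.0196;
constexpr double rate_after      = 162.816;

constexpr double speedup(double before, double after) { return before / after; }
```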

eddieliao avatar Jul 15 '24 15:07 eddieliao

The Windows build is failing right now; it seems to be a problem with fuse_mlir from develop.

eddieliao avatar Jul 16 '24 17:07 eddieliao

The Windows build is failing right now; it seems to be a problem with fuse_mlir from develop.

All good. That'll get resolved on its own once develop changes.

Make sure to add some more test coverage as well. Codecov/patch says you have about 62.50% coverage on your change set. We try to maintain about 92-93% coverage, so add a few more unit tests that exercise the code paths you've added.

TedThemistokleous avatar Jul 16 '24 18:07 TedThemistokleous

Make sure to add some more test coverage as well. Codecov/patch says you have about 62.50% coverage on your change set. We try to maintain about 92-93% coverage, so add a few more unit tests that exercise the code paths you've added.

I think the uncovered changes in src/program.cpp are mostly due to the addition of a bool parameter to generic_eval(), which wasn't originally covered. I can add some tests to cover that, although it would require setting an env var to change the trace level. Is there a way to do this through the tests?

eddieliao avatar Jul 16 '24 19:07 eddieliao

Discussed with Umang and Alan; we will ignore the codecov failure for now, as the uncovered code path is purely for debugging. Coverage can be added later if needed, provided it's possible to set env vars inside of a test.
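If we do want this later, one possible sketch (plain POSIX setenv/unsetenv wrapped in an RAII guard — not an existing MIGraphX test utility, and the variable name is just illustrative) would set the trace env var for the scope of a single test case:

```cpp
// Sketch of setting an env var for the scope of one test case using plain
// POSIX setenv/unsetenv. env_guard is a hypothetical helper, not part of
// the MIGraphX test framework; the previous value is restored on scope exit.
#include <cstdlib>
#include <string>

class env_guard
{
    std::string name;
    std::string old_value;
    bool had_old = false;

    public:
    env_guard(const std::string& n, const std::string& value) : name(n)
    {
        if(const char* old = std::getenv(name.c_str()))
        {
            had_old   = true;
            old_value = old;
        }
        setenv(name.c_str(), value.c_str(), /*overwrite=*/1);
    }
    ~env_guard()
    {
        if(had_old)
            setenv(name.c_str(), old_value.c_str(), 1);
        else
            unsetenv(name.c_str());
    }
};
```

A test body would then create, for example, `env_guard g{"MIGRAPHX_TRACE_EVAL", "2"};` before building the program, and the variable is restored when the guard goes out of scope, so other tests are unaffected.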

Based on this trace, the copies appear to be streaming properly on a separate stream, although there is quite a large delay. (See the attached trace screenshot.)

eddieliao avatar Jul 19 '24 15:07 eddieliao

Test | Batch | Rate new (0fc282) | Rate old (ddc4c0) | Diff | Compare
torchvision-resnet50 64 3,261.48 3,261.96 -0.01% :white_check_mark:
torchvision-resnet50_fp16 64 6,988.20 6,986.50 0.02% :white_check_mark:
torchvision-densenet121 32 2,436.70 2,435.05 0.07% :white_check_mark:
torchvision-densenet121_fp16 32 4,076.48 4,098.64 -0.54% :white_check_mark:
torchvision-inceptionv3 32 1,638.92 1,637.72 0.07% :white_check_mark:
torchvision-inceptionv3_fp16 32 2,762.55 2,760.21 0.08% :white_check_mark:
cadene-inceptionv4 16 775.80 776.83 -0.13% :white_check_mark:
cadene-resnext64x4 16 810.72 812.01 -0.16% :white_check_mark:
slim-mobilenet 64 7,532.36 7,539.31 -0.09% :white_check_mark:
slim-nasnetalarge 64 211.49 211.58 -0.04% :white_check_mark:
slim-resnet50v2 64 3,505.22 3,507.45 -0.06% :white_check_mark:
bert-mrpc-onnx 8 1,147.43 1,148.38 -0.08% :white_check_mark:
bert-mrpc-tf 1 493.92 462.39 6.82% :high_brightness:
pytorch-examples-wlang-gru 1 434.13 416.01 4.36% :high_brightness:
pytorch-examples-wlang-lstm 1 405.95 387.25 4.83% :high_brightness:
torchvision-resnet50_1 1 797.47 799.30 -0.23% :white_check_mark:
cadene-dpn92_1 1 398.08 413.07 -3.63% :red_circle:
cadene-resnext101_1 1 384.99 384.54 0.12% :white_check_mark:
onnx-taau-downsample 1 343.18 343.19 -0.00% :white_check_mark:
dlrm-criteoterabyte 1 33.33 33.33 0.02% :white_check_mark:
dlrm-criteoterabyte_fp16 1 52.74 52.75 -0.02% :white_check_mark:
agentmodel 1 8,118.46 8,254.64 -1.65% :white_check_mark:
unet_fp16 2 58.84 58.74 0.17% :white_check_mark:
resnet50v1_fp16 1 940.57 942.04 -0.16% :white_check_mark:
resnet50v1_int8 1 996.35 993.52 0.28% :white_check_mark:
bert_base_cased_fp16 64 1,170.71 1,171.05 -0.03% :white_check_mark:
bert_large_uncased_fp16 32 363.57 363.48 0.02% :white_check_mark:
bert_large_fp16 1 199.12 200.70 -0.79% :white_check_mark:
distilgpt2_fp16 16 2,202.47 2,201.26 0.05% :white_check_mark:
yolov5s 1 533.05 537.45 -0.82% :white_check_mark:
tinyllama 1 43.41 43.68 -0.62% :white_check_mark:
vicuna-fastchat 1 179.88 175.99 2.21% :white_check_mark:
whisper-tiny-encoder 1 416.38 418.69 -0.55% :white_check_mark:
whisper-tiny-decoder 1 428.82 428.55 0.06% :white_check_mark:

This build is not recommended to merge :red_circle:

migraphx-bot avatar Oct 16 '24 15:10 migraphx-bot


     :white_check_mark: bert-mrpc-onnx: PASSED: MIGraphX meets tolerance
     :white_check_mark: bert-mrpc-tf: PASSED: MIGraphX meets tolerance
     :white_check_mark: pytorch-examples-wlang-gru: PASSED: MIGraphX meets tolerance
     :white_check_mark: pytorch-examples-wlang-lstm: PASSED: MIGraphX meets tolerance
     :white_check_mark: torchvision-resnet50_1: PASSED: MIGraphX meets tolerance
     :white_check_mark: cadene-dpn92_1: PASSED: MIGraphX meets tolerance
     :white_check_mark: cadene-resnext101_1: PASSED: MIGraphX meets tolerance
     :white_check_mark: dlrm-criteoterabyte: PASSED: MIGraphX meets tolerance
     :white_check_mark: agentmodel: PASSED: MIGraphX meets tolerance
     :white_check_mark: unet: PASSED: MIGraphX meets tolerance
     :white_check_mark: resnet50v1: PASSED: MIGraphX meets tolerance
     :white_check_mark: bert_base_cased_fp16: PASSED: MIGraphX meets tolerance
     :red_circle: bert_large_uncased_fp16: FAILED: MIGraphX is not within tolerance - check verbose output
     :white_check_mark: bert_large: PASSED: MIGraphX meets tolerance
     :white_check_mark: yolov5s: PASSED: MIGraphX meets tolerance
     :white_check_mark: tinyllama: PASSED: MIGraphX meets tolerance
     :white_check_mark: vicuna-fastchat: PASSED: MIGraphX meets tolerance
     :white_check_mark: whisper-tiny-encoder: PASSED: MIGraphX meets tolerance
     :white_check_mark: whisper-tiny-decoder: PASSED: MIGraphX meets tolerance
     :white_check_mark: distilgpt2_fp16: PASSED: MIGraphX meets tolerance

migraphx-bot avatar Oct 16 '24 15:10 migraphx-bot