
Added GPT-J FP8 support

Open · kplau1128 opened this issue 10 months ago · 9 comments

What does this PR do?

Add GPT-J FP8 support.

Command Lines:

  • Measure the tensor quantization statistics on EleutherAI/gpt-j-6b (a sketch of the QUANT_CONFIG files follows after these commands)

    QUANT_CONFIG=./quantization_config/maxabs_measure.json python run_generation.py \
      --model_name_or_path EleutherAI/gpt-j-6b \
      --use_hpu_graphs \
      --use_kv_cache \
      --limit_hpu_graphs \
      --max_input_tokens 128 \
      --max_new_tokens 128 \
      --batch_size 1 \
      --bf16
    
  • Quantize the model based on previous measurements for EleutherAI/gpt-j-6b

    QUANT_CONFIG=./quantization_config/maxabs_quant.json python run_generation.py \
      --model_name_or_path EleutherAI/gpt-j-6b \
      --use_hpu_graphs \
      --use_kv_cache \
      --limit_hpu_graphs \
      --max_input_tokens 128 \
      --max_new_tokens 128 \
      --batch_size 256 \
      --bf16 \
      --fp8
    
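For context, QUANT_CONFIG points the Habana quantization toolkit at a JSON file that selects measurement or quantization mode. A minimal sketch of the two configs, assuming the stock layout shipped with the text-generation example (the field names here follow the habana_quantization_toolkit convention and are illustrative, not copied from this PR):

    $ cat ./quantization_config/maxabs_measure.json
    {
        "method": "HOOKS",
        "mode": "MEASURE",
        "observer": "maxabs",
        "dump_stats_path": "./hqt_output/measure"
    }
    $ cat ./quantization_config/maxabs_quant.json
    {
        "method": "HOOKS",
        "mode": "QUANTIZE",
        "observer": "maxabs",
        "scale_method": "maxabs_hw",
        "dump_stats_path": "./hqt_output/measure"
    }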

kplau1128 · Apr 22 '24 16:04

@kplau1128, can you please add to the PR description the command lines that were used to test this?

@ssarkar2 Command lines for measure and quantization added to the description.

kplau1128 · Apr 24 '24 18:04

@kplau1128, this PR seems to be in "Draft" mode. Can you please move it to a regular PR if it's ready?

[Screenshot: the PR shown in Draft mode]

ssarkar2 · Apr 25 '24 17:04

@kplau1128, this PR seems to be in "Draft" mode. Can you please move it to a regular PR if it's ready?

@ssarkar2 We are seeing incorrect output when running in quantization mode; debugging is in progress. Once we identify the cause and fix it, we will move the PR to ready.

Have you seen this issue before? Any suggestions for debugging it?

Input/outputs:
input 1: ('DeepSpeed is a machine learning framework',)
output 1: ("DeepSpeed is a machine learning framework for--------------'p-game-p-game-factors-game-p-factors-not-p-game-not-p-game-not-p-game-not-p-game-not-p-game-not-p-game-not-p-game-not-p-game-not-p-game-not-p-game-not-p-game-not-p-game-not-p-game-not-p-game-not-p-game-not-not-",)

input 2: ('He is working on',)
output 2: ('He is working on a- a- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -',)

input 3: ('He has a',)
output 3: ('He has a lot,, a, a,operoperoperoperoper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper',)
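A quick way to isolate the regression is to rerun the same prompts in plain bf16 (without --fp8) and compare; a sketch, assuming run_generation.py accepts a --prompt argument:

    # Sanity check: same model and settings, but no --fp8, so any remaining
    # garbage output would implicate the model changes rather than quantization.
    python run_generation.py \
      --model_name_or_path EleutherAI/gpt-j-6b \
      --use_hpu_graphs \
      --use_kv_cache \
      --max_new_tokens 128 \
      --bf16 \
      --prompt "DeepSpeed is a machine learning framework"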

kplau1128 · Apr 25 '24 18:04

Resolved the incorrect output issue.

kplau1128 · Apr 26 '24 02:04

Can you please run bf16 tests and make sure you are getting similar numbers with and without this PR as well? For example, a couple of configs like 128->128, 128->2048, 2048->128, and 2048->2048 (i->o meaning max_input_tokens=i and max_new_tokens=o).
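The requested sweep could be scripted roughly like this (a sketch reusing the command from the PR description; batch size and other flags are placeholders):

    # Run the four input->output configs in bf16 and record the throughput.
    for io in "128 128" "128 2048" "2048 128" "2048 2048"; do
      set -- $io
      python run_generation.py \
        --model_name_or_path EleutherAI/gpt-j-6b \
        --use_hpu_graphs \
        --use_kv_cache \
        --max_input_tokens $1 \
        --max_new_tokens $2 \
        --batch_size 1 \
        --bf16
    done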

Also, can you please run make style at the topmost folder, because the PR is failing the style check.

ssarkar2 · Apr 30 '24 21:04

Can you please run bf16 tests and make sure you are getting similar numbers with and without this PR as well? For example, a couple of configs like 128->128, 128->2048, 2048->128, and 2048->2048 (i->o meaning max_input_tokens=i and max_new_tokens=o).

Also, can you please run make style at the topmost folder, because the PR is failing the style check.

Style check done.

Also added a commit with --reuse_cache option support.
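For reference, the quantization command from the description with the new flag added (a sketch; other arguments unchanged):

    QUANT_CONFIG=./quantization_config/maxabs_quant.json python run_generation.py \
      --model_name_or_path EleutherAI/gpt-j-6b \
      --use_hpu_graphs \
      --use_kv_cache \
      --reuse_cache \
      --limit_hpu_graphs \
      --max_input_tokens 128 \
      --max_new_tokens 128 \
      --batch_size 256 \
      --bf16 \
      --fp8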

kplau1128 · May 01 '24 23:05

Testing with this PR shows improved throughput, but it is still low.

Also, the main GPT-J code already hits "RuntimeError: The size of tensor a (2048) must match the size of tensor b (2176) at non-singleton dimension 3" when either max_input_tokens or max_new_tokens is set to 2048. (Note that 2176 = 2048 + 128, i.e., max_input_tokens + max_new_tokens, which suggests a length mismatch between the attention mask and the allocated KV cache.)

Without this PR:

| Model | Mode | batch_size | max_input_tokens | max_new_tokens | Throughput (tokens/second) | Number of HPU graphs | Memory allocated (GB) | Max memory allocated (GB) | Total memory available (GB) | Graph compilation duration (seconds) | Error |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-J-6B | Measure | 1 | 128 | 128 | 147.26 | 11 | 11.52 | 11.68 | 94.62 | 8.71 | |
| GPT-J-6B | Quantization | 1024 | 128 | 128 | | | | | | | RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::2147483648 (2048)MB |
| GPT-J-6B | Quantization | 640 | 128 | 128 | | | | | | | RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::33030144000 (31500)MB |
| GPT-J-6B | Quantization | 320 | 128 | 128 | | | | | | | RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::16515072000 (15750)MB |
| GPT-J-6B | Quantization | 256 | 128 | 128 | 4978.55 | 15 | 40.21 | 86.68 | 94.62 | 80.36 | |
| GPT-J-6B | Quantization | 120 | 2048 | 128 | | | | | | | RuntimeError: The size of tensor a (2048) must match the size of tensor b (2176) at non-singleton dimension 3 |
| GPT-J-6B | Quantization | 120 | 1024 | 128 | | | | | | | RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::27869184000 (26578.1)MB |
| GPT-J-6B | Quantization | 120 | 368 | 128 | 2823.54 | 15 | 37.01 | 76.51 | 94.62 | 78.28 | |
| GPT-J-6B | Quantization | 120 | 128 | 2048 | | | | | | | RuntimeError: The size of tensor a (2048) must match the size of tensor b (2176) at non-singleton dimension 3 |
| GPT-J-6B | Quantization | 120 | 128 | 1024 | | | | | | | RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::27869184000 (26578.1)MB |
| GPT-J-6B | Quantization | 120 | 128 | 624 | | | | | | | RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::18192384000 (17349.6)MB |
| GPT-J-6B | Quantization | 120 | 128 | 512 | 3706.45 | 15 | 46.02 | 94.62 | 94.62 | 113.97 | |
| GPT-J-6B | Quantization | 120 | 128 | 256 | 4687.05 | 15 | 30.01 | 62.27 | 94.62 | 77.36 | |
| GPT-J-6B | Quantization | 64 | 2048 | 128 | | | | | | | RuntimeError: The size of tensor a (2048) must match the size of tensor b (2176) at non-singleton dimension 3 |
| GPT-J-6B | Quantization | 64 | 1024 | 128 | 1183.81 | 15 | 44.39 | 91.28 | 94.62 | 103.23 | |
| GPT-J-6B | Quantization | 64 | 512 | 128 | 2032.33 | 15 | 27.35 | 54.08 | 94.62 | 70.22 | |
| GPT-J-6B | Quantization | 64 | 2048 | 2048 | | | | | | | RuntimeError: The size of tensor a (2048) must match the size of tensor b (4096) at non-singleton dimension 3 |
| GPT-J-6B | Quantization | 64 | 1024 | 1024 | | | | | | | RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::26424115200 (25200)MB |
| GPT-J-6B | Quantization | 64 | 512 | 512 | 2514.55 | 15 | 40.12 | 82.57 | 94.62 | 113.1 | |

With this PR:

| Model | Mode | batch_size | max_input_tokens | max_new_tokens | Throughput (tokens/second) | Number of HPU graphs | Memory allocated (GB) | Max memory allocated (GB) | Total memory available (GB) | Graph compilation duration (seconds) | Error |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-J-6B | Measure | 1 | 128 | 128 | 87.77 | 21 | 11.52 | 11.57 | 94.62 | 6.44 | |
| GPT-J-6B | Quantization | 1024 | 128 | 128 | | | | | | | RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::52848230400 (50400)MB |
| GPT-J-6B | Quantization | 640 | 128 | 128 | 5822.32 | 74 | 56.74 | 94.62 | 94.62 | 115.87 | |
| GPT-J-6B | Quantization | 320 | 128 | 128 | 5654.14 | 74 | 31.34 | 65.78 | 94.62 | 85.91 | |
| GPT-J-6B | Quantization | 256 | 128 | 128 | 5717.59 | 74 | 26.27 | 53.58 | 94.62 | 80.37 | |
| GPT-J-6B | Quantization | 120 | 2048 | 128 | | | | | | | RuntimeError: The size of tensor a (2048) must match the size of tensor b (2176) at non-singleton dimension 3 |
| GPT-J-6B | Quantization | 120 | 1024 | 128 | 1533.06 | 74 | 48.53 | 78.33 | 94.62 | 103.88 | |
| GPT-J-6B | Quantization | 120 | 368 | 128 | 3323.75 | 74 | 24.32 | 48.44 | 94.62 | 86.0 | |
| GPT-J-6B | Quantization | 120 | 128 | 2048 | | | | | | | RuntimeError: The size of tensor a (2048) must match the size of tensor b (2176) at non-singleton dimension 3 |
| GPT-J-6B | Quantization | 120 | 128 | 1024 | 3910.81 | 74 | 48.53 | 78.33 | 94.62 | 169.21 | |
| GPT-J-6B | Quantization | 120 | 128 | 624 | 4892.65 | 74 | 33.77 | 70.57 | 94.62 | 118.27 | |
| GPT-J-6B | Quantization | 120 | 128 | 512 | 5022.39 | 74 | 29.64 | 61.05 | 94.62 | 101.95 | |
| GPT-J-6B | Quantization | 120 | 128 | 256 | 5714.56 | 74 | 20.19 | 38.94 | 94.62 | 80.3 | |
| GPT-J-6B | Quantization | 64 | 2048 | 128 | | | | | | | RuntimeError: The size of tensor a (2048) must match the size of tensor b (2176) at non-singleton dimension 3 |
| GPT-J-6B | Quantization | 64 | 1024 | 128 | 1464.83 | 74 | 28.66 | 58.54 | 94.62 | 84.13 | |
| GPT-J-6B | Quantization | 64 | 512 | 128 | 2402.4 | 74 | 18.58 | 35.04 | 94.62 | 73.62 | |
| GPT-J-6B | Quantization | 64 | 2048 | 2048 | | | | | | | RuntimeError: The size of tensor a (2048) must match the size of tensor b (4096) at non-singleton dimension 3 |
| GPT-J-6B | Quantization | 64 | 1024 | 1024 | 2431.85 | 74 | 48.18 | 76.33 | 94.62 | 140.18 | |
| GPT-J-6B | Quantization | 64 | 512 | 512 | 3365.81 | 74 | 26.15 | 52.71 | 94.62 | 91.72 | |

kplau1128 · May 06 '24 17:05

Fixed the "RuntimeError: The size of tensor a (2048) must match the size of tensor b (2176) at non-singleton dimension 3" issue.

kplau1128 · May 07 '24 03:05

@kplau1128 what's the latest test result for this?

libinta · May 23 '24 20:05

@kplau1128 what's the latest test result for this?

@libinta I have updated GS-103; here is the latest test result:

| Model | Mode | batch_size | bucket_size | max_input_tokens | max_new_tokens | Throughput (tokens/second) | Number of HPU graphs | Memory allocated (GB) | Max memory allocated (GB) | Total memory available (GB) | Graph compilation duration (seconds) | Error |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-J-6B | Measure | 1 | | 128 | 128 | 88.79 | 19 | 11.51 | 11.53 | 94.62 | 6.41 | |
| GPT-J-6B | Quantization | 1024 | | 128 | 128 | | | | | | | RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::26424115200 (25200)MB |
| GPT-J-6B | Quantization | 928 | | 128 | 128 | 8186.81 | 72 | 68.47 | 94.61 | 94.62 | 121.32 | |
| GPT-J-6B | Quantization | 120 | 128 | 128 | 2048 | 5114.44 | 492 | 64.32 | 70.6 | 94.62 | 599.7 | |
| GPT-J-6B | Quantization | 64 | | 2048 | 128 | 794.87 | 72 | 49.45 | 77.64 | 94.62 | 98.83 | |
| GPT-J-6B | Quantization | 64 | 128 | 2048 | 2048 | | | | | | | RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::26424115200 (25200)MB |
| GPT-J-6B | Quantization | 60 | 128 | 2048 | 2048 | | | | | | | RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::12096000 (11.5356)MB |
| GPT-J-6B | Quantization | 56 | 128 | 2048 | 2048 | 2041.35 | 492 | 66.2 | 90.89 | 94.62 | 625.4 | |
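The bucket_size column corresponds to KV-cache bucketing, which caps the number of graph shapes compiled for long generations; a sketch of how the 120 x 128->2048 row might be launched, assuming the script exposes a --bucket_size flag:

    # Long-generation run with bucketing (the --bucket_size flag is assumed here).
    QUANT_CONFIG=./quantization_config/maxabs_quant.json python run_generation.py \
      --model_name_or_path EleutherAI/gpt-j-6b \
      --use_hpu_graphs \
      --use_kv_cache \
      --reuse_cache \
      --bucket_size 128 \
      --max_input_tokens 128 \
      --max_new_tokens 2048 \
      --batch_size 120 \
      --bf16 \
      --fp8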

kplau1128 · May 24 '24 01:05

This PR was replaced by https://github.com/huggingface/optimum-habana/pull/1094.

kplau1128 · Jul 15 '24 18:07