optimum-habana
Added GPT-J FP8 support
What does this PR do?
Add GPT-J FP8 support.
Command lines:

- Measure the tensor quantization statistics on EleutherAI/gpt-j-6b:

```bash
QUANT_CONFIG=./quantization_config/maxabs_measure.json python run_generation.py \
    --model_name_or_path EleutherAI/gpt-j-6b \
    --use_hpu_graphs \
    --use_kv_cache \
    --limit_hpu_graphs \
    --max_input_tokens 128 \
    --max_new_tokens 128 \
    --batch_size 1 \
    --bf16
```

- Quantize the model based on the previous measurements for EleutherAI/gpt-j-6b:

```bash
QUANT_CONFIG=./quantization_config/maxabs_quant.json python run_generation.py \
    --model_name_or_path EleutherAI/gpt-j-6b \
    --use_hpu_graphs \
    --use_kv_cache \
    --limit_hpu_graphs \
    --max_input_tokens 128 \
    --max_new_tokens 128 \
    --batch_size 256 \
    --bf16 \
    --fp8
```
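For reference, the two `QUANT_CONFIG` files follow the Habana quantization toolkit config format; below is a sketch of what they typically contain (the exact field values are assumptions based on the usual optimum-habana examples, not copied from this PR):

```python
import json

# Sketch of typical Habana quantization toolkit configs (field values are
# assumptions, not taken from this PR's quantization_config folder).
measure_config = {
    "method": "HOOKS",
    "mode": "MEASURE",          # first pass: collect maxabs statistics
    "observer": "maxabs",
    "dump_stats_path": "./hqt_output/measure",
}
quant_config = {
    "method": "HOOKS",
    "mode": "QUANTIZE",         # second pass: quantize using the saved stats
    "observer": "maxabs",
    "scale_method": "maxabs_hw",
    "dump_stats_path": "./hqt_output/measure",
}

with open("maxabs_measure.json", "w") as f:
    json.dump(measure_config, f, indent=2)
with open("maxabs_quant.json", "w") as f:
    json.dump(quant_config, f, indent=2)
```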
@kplau1128, can you please add to the PR description the command lines that were used to test this?
@ssarkar2 Command lines for measurement and quantization have been added to the description.
@kplau1128, this PR seems to be in "Draft" mode. Can you please move it to a regular PR if it's ready?
@ssarkar2 We are seeing incorrect output when running in quantization mode; debugging is in progress. Once we identify the cause and fix it, we will mark the PR ready.
Have you seen this issue before? Any suggestions for debugging it?
Input/outputs:
input 1: ('DeepSpeed is a machine learning framework',)
output 1: ("DeepSpeed is a machine learning framework for--------------'p-game-p-game-factors-game-p-factors-not-p-game-not-p-game-not-p-game-not-p-game-not-p-game-not-p-game-not-p-game-not-p-game-not-p-game-not-p-game-not-p-game-not-p-game-not-p-game-not-p-game-not-p-game-not-not-",)
input 2: ('He is working on',)
output 2: ('He is working on a- a- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -',)
input 3: ('He has a',)
output 3: ('He has a lot,, a, a,operoperoperoperoper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper',)
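One way to localize where the FP8 path starts to diverge is to capture per-module outputs in both BF16 and FP8 runs and compare them layer by layer; the helper below is only a sketch (names and tolerance are illustrative, not part of this PR):

```python
import torch

# Illustrative helper: record each module's output so a BF16 run and an FP8
# run of the same prompt can be compared module by module.
def capture_outputs(model, store):
    handles = []
    for name, module in model.named_modules():
        def hook(mod, inputs, output, name=name):
            if isinstance(output, torch.Tensor):
                store[name] = output.detach().float().cpu()
        handles.append(module.register_forward_hook(hook))
    return handles  # call h.remove() on each handle when done

# Report the first module whose outputs diverge beyond a tolerance
# (atol is a guess; tighten or loosen as needed).
def first_divergence(bf16_outs, fp8_outs, atol=1e-1):
    for name, ref in bf16_outs.items():
        if name in fp8_outs and not torch.allclose(ref, fp8_outs[name], atol=atol):
            return name, (ref - fp8_outs[name]).abs().max().item()
    return None
```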
Resolved the incorrect output issue.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Can you please run the bf16 tests and make sure you are getting similar numbers with and without this PR, for example a couple of configs like 128->128, 128->2048, 2048->128, 2048->2048 (i->o meaning max_input_tokens=i and max_new_tokens=o)?
Also, can you please run `make style` at the topmost folder, because the PR is failing the style check.
Style check done.
Also added a commit with `--reuse_cache` option support.
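For context, `--reuse_cache` keeps a preallocated KV cache and updates it in place across decoding steps instead of reallocating or concatenating; a rough sketch of the idea (shapes and class name are illustrative, not the actual optimum-habana implementation):

```python
import torch

# Rough sketch of a preallocated, reusable KV cache (illustrative only).
class ReusableKVCache:
    def __init__(self, batch, heads, max_len, head_dim, dtype, device):
        shape = (batch, heads, max_len, head_dim)
        self.key = torch.zeros(shape, dtype=dtype, device=device)
        self.value = torch.zeros(shape, dtype=dtype, device=device)

    def update(self, k, v, pos):
        # Write new keys/values in place at the current position rather than
        # torch.cat, so tensor shapes stay static across decoding steps,
        # which is what HPU graphs need to avoid recompilation.
        seq = k.shape[2]
        self.key[:, :, pos : pos + seq] = k
        self.value[:, :, pos : pos + seq] = v
        return self.key, self.value
```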
Testing with this PR shows a throughput improvement, but it is still low.
Also, the main GPT-J code already hits the `The size of tensor a (2048) must match the size of tensor b (2176) at non-singleton dimension 3` issue when either `max_input_tokens` or `max_new_tokens` is set to 2048.
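A plausible reading of that error (an assumption, not confirmed in this thread): GPT-J's causal-mask bias buffer is capped at `n_positions` = 2048, while the statically padded key length is `max_input_tokens + max_new_tokens`, so setting either one to 2048 pushes the key length past the buffer:

```python
# Illustrative shape arithmetic for the mismatch (values from the report):
n_positions = 2048      # GPT-J-6B config.n_positions, caps the causal bias buffer
max_input_tokens = 2048
max_new_tokens = 128
key_length = max_input_tokens + max_new_tokens  # 2176 with static KV cache padding

# The attention bias (causal mask) is registered as a
# (1, 1, n_positions, n_positions) buffer, so a key length beyond
# n_positions cannot be masked:
assert key_length > n_positions  # 2176 > 2048 -> "tensor a (2048) vs tensor b (2176)"
```

The same arithmetic gives 2176 for the 128->2048 case and 4096 for 2048->2048, matching the errors in the tables below.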
Without this PR:
| Model | batch_size | max_input_tokens | max_new_tokens | Throughput (tokens/second) | Number of HPU graphs | Memory allocated (GB) | Max memory allocated (GB) | Total memory available (GB) | Graph compilation duration (seconds) | Error |
|---|---|---|---|---|---|---|---|---|---|---|
| GPT-J-6B Measure | 1 | 128 | 128 | 147.26 | 11 | 11.52 | 11.68 | 94.62 | 8.71 | |
| GPT-J-6B Quantization | 1024 | 128 | 128 | | | | | | | RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::2147483648 (2048)MB |
| GPT-J-6B Quantization | 640 | 128 | 128 | | | | | | | RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::33030144000 (31500)MB |
| GPT-J-6B Quantization | 320 | 128 | 128 | | | | | | | RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::16515072000 (15750)MB |
| GPT-J-6B Quantization | 256 | 128 | 128 | 4978.55 | 15 | 40.21 | 86.68 | 94.62 | 80.36 | |
| GPT-J-6B Quantization | 120 | 2048 | 128 | | | | | | | RuntimeError: The size of tensor a (2048) must match the size of tensor b (2176) at non-singleton dimension 3 |
| GPT-J-6B Quantization | 120 | 1024 | 128 | | | | | | | RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::27869184000 (26578.1)MB |
| GPT-J-6B Quantization | 120 | 368 | 128 | 2823.54 | 15 | 37.01 | 76.51 | 94.62 | 78.28 | |
| GPT-J-6B Quantization | 120 | 128 | 2048 | | | | | | | RuntimeError: The size of tensor a (2048) must match the size of tensor b (2176) at non-singleton dimension 3 |
| GPT-J-6B Quantization | 120 | 128 | 1024 | | | | | | | RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::27869184000 (26578.1)MB |
| GPT-J-6B Quantization | 120 | 128 | 624 | | | | | | | RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::18192384000 (17349.6)MB |
| GPT-J-6B Quantization | 120 | 128 | 512 | 3706.45 | 15 | 46.02 | 94.62 | 94.62 | 113.97 | |
| GPT-J-6B Quantization | 120 | 128 | 256 | 4687.05 | 15 | 30.01 | 62.27 | 94.62 | 77.36 | |
| GPT-J-6B Quantization | 64 | 2048 | 128 | | | | | | | RuntimeError: The size of tensor a (2048) must match the size of tensor b (2176) at non-singleton dimension 3 |
| GPT-J-6B Quantization | 64 | 1024 | 128 | 1183.81 | 15 | 44.39 | 91.28 | 94.62 | 103.23 | |
| GPT-J-6B Quantization | 64 | 512 | 128 | 2032.33 | 15 | 27.35 | 54.08 | 94.62 | 70.22 | |
| GPT-J-6B Quantization | 64 | 2048 | 2048 | | | | | | | RuntimeError: The size of tensor a (2048) must match the size of tensor b (4096) at non-singleton dimension 3 |
| GPT-J-6B Quantization | 64 | 1024 | 1024 | | | | | | | RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::26424115200 (25200)MB |
| GPT-J-6B Quantization | 64 | 512 | 512 | 2514.55 | 15 | 40.12 | 82.57 | 94.62 | 113.1 | |
With this PR:
| Model | batch_size | max_input_tokens | max_new_tokens | Throughput (tokens/second) | Number of HPU graphs | Memory allocated (GB) | Max memory allocated (GB) | Total memory available (GB) | Graph compilation duration (seconds) | Error |
|---|---|---|---|---|---|---|---|---|---|---|
| GPT-J-6B Measure | 1 | 128 | 128 | 87.77 | 21 | 11.52 | 11.57 | 94.62 | 6.44 | |
| GPT-J-6B Quantization | 1024 | 128 | 128 | | | | | | | RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::52848230400 (50400)MB |
| GPT-J-6B Quantization | 640 | 128 | 128 | 5822.32 | 74 | 56.74 | 94.62 | 94.62 | 115.87 | |
| GPT-J-6B Quantization | 320 | 128 | 128 | 5654.14 | 74 | 31.34 | 65.78 | 94.62 | 85.91 | |
| GPT-J-6B Quantization | 256 | 128 | 128 | 5717.59 | 74 | 26.27 | 53.58 | 94.62 | 80.37 | |
| GPT-J-6B Quantization | 120 | 2048 | 128 | | | | | | | RuntimeError: The size of tensor a (2048) must match the size of tensor b (2176) at non-singleton dimension 3 |
| GPT-J-6B Quantization | 120 | 1024 | 128 | 1533.06 | 74 | 48.53 | 78.33 | 94.62 | 103.88 | |
| GPT-J-6B Quantization | 120 | 368 | 128 | 3323.75 | 74 | 24.32 | 48.44 | 94.62 | 86.0 | |
| GPT-J-6B Quantization | 120 | 128 | 2048 | | | | | | | RuntimeError: The size of tensor a (2048) must match the size of tensor b (2176) at non-singleton dimension 3 |
| GPT-J-6B Quantization | 120 | 128 | 1024 | 3910.81 | 74 | 48.53 | 78.33 | 94.62 | 169.21 | |
| GPT-J-6B Quantization | 120 | 128 | 624 | 4892.65 | 74 | 33.77 | 70.57 | 94.62 | 118.27 | |
| GPT-J-6B Quantization | 120 | 128 | 512 | 5022.39 | 74 | 29.64 | 61.05 | 94.62 | 101.95 | |
| GPT-J-6B Quantization | 120 | 128 | 256 | 5714.56 | 74 | 20.19 | 38.94 | 94.62 | 80.3 | |
| GPT-J-6B Quantization | 64 | 2048 | 128 | | | | | | | RuntimeError: The size of tensor a (2048) must match the size of tensor b (2176) at non-singleton dimension 3 |
| GPT-J-6B Quantization | 64 | 1024 | 128 | 1464.83 | 74 | 28.66 | 58.54 | 94.62 | 84.13 | |
| GPT-J-6B Quantization | 64 | 512 | 128 | 2402.4 | 74 | 18.58 | 35.04 | 94.62 | 73.62 | |
| GPT-J-6B Quantization | 64 | 2048 | 2048 | | | | | | | RuntimeError: The size of tensor a (2048) must match the size of tensor b (4096) at non-singleton dimension 3 |
| GPT-J-6B Quantization | 64 | 1024 | 1024 | 2431.85 | 74 | 48.18 | 76.33 | 94.62 | 140.18 | |
| GPT-J-6B Quantization | 64 | 512 | 512 | 3365.81 | 74 | 26.15 | 52.71 | 94.62 | 91.72 | |
Fixed the `RuntimeError: The size of tensor a (2048) must match the size of tensor b (2176) at non-singleton dimension 3` issue.
@kplau1128 what's the latest test result for this?
@libinta I have updated it in GS-103; here is the latest test result:
| Model | batch_size | bucket_size | max_input_tokens | max_new_tokens | Throughput (tokens/second) | Number of HPU graphs | Memory allocated (GB) | Max memory allocated (GB) | Total memory available (GB) | Graph compilation duration (seconds) | Error |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-J-6B Measure | 1 | | 128 | 128 | 88.79 | 19 | 11.51 | 11.53 | 94.62 | 6.41 | |
| GPT-J-6B Quantization | 1024 | | 128 | 128 | | | | | | | RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::26424115200 (25200)MB |
| GPT-J-6B Quantization | 928 | | 128 | 128 | 8186.81 | 72 | 68.47 | 94.61 | 94.62 | 121.32 | |
| GPT-J-6B Quantization | 120 | 128 | 128 | 2048 | 5114.44 | 492 | 64.32 | 70.6 | 94.62 | 599.7 | |
| GPT-J-6B Quantization | 64 | | 2048 | 128 | 794.87 | 72 | 49.45 | 77.64 | 94.62 | 98.83 | |
| GPT-J-6B Quantization | 64 | 128 | 2048 | 2048 | | | | | | | RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::26424115200 (25200)MB |
| GPT-J-6B Quantization | 60 | 128 | 2048 | 2048 | | | | | | | RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::12096000 (11.5356)MB |
| GPT-J-6B Quantization | 56 | 128 | 2048 | 2048 | 2041.35 | 492 | 66.2 | 90.89 | 94.62 | 625.4 | |
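For readers unfamiliar with the `bucket_size` column: bucketing rounds sequence lengths up to the nearest bucket boundary so that only a handful of static shapes (and hence HPU graphs) ever get compiled; a minimal sketch of the idea (illustrative, not the exact optimum-habana logic):

```python
import math

# Illustrative: round a sequence length up to the nearest bucket boundary so
# only a small, fixed set of shapes is ever seen by the compiled graphs.
def bucketize(seq_len: int, bucket_size: int) -> int:
    return math.ceil(seq_len / bucket_size) * bucket_size

# e.g. with bucket_size=128, lengths 1..128 all map to 128, 129..256 to 256, etc.
assert bucketize(130, 128) == 256
```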
This PR has been replaced by https://github.com/huggingface/optimum-habana/pull/1094.