
Added GPT-J FP8 support

Open · kplau1128 opened this issue 10 months ago · 9 comments

What does this PR do?

Add GPT-J FP8 support.

Command Lines:

  • Measure the tensor quantization statistics on EleutherAI/gpt-j-6b (a sketch of the QUANT_CONFIG files follows after these commands)

    QUANT_CONFIG=./quantization_config/maxabs_measure.json python run_generation.py \
      --model_name_or_path EleutherAI/gpt-j-6b \
      --use_hpu_graphs \
      --use_kv_cache \
      --limit_hpu_graphs \
      --max_input_tokens 128 \
      --max_new_tokens 128 \
      --batch_size 1 \
      --bf16
    
  • Quantize the model based on previous measurements for EleutherAI/gpt-j-6b

    QUANT_CONFIG=./quantization_config/maxabs_quant.json python run_generation.py \
      --model_name_or_path EleutherAI/gpt-j-6b \
      --use_hpu_graphs \
      --use_kv_cache \
      --limit_hpu_graphs \
      --max_input_tokens 128 \
      --max_new_tokens 128 \
      --batch_size 256 \
      --bf16 \
      --fp8
    
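For context, QUANT_CONFIG points the Habana quantization toolkit at a JSON file that selects measurement or quantization mode. A minimal sketch of the two configs, assuming the stock layout shipped with the text-generation example (the field names here follow the habana_quantization_toolkit convention and are illustrative, not copied from this PR):

    $ cat ./quantization_config/maxabs_measure.json
    {
        "method": "HOOKS",
        "mode": "MEASURE",
        "observer": "maxabs",
        "dump_stats_path": "./hqt_output/measure"
    }
    $ cat ./quantization_config/maxabs_quant.json
    {
        "method": "HOOKS",
        "mode": "QUANTIZE",
        "observer": "maxabs",
        "scale_method": "maxabs_hw",
        "dump_stats_path": "./hqt_output/measure"
    }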

kplau1128 · Apr 22 '24 16:04

@kplau1128, can you please add to the PR description the command lines that were used to test this?

@ssarkar2 Command lines for measure and quantization added to the description.

kplau1128 · Apr 24 '24 18:04

@kplau1128, this PR seems to be in "Draft" mode. Can you please move it to a regular PR if it's ready?

[Screenshot: the PR shown in Draft mode]

ssarkar2 · Apr 25 '24 17:04

@kplau1128, this PR seems to be in "Draft" mode. Can you please move it to a regular PR if it's ready?

@ssarkar2 We are seeing incorrect output when running in quantization mode; debugging is in progress. Once we identify the cause and fix it, we will move the PR to ready.

Have you seen this issue before? Any suggestions for debugging it?

Input/outputs:
input 1: ('DeepSpeed is a machine learning framework',)
output 1: ("DeepSpeed is a machine learning framework for--------------'p-game-p-game-factors-game-p-factors-not-p-game-not-p-game-not-p-game-not-p-game-not-p-game-not-p-game-not-p-game-not-p-game-not-p-game-not-p-game-not-p-game-not-p-game-not-p-game-not-p-game-not-p-game-not-not-",)

input 2: ('He is working on',)
output 2: ('He is working on a- a- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -',)

input 3: ('He has a',)
output 3: ('He has a lot,, a, a,operoperoperoperoper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper-oper',)
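A quick way to isolate the regression is to rerun the same prompts in plain bf16 (without --fp8) and compare; a sketch, assuming run_generation.py accepts a --prompt argument:

    # Sanity check: same model and settings, but no --fp8, so any remaining
    # garbage output would implicate the model changes rather than quantization.
    python run_generation.py \
      --model_name_or_path EleutherAI/gpt-j-6b \
      --use_hpu_graphs \
      --use_kv_cache \
      --max_new_tokens 128 \
      --bf16 \
      --prompt "DeepSpeed is a machine learning framework"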

kplau1128 · Apr 25 '24 18:04

Resolved the incorrect output issue.

kplau1128 · Apr 26 '24 02:04

Can you please run bf16 tests and make sure you are getting similar numbers with and without this PR as well? For example, a couple of configs like 128->128, 128->2048, 2048->128, and 2048->2048 (i->o meaning max_input_tokens=i and max_new_tokens=o).
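The requested sweep could be scripted roughly like this (a sketch reusing the command from the PR description; batch size and other flags are placeholders):

    # Run the four input->output configs in bf16 and record the throughput.
    for io in "128 128" "128 2048" "2048 128" "2048 2048"; do
      set -- $io
      python run_generation.py \
        --model_name_or_path EleutherAI/gpt-j-6b \
        --use_hpu_graphs \
        --use_kv_cache \
        --max_input_tokens $1 \
        --max_new_tokens $2 \
        --batch_size 1 \
        --bf16
    done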

Also, can you please run make style at the topmost folder, because the PR is failing the style check.

ssarkar2 · Apr 30 '24 21:04

Can you please run bf16 tests and make sure you are getting similar numbers with and without this PR as well? For example, a couple of configs like 128->128, 128->2048, 2048->128, and 2048->2048 (i->o meaning max_input_tokens=i and max_new_tokens=o).

Also, can you please run make style at the topmost folder, because the PR is failing the style check.

Style check done.

Also added a commit with --reuse_cache option support.
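For reference, the quantization command from the description with the new flag added (a sketch; other arguments unchanged):

    QUANT_CONFIG=./quantization_config/maxabs_quant.json python run_generation.py \
      --model_name_or_path EleutherAI/gpt-j-6b \
      --use_hpu_graphs \
      --use_kv_cache \
      --reuse_cache \
      --limit_hpu_graphs \
      --max_input_tokens 128 \
      --max_new_tokens 128 \
      --batch_size 256 \
      --bf16 \
      --fp8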

kplau1128 · May 01 '24 23:05

Testing with this PR shows improved throughput, but it is still low.

Also, the main GPT-J code already hits "RuntimeError: The size of tensor a (2048) must match the size of tensor b (2176) at non-singleton dimension 3" when either max_input_tokens or max_new_tokens is set to 2048. (Note that 2176 = 2048 + 128, i.e., max_input_tokens + max_new_tokens, which suggests a length mismatch between the attention mask and the allocated KV cache.)

Without this PR:

| Model | Mode | batch_size | max_input_tokens | max_new_tokens | Throughput (tokens/second) | Number of HPU graphs | Memory allocated (GB) | Max memory allocated (GB) | Total memory available (GB) | Graph compilation duration (seconds) | Error |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-J-6B | Measure | 1 | 128 | 128 | 147.26 | 11 | 11.52 | 11.68 | 94.62 | 8.71 | |
| GPT-J-6B | Quantization | 1024 | 128 | 128 | | | | | | | RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::2147483648 (2048)MB |
| GPT-J-6B | Quantization | 640 | 128 | 128 | | | | | | | RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::33030144000 (31500)MB |
| GPT-J-6B | Quantization | 320 | 128 | 128 | | | | | | | RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::16515072000 (15750)MB |
| GPT-J-6B | Quantization | 256 | 128 | 128 | 4978.55 | 15 | 40.21 | 86.68 | 94.62 | 80.36 | |
| GPT-J-6B | Quantization | 120 | 2048 | 128 | | | | | | | RuntimeError: The size of tensor a (2048) must match the size of tensor b (2176) at non-singleton dimension 3 |
| GPT-J-6B | Quantization | 120 | 1024 | 128 | | | | | | | RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::27869184000 (26578.1)MB |
| GPT-J-6B | Quantization | 120 | 368 | 128 | 2823.54 | 15 | 37.01 | 76.51 | 94.62 | 78.28 | |
| GPT-J-6B | Quantization | 120 | 128 | 2048 | | | | | | | RuntimeError: The size of tensor a (2048) must match the size of tensor b (2176) at non-singleton dimension 3 |
| GPT-J-6B | Quantization | 120 | 128 | 1024 | | | | | | | RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::27869184000 (26578.1)MB |
| GPT-J-6B | Quantization | 120 | 128 | 624 | | | | | | | RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::18192384000 (17349.6)MB |
| GPT-J-6B | Quantization | 120 | 128 | 512 | 3706.45 | 15 | 46.02 | 94.62 | 94.62 | 113.97 | |
| GPT-J-6B | Quantization | 120 | 128 | 256 | 4687.05 | 15 | 30.01 | 62.27 | 94.62 | 77.36 | |
| GPT-J-6B | Quantization | 64 | 2048 | 128 | | | | | | | RuntimeError: The size of tensor a (2048) must match the size of tensor b (2176) at non-singleton dimension 3 |
| GPT-J-6B | Quantization | 64 | 1024 | 128 | 1183.81 | 15 | 44.39 | 91.28 | 94.62 | 103.23 | |
| GPT-J-6B | Quantization | 64 | 512 | 128 | 2032.33 | 15 | 27.35 | 54.08 | 94.62 | 70.22 | |
| GPT-J-6B | Quantization | 64 | 2048 | 2048 | | | | | | | RuntimeError: The size of tensor a (2048) must match the size of tensor b (4096) at non-singleton dimension 3 |
| GPT-J-6B | Quantization | 64 | 1024 | 1024 | | | | | | | RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::26424115200 (25200)MB |
| GPT-J-6B | Quantization | 64 | 512 | 512 | 2514.55 | 15 | 40.12 | 82.57 | 94.62 | 113.1 | |

With this PR:

| Model | Mode | batch_size | max_input_tokens | max_new_tokens | Throughput (tokens/second) | Number of HPU graphs | Memory allocated (GB) | Max memory allocated (GB) | Total memory available (GB) | Graph compilation duration (seconds) | Error |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-J-6B | Measure | 1 | 128 | 128 | 87.77 | 21 | 11.52 | 11.57 | 94.62 | 6.44 | |
| GPT-J-6B | Quantization | 1024 | 128 | 128 | | | | | | | RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::52848230400 (50400)MB |
| GPT-J-6B | Quantization | 640 | 128 | 128 | 5822.32 | 74 | 56.74 | 94.62 | 94.62 | 115.87 | |
| GPT-J-6B | Quantization | 320 | 128 | 128 | 5654.14 | 74 | 31.34 | 65.78 | 94.62 | 85.91 | |
| GPT-J-6B | Quantization | 256 | 128 | 128 | 5717.59 | 74 | 26.27 | 53.58 | 94.62 | 80.37 | |
| GPT-J-6B | Quantization | 120 | 2048 | 128 | | | | | | | RuntimeError: The size of tensor a (2048) must match the size of tensor b (2176) at non-singleton dimension 3 |
| GPT-J-6B | Quantization | 120 | 1024 | 128 | 1533.06 | 74 | 48.53 | 78.33 | 94.62 | 103.88 | |
| GPT-J-6B | Quantization | 120 | 368 | 128 | 3323.75 | 74 | 24.32 | 48.44 | 94.62 | 86.0 | |
| GPT-J-6B | Quantization | 120 | 128 | 2048 | | | | | | | RuntimeError: The size of tensor a (2048) must match the size of tensor b (2176) at non-singleton dimension 3 |
| GPT-J-6B | Quantization | 120 | 128 | 1024 | 3910.81 | 74 | 48.53 | 78.33 | 94.62 | 169.21 | |
| GPT-J-6B | Quantization | 120 | 128 | 624 | 4892.65 | 74 | 33.77 | 70.57 | 94.62 | 118.27 | |
| GPT-J-6B | Quantization | 120 | 128 | 512 | 5022.39 | 74 | 29.64 | 61.05 | 94.62 | 101.95 | |
| GPT-J-6B | Quantization | 120 | 128 | 256 | 5714.56 | 74 | 20.19 | 38.94 | 94.62 | 80.3 | |
| GPT-J-6B | Quantization | 64 | 2048 | 128 | | | | | | | RuntimeError: The size of tensor a (2048) must match the size of tensor b (2176) at non-singleton dimension 3 |
| GPT-J-6B | Quantization | 64 | 1024 | 128 | 1464.83 | 74 | 28.66 | 58.54 | 94.62 | 84.13 | |
| GPT-J-6B | Quantization | 64 | 512 | 128 | 2402.4 | 74 | 18.58 | 35.04 | 94.62 | 73.62 | |
| GPT-J-6B | Quantization | 64 | 2048 | 2048 | | | | | | | RuntimeError: The size of tensor a (2048) must match the size of tensor b (4096) at non-singleton dimension 3 |
| GPT-J-6B | Quantization | 64 | 1024 | 1024 | 2431.85 | 74 | 48.18 | 76.33 | 94.62 | 140.18 | |
| GPT-J-6B | Quantization | 64 | 512 | 512 | 3365.81 | 74 | 26.15 | 52.71 | 94.62 | 91.72 | |

kplau1128 · May 06 '24 17:05

Fixed the "RuntimeError: The size of tensor a (2048) must match the size of tensor b (2176) at non-singleton dimension 3" issue.

kplau1128 · May 07 '24 03:05

@kplau1128 what's the latest test result for this?

libinta · May 23 '24 20:05

@kplau1128 what's the latest test result for this?

@libinta I have updated GS-103; here is the latest test result:

| Model | Mode | batch_size | bucket_size | max_input_tokens | max_new_tokens | Throughput (tokens/second) | Number of HPU graphs | Memory allocated (GB) | Max memory allocated (GB) | Total memory available (GB) | Graph compilation duration (seconds) | Error |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-J-6B | Measure | 1 | | 128 | 128 | 88.79 | 19 | 11.51 | 11.53 | 94.62 | 6.41 | |
| GPT-J-6B | Quantization | 1024 | | 128 | 128 | | | | | | | RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::26424115200 (25200)MB |
| GPT-J-6B | Quantization | 928 | | 128 | 128 | 8186.81 | 72 | 68.47 | 94.61 | 94.62 | 121.32 | |
| GPT-J-6B | Quantization | 120 | 128 | 128 | 2048 | 5114.44 | 492 | 64.32 | 70.6 | 94.62 | 599.7 | |
| GPT-J-6B | Quantization | 64 | | 2048 | 128 | 794.87 | 72 | 49.45 | 77.64 | 94.62 | 98.83 | |
| GPT-J-6B | Quantization | 64 | 128 | 2048 | 2048 | | | | | | | RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::26424115200 (25200)MB |
| GPT-J-6B | Quantization | 60 | 128 | 2048 | 2048 | | | | | | | RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::12096000 (11.5356)MB |
| GPT-J-6B | Quantization | 56 | 128 | 2048 | 2048 | 2041.35 | 492 | 66.2 | 90.89 | 94.62 | 625.4 | |
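The bucket_size column corresponds to KV-cache bucketing, which caps the number of graph shapes compiled for long generations; a sketch of how the 120 x 128->2048 row might be launched, assuming the script exposes a --bucket_size flag:

    # Long-generation run with bucketing (the --bucket_size flag is assumed here).
    QUANT_CONFIG=./quantization_config/maxabs_quant.json python run_generation.py \
      --model_name_or_path EleutherAI/gpt-j-6b \
      --use_hpu_graphs \
      --use_kv_cache \
      --reuse_cache \
      --bucket_size 128 \
      --max_input_tokens 128 \
      --max_new_tokens 2048 \
      --batch_size 120 \
      --bf16 \
      --fp8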

kplau1128 · May 24 '24 01:05

This PR was replaced by https://github.com/huggingface/optimum-habana/pull/1094.

kplau1128 · Jul 15 '24 18:07