Inference of chatglm3-6b with int4 and an 8k input prompt fails
bigdl-llm: 2.5.0b20240321, all-in-one benchmark tool. The 8k prompt is taken from https://github.com/intel/xFasterTransformer/blob/main/benchmark/prompt.json

```
2024-03-22 20:38:03,260 - INFO - intel_extension_for_pytorch auto imported
Loading checkpoint shards: 100%|██████████| 7/7 [00:04<00:00, 1.52it/s]
2024-03-22 20:38:08,095 - INFO - Converting the current model to sym_int4 format......
loading of model costs 8.393956548999995s and 3.583984375GB
<class 'transformers_modules.chatglm3-6b.modeling_chatglm.ChatGLMForConditionalGeneration'>
/build/intel-pytorch-extension/csrc/gpu/aten/operators/Indexing.h:670: operator(): global id: [524544,0,0], local id: [256,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed
/build/intel-pytorch-extension/csrc/gpu/aten/operators/Indexing.h:670: operator(): global id: [524545,0,0], local id: [257,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed
... (the same assertion repeats for global ids [524546,0,0] through [524549,0,0], local ids [258,0,0] through [261,0,0])
```
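For reference, here is a minimal reproduction sketch of the load-and-generate path that triggers the error. It mirrors the all-in-one benchmark's int4 setup rather than copying its exact code; the local model path and the `"8192"` key in prompt.json are assumptions:

```python
# Minimal repro sketch (assumptions: a local chatglm3-6b checkout, and a
# prompt.json from xFasterTransformer with an "8192" entry for the 8k prompt;
# this is not the benchmark tool's exact code).
import json
import torch
from bigdl.llm.transformers import AutoModel
from transformers import AutoTokenizer

model_path = "./chatglm3-6b"  # hypothetical local path

# load_in_4bit=True converts the weights to sym_int4,
# matching the "Converting the current model to sym_int4" log line above
model = AutoModel.from_pretrained(model_path, load_in_4bit=True,
                                  trust_remote_code=True)
model = model.to("xpu")  # intel_extension_for_pytorch is auto-imported by bigdl-llm
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

with open("prompt.json") as f:
    prompt = json.load(f)["8192"]  # assumed key for the 8k prompt

input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("xpu")
with torch.inference_mode():
    # fails with the device-side "index out of bounds" assertions shown above
    output = model.generate(input_ids, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```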
This should be the same issue as https://github.com/intel-analytics/ipex-llm/issues/10513.
In our tests, when the input length is larger than 8166 tokens, it produces the same error as above. When the input length is smaller than or equal to 8166 tokens, it instead produces an IPEX memory-allocation error, similar to the llama2 8k issue (a boundary probe is sketched below).
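The 8166-token boundary can be located by truncating the tokenized 8k prompt to different lengths. A sketch, reusing `model`, `tokenizer`, and `prompt` from the repro above (the helper name is illustrative):

```python
# Probe the failure boundary by truncating the tokenized prompt
# (illustrative sketch; reuses model/tokenizer/prompt from the repro above).
import torch

def run_with_length(n_tokens):
    ids = tokenizer(prompt, return_tensors="pt").input_ids[:, :n_tokens].to("xpu")
    with torch.inference_mode():
        model.generate(ids, max_new_tokens=1)

run_with_length(8166)  # IPEX allocation error (like the llama2 8k issue)
run_with_length(8167)  # device-side "index out of bounds" assertion
```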
The root cause is likely the same. Further investigation is needed.