text-generation-inference
How can I deactivate Flash Attention?
System Info
I am trying to serve the polyglot5.8b model and get the following error whenever the server processes a request:

Server error: Expected (head_size % 8 == 0) && (head_size <= 128) to be true, but got false.

I am looking for a way to deactivate Flash Attention so that this check is no longer triggered.

Steps to reproduce:
- Deploy text-generation-inference with the polyglot5.8b model.
- Send a generation request to the server.
- Observe the error message above.

Expected behavior: the request completes without errors and returns the generated output.

Additional information:
- Python version: [Insert Python version here]
- polyglot5.8b package version: [Insert package version here]
- Any other relevant information or code snippets that could help in resolving the issue.
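For context, the check that fails is the flash-attention kernel's constraint on the per-head dimension: head_size must be a multiple of 8 and at most 128, where head_size = hidden_size / num_attention_heads. A quick way to see whether a checkpoint satisfies this is to read those values from its config. This is only a sketch, and the Hub id below is an assumption for "polyglot5.8b"; substitute the checkpoint you actually serve:

```python
# Minimal sketch: check whether a model's head size satisfies the
# flash-attention constraint (head_size % 8 == 0 and head_size <= 128).
# "EleutherAI/polyglot-ko-5.8b" is an assumed repo id -- replace it with
# the checkpoint you are actually serving.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("EleutherAI/polyglot-ko-5.8b")
head_size = config.hidden_size // config.num_attention_heads

print(f"hidden_size={config.hidden_size}, "
      f"num_attention_heads={config.num_attention_heads}, "
      f"head_size={head_size}")

if head_size % 8 == 0 and head_size <= 128:
    print("head_size is compatible with the flash-attention kernel")
else:
    print("head_size violates the kernel constraint, matching the error above")
```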
Information
- [ ] Docker
- [ ] The CLI directly
Tasks
- [ ] An officially supported command
- [ ] My own modifications
Reproduction
error message
2023-06-08T15:02:01.057346Z ERROR shard-manager: text_generation_launcher: Method Prefill encountered an error.
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 311, in __call__
return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 778, in main
return _main(
File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 216, in _main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 683, in wrapper
return callback(**use_params) # type: ignore
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 67, in serve
server.serve(model_id, revision, sharded, quantize, trust_remote_code, uds_path)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 155, in serve
asyncio.run(serve_inner(model_id, revision, sharded, quantize, trust_remote_code))
File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 634, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 601, in run_forever
self._run_once()
File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 1905, in _run_once
handle._run()
File "/opt/conda/lib/python3.9/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "/opt/conda/lib/python3.9/site-packages/grpc_interceptor/server.py", line 159, in invoke_intercept_method
return await self.intercept(
> File "/opt/conda/lib/python3.9/site-packages/text_generation_server/interceptor.py", line 20, in intercept
return await response
File "/opt/conda/lib/python3.9/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor
raise error
File "/opt/conda/lib/python3.9/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
return await behavior(request_or_iterator, context)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 61, in Prefill
generations, next_batch = self.model.generate_token(batch)
File "/opt/conda/lib/python3.9/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_causal_lm.py", line 589, in generate_token
out, present = self.forward(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_causal_lm.py", line 561, in forward
return self.model.forward(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_neox_modeling.py", line 486, in forward
hidden_states, present = self.gpt_neox(
File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_neox_modeling.py", line 420, in forward
hidden_states, residual = layer(
File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_neox_modeling.py", line 253, in forward
attn_output = self.attention(
File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_neox_modeling.py", line 117, in forward
flash_attn_cuda.fwd(
RuntimeError: Expected (head_size % 8 == 0) && (head_size <= 128) to be true, but got false. (Could this error message be improved? If so, please report an enhancement request to PyTorch.)
rank=0
2023-06-08T15:02:01.057640Z ERROR batch{batch_size=1}:prefill:prefill{id=0 size=1}:prefill{id=0 size=1}: text_generation_client: router/client/src/lib.rs:33: Server error: Expected (head_size % 8 == 0) && (head_size <= 128) to be true, but got false. (Could this error message be improved? If so, please report an enhancement request to PyTorch.)
2023-06-08T15:02:01.058762Z ERROR HTTP request{otel.name=POST /generate http.client_ip= http.flavor=1.1 http.host=127.0.0.1:8080 http.method=POST http.route=/generate http.scheme=HTTP http.target=/generate http.user_agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36 otel.kind=server trace_id=9f79ad18fff5fc85f72bf484890eecc3}:generate{parameters=GenerateParameters { best_of: Some(1), temperature: Some(0.5), repetition_penalty: Some(1.03), top_k: Some(10), top_p: Some(0.95), typical_p: Some(0.95), do_sample: true, max_new_tokens: 20, return_full_text: Some(false), stop: ["photographer"], truncate: None, watermark: true, details: true, decoder_input_details: true, seed: None }}:generate{request=GenerateRequest { inputs: "My name is Olivier and I", parameters: GenerateParameters { best_of: Some(1), temperature: Some(0.5), repetition_penalty: Some(1.03), top_k: Some(10), top_p: Some(0.95), typical_p: Some(0.95), do_sample: true, max_new_tokens: 20, return_full_text: Some(false), stop: ["photographer"], truncate: None, watermark: true, details: true, decoder_input_details: true, seed: None } }}:generate_stream{request=GenerateRequest { inputs: "My name is Olivier and I", parameters: GenerateParameters { best_of: Some(1), temperature: Some(0.5), repetition_penalty: Some(1.03), top_k: Some(10), top_p: Some(0.95), typical_p: Some(0.95), do_sample: true, max_new_tokens: 20, return_full_text: Some(false), stop: ["photographer"], truncate: None, watermark: true, details: true, decoder_input_details: true, seed: None } }}:infer:send_error: text_generation_router::infer: router/src/infer.rs:533: Request failed during generation: Server error: Expected (head_size % 8 == 0) && (head_size <= 128) to be true, but got false. (Could this error message be improved? If so, please report an enhancement request to PyTorch.)
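The traceback shows the failure surfacing at the flash_attn_cuda.fwd call in flash_neox_modeling.py, which is only reached on the flash code path. Conceptually, "deactivating" Flash Attention means routing that call through a standard attention implementation, which has no head-size restriction. The sketch below is purely illustrative (it assumes PyTorch >= 2.0 for scaled_dot_product_attention) and is not the actual text-generation-inference code:

```python
# Illustrative sketch only, NOT the text-generation-inference implementation.
# Assumes torch >= 2.0.
import torch
import torch.nn.functional as F

FLASH_MAX_HEAD_SIZE = 128  # limit enforced by the flash-attention kernel


def supports_flash_attention(head_size: int) -> bool:
    # Mirrors the check that raised the RuntimeError above.
    return head_size % 8 == 0 and head_size <= FLASH_MAX_HEAD_SIZE


def fallback_attention(q, k, v, causal=True):
    # q, k, v: [batch, num_heads, seq_len, head_size].
    # Standard scaled dot-product attention has no head-size restriction,
    # so it can serve models whose head_size fails the flash constraint.
    return F.scaled_dot_product_attention(q, k, v, is_causal=causal)


# Example: a model with head_size=256 (e.g. hidden_size 4096 over 16 heads)
# fails the flash check and would need the fallback path.
q = k = v = torch.randn(1, 16, 8, 256)
assert not supports_flash_attention(q.shape[-1])
out = fallback_attention(q, k, v)
print(out.shape)  # torch.Size([1, 16, 8, 256])
```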
Expected behavior
https://github.com/huggingface/text-generation-inference/issues/236
I have a similar error to the one in that issue, but there seems to be no explanation of how to solve it. Please help.
Please provide the necessary information.
As with the issue at https://github.com/huggingface/text-generation-inference/issues/236, I also tried to run polyglot5.8b but hit the same error and could not get it to work, and I don't know how to solve it. The model loads and shards successfully, but the error occurs as soon as I make an API request to the server.
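For reference, the request that triggers the error is visible in the router log above: a POST to /generate with a short prompt and sampling parameters. A minimal reproduction against a locally running server might look like this (it assumes the server listens on 127.0.0.1:8080, as in the log):

```python
# Minimal reproduction of the failing request, based on the parameters
# visible in the router log above. Assumes the server is reachable at
# http://127.0.0.1:8080 as in the log.
import requests

payload = {
    "inputs": "My name is Olivier and I",
    "parameters": {
        "best_of": 1,
        "temperature": 0.5,
        "repetition_penalty": 1.03,
        "top_k": 10,
        "top_p": 0.95,
        "typical_p": 0.95,
        "do_sample": True,
        "max_new_tokens": 20,
        "return_full_text": False,
        "stop": ["photographer"],
        "watermark": True,
        "details": True,
        "decoder_input_details": True,
    },
}

response = requests.post("http://127.0.0.1:8080/generate", json=payload)
print(response.status_code)
print(response.json())  # the "head_size" server error shows up in the body
```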
Sorry, we need the information requested in the new-issue template: everything about your environment and the commands you are running.
I am closing this for now, since it's impossible to work on the issue in its current state.
Feel free to reopen once you have provided everything we need to reproduce it on our end (the new-issue template lists all the details).