text-generation-inference

How can I deactivate Flash Attention?

Open kissngg opened this issue 2 years ago • 2 comments

System Info

I understand that you are experiencing the following error:

    Server error: Expected (head_size % 8 == 0) && (head_size <= 128) to be true, but got false.

and that you are looking for a way to deactivate Flash Attention to resolve it. To report this issue on GitHub, you can mention the error you encountered and provide the necessary information, including the version of the polyglot5.8b package you are using. Here is an example of how you could describe the issue in English:

Issue Description: I encountered an error while using the polyglot5.8b package, specifically when running the code. The error message I received is as follows:

    Server error: Expected (head_size % 8 == 0) && (head_size <= 128) to be true, but got false.

Steps to Reproduce:

  1. Install the polyglot5.8b package.
  2. Run the code that triggers the error.
  3. Observe the error message above.

Expected Behavior: I expected the code to run without any errors and produce the desired output.

Additional Information:

  • Python version: [Insert Python version here]
  • polyglot5.8b package version: [Insert package version here]
  • Any other relevant information or code snippets that could help in resolving the issue.

By providing the above information, you can clearly communicate the problem you encountered and help the GitHub community understand and address the issue effectively.
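For context, the failing check comes from the Flash Attention CUDA kernel, which only accepts per-head dimensions that are multiples of 8 and at most 128 (this is exactly what the error message states). A quick way to see whether a given checkpoint satisfies that constraint is to compute head_size from its config. The snippet below is a minimal sketch; the model id is my assumption based on "polyglot5.8b" in the report, so substitute whichever checkpoint you are actually serving.

```python
# Minimal sketch: compute the per-head dimension for a checkpoint and check
# it against the constraint from the error message
# (head_size % 8 == 0 and head_size <= 128).
# The model id is an assumption based on "polyglot5.8b" in this report.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("EleutherAI/polyglot-ko-5.8b")
head_size = config.hidden_size // config.num_attention_heads
compatible = head_size % 8 == 0 and head_size <= 128
print(f"head_size = {head_size}, flash-attn compatible: {compatible}")
```

If the computed head_size violates either condition, the Flash Attention kernel will reject the model, as in the traceback below, and a non-flash code path (or a model with a compatible head dimension) is needed.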

Information

  • [ ] Docker
  • [ ] The CLI directly

Tasks

  • [ ] An officially supported command
  • [ ] My own modifications

Reproduction

Error message:

2023-06-08T15:02:01.057346Z ERROR shard-manager: text_generation_launcher: Method Prefill encountered an error.
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 67, in serve
    server.serve(model_id, revision, sharded, quantize, trust_remote_code, uds_path)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 155, in serve
    asyncio.run(serve_inner(model_id, revision, sharded, quantize, trust_remote_code))
  File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 634, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 601, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 1905, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.9/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/opt/conda/lib/python3.9/site-packages/grpc_interceptor/server.py", line 159, in invoke_intercept_method
    return await self.intercept(
> File "/opt/conda/lib/python3.9/site-packages/text_generation_server/interceptor.py", line 20, in intercept
    return await response
  File "/opt/conda/lib/python3.9/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor
    raise error
  File "/opt/conda/lib/python3.9/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
    return await behavior(request_or_iterator, context)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 61, in Prefill
    generations, next_batch = self.model.generate_token(batch)
  File "/opt/conda/lib/python3.9/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_causal_lm.py", line 589, in generate_token
    out, present = self.forward(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_causal_lm.py", line 561, in forward
    return self.model.forward(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_neox_modeling.py", line 486, in forward
    hidden_states, present = self.gpt_neox(
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_neox_modeling.py", line 420, in forward
    hidden_states, residual = layer(
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_neox_modeling.py", line 253, in forward
    attn_output = self.attention(
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_neox_modeling.py", line 117, in forward
    flash_attn_cuda.fwd(
RuntimeError: Expected (head_size % 8 == 0) && (head_size <= 128) to be true, but got false.  (Could this error message be improved?  If so, please report an enhancement request to PyTorch.)
 rank=0
2023-06-08T15:02:01.057640Z ERROR batch{batch_size=1}:prefill:prefill{id=0 size=1}:prefill{id=0 size=1}: text_generation_client: router/client/src/lib.rs:33: Server error: Expected (head_size % 8 == 0) && (head_size <= 128) to be true, but got false.  (Could this error message be improved?  If so, please report an enhancement request to PyTorch.)
2023-06-08T15:02:01.058762Z ERROR HTTP request{otel.name=POST /generate http.client_ip= http.flavor=1.1 http.host=127.0.0.1:8080 http.method=POST http.route=/generate http.scheme=HTTP http.target=/generate http.user_agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36 otel.kind=server trace_id=9f79ad18fff5fc85f72bf484890eecc3}:generate{parameters=GenerateParameters { best_of: Some(1), temperature: Some(0.5), repetition_penalty: Some(1.03), top_k: Some(10), top_p: Some(0.95), typical_p: Some(0.95), do_sample: true, max_new_tokens: 20, return_full_text: Some(false), stop: ["photographer"], truncate: None, watermark: true, details: true, decoder_input_details: true, seed: None }}:generate{request=GenerateRequest { inputs: "My name is Olivier and I", parameters: GenerateParameters { best_of: Some(1), temperature: Some(0.5), repetition_penalty: Some(1.03), top_k: Some(10), top_p: Some(0.95), typical_p: Some(0.95), do_sample: true, max_new_tokens: 20, return_full_text: Some(false), stop: ["photographer"], truncate: None, watermark: true, details: true, decoder_input_details: true, seed: None } }}:generate_stream{request=GenerateRequest { inputs: "My name is Olivier and I", parameters: GenerateParameters { best_of: Some(1), temperature: Some(0.5), repetition_penalty: Some(1.03), top_k: Some(10), top_p: Some(0.95), typical_p: Some(0.95), do_sample: true, max_new_tokens: 20, return_full_text: Some(false), stop: ["photographer"], truncate: None, watermark: true, details: true, decoder_input_details: true, seed: None } }}:infer:send_error: text_generation_router::infer: router/src/infer.rs:533: Request failed during generation: Server error: Expected (head_size % 8 == 0) && (head_size <= 128) to be true, but got false.  (Could this error message be improved?  If so, please report an enhancement request to PyTorch.)
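Note: the traceback above fails inside flash_attn_cuda.fwd, i.e. in the Flash Attention kernel itself. I am not certain which text-generation-inference versions expose a switch for this, but later releases reportedly check a USE_FLASH_ATTENTION environment variable when the Python server imports its flash-attention kernels and fall back to the non-flash code path when it is set to false. The sketch below assumes your installed version honours that variable (the model id is also an assumption); verify against the version you are actually running.

```python
# Sketch only: launch the server with Flash Attention disabled via the
# USE_FLASH_ATTENTION environment variable. Whether the installed TGI
# version honours this variable is an assumption -- check your version.
import os
import subprocess

env = dict(os.environ, USE_FLASH_ATTENTION="false")
subprocess.run(
    [
        "text-generation-launcher",
        "--model-id", "EleutherAI/polyglot-ko-5.8b",  # assumed model id
        "--port", "8080",                             # matches the router log above
    ],
    env=env,
    check=True,
)
```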

Expected behavior

https://github.com/huggingface/text-generation-inference/issues/236

I am hitting a similar error to the one in that issue, but there seems to be no explanation of how to solve it. Please help.

kissngg avatar Jun 08 '23 12:06 kissngg

Please provide the necessary information.

Narsil avatar Jun 08 '23 14:06 Narsil

> Please provide the necessary information.

As in the issue at https://github.com/huggingface/text-generation-inference/issues/236, I also tried to run polyglot5.8b, but I encountered the same error and could not get it running. I don't know how to solve this. The model loads and shards successfully, but the error occurs as soon as I make an API request to the LLM server.
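For reference, the request that triggers the failure can be reconstructed from the router log in the issue body. The sketch below copies the host, port, and generation parameters from that log; adjust them to your own setup.

```python
# Sketch of the /generate request from the router log above; host, port and
# parameters are copied from that log, so adjust them to your own setup.
import requests

resp = requests.post(
    "http://127.0.0.1:8080/generate",
    json={
        "inputs": "My name is Olivier and I",
        "parameters": {
            "temperature": 0.5,
            "repetition_penalty": 1.03,
            "top_k": 10,
            "top_p": 0.95,
            "typical_p": 0.95,
            "do_sample": True,
            "max_new_tokens": 20,
            "return_full_text": False,
            "stop": ["photographer"],
            "watermark": True,
            "details": True,
        },
    },
    timeout=60,
)
print(resp.status_code, resp.text)
```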

kissngg avatar Jun 08 '23 14:06 kissngg

Sorry, we need the information requested in the new-issue prompt: everything about your environment and the commands you are running.

I am closing this for now, since it is impossible to work on the issue in its current state.

Feel free to reopen once you have provided everything we need to reproduce it on our end. (Check the new-issue prompt for all the details.)

Narsil avatar Jun 08 '23 21:06 Narsil