
Flash attention hangs when running an openchat model inside a docker container

Open · zeionara opened this issue 8 months ago · 1 comment

Hi, I've written the following Dockerfile to set up dependencies and run an openchat model that uses flash attention. However, the server hangs on startup.

FROM nvidia/cuda:12.4.0-devel-ubuntu22.04

# Base Python tooling plus the ochat server
RUN apt-get update && apt-get install -y python3-pip && apt-get clean
RUN pip3 install packaging torch && pip3 install ochat && pip3 cache purge

# git, then the pinned flash-attn build
RUN apt-get install -y git
RUN pip3 install flash_attn==2.5.8

# $model and $port are expected to be set at runtime (e.g. via docker run -e)
ENTRYPOINT python3 -m ochat.serving.openai_api_server --model $model --host 0.0.0.0 --port $port
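
The ENTRYPOINT reads $model and $port from the environment, so the container is started along these lines (the image name and port value below are illustrative, not the exact ones I use):

sudo docker run --gpus all \
    -e model=openchat/openchat-3.5-0106-gemma \
    -e port=18888 \
    openchat-server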

The following log is emitted, after which the container hangs and I can't even stop it with sudo docker stop.

INFO 03-10 13:58:32 __init__.py:207] Automatically detected platform cuda.
2025-03-10 13:58:33,122	WARNING services.py:2022 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67100672 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=10.24gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
2025-03-10 13:58:33,250	INFO worker.py:1821 -- Started a local Ray instance.
INFO 03-10 13:58:40 config.py:549] This model supports multiple tasks: {'embed', 'reward', 'score', 'generate', 'classify'}. Defaulting to 'generate'.
INFO 03-10 13:58:40 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.3) with config: model='openchat/openchat-3.5-0106-gemma', speculative_config=None, tokenizer='openchat/openchat-3.5-0106-gemma', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=openchat/openchat-3.5-0106-gemma, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,
INFO 03-10 13:58:43 cuda.py:229] Using Flash Attention backend.
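
The /dev/shm warning above suggests raising the shared-memory size; per the message itself, that would mean adding --shm-size to the same run command (again with the illustrative image name), though I'm not sure this is related to the hang:

sudo docker run --gpus all --shm-size=10.24gb \
    -e model=openchat/openchat-3.5-0106-gemma \
    -e port=18888 \
    openchat-server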

Please help me fix this.

zeionara avatar Mar 10 '25 14:03 zeionara

It could be any other code that's hanging.
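
One way to see what's actually stuck is to sample the Python stacks inside the running container, e.g. with py-spy (a rough sketch; py-spy isn't in the image, and ptrace may need to be allowed with --cap-add SYS_PTRACE):

sudo docker exec -it <container> pip3 install py-spy
# PID 1 here assumes the server is the container's main process; adjust if not
sudo docker exec -it <container> py-spy dump --pid 1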

tridao avatar Mar 13 '25 04:03 tridao