text-generation-inference
Option to use CPU instead
Feature request
I'd like to run this on CPU
Motivation
Proof of concept
Your contribution
Not sure if I'm doing something wrong or if the codebase is intended to be runnable only on GPU. I'm seeing a lot of GPU references like
https://github.com/huggingface/text-generation-inference/blob/main/launcher/src/main.rs#L445
I'm getting
{"error":"Request failed during generation: Server error: attention_scores_2d must be a CUDA tensor","error_type":"generation"}
when running this on GCP:
gcloud run deploy text-generation-inference \
--allow-unauthenticated \
--project=${PROJECT_ID} \
--image=${LOCATION}-docker.pkg.dev/${PROJECT_ID}/${REPOSITORY}/text-generation-inference:latest \
--platform=managed \
--region=us-central1 \
--cpu=4 \
--memory=8Gi \
--set-env-vars=TRANSFORMERS_CACHE=/tmp \
--set-env-vars=HF_MODEL_ID=bigscience/bloom
URL=$(gcloud run services describe text-generation-inference --region=us-central1 --format="value(status.url)")
curl ${URL}/generate \
-X POST \
-d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":17}}' \
-H 'Content-Type: application/json'
If I set CUDA_VISIBLE_DEVICES= (empty), it works on CPU.
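For illustration, a minimal sketch of why the empty value forces the CPU path (assuming the variable is set before torch initializes CUDA; the env var can equally be set on the docker run command line as above):

import os

# Hide all GPUs from CUDA before torch touches the driver
os.environ["CUDA_VISIBLE_DEVICES"] = ""

import torch

print(torch.cuda.is_available())   # False: no visible CUDA devices
device = torch.device("cpu")       # so model loading falls back to CPU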
If I may I'd like to add to this issue. I do think there needs to be a little bit more control over devices.
I like to run everything locally on my own GPU. Currently the device mapping appears to be auto:

import torch
from transformers import AutoModelForCausalLM

# device_map is either "auto" (when more than one GPU is available) or None;
# there is no way to pass a custom map from the outside
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    revision=revision,
    torch_dtype=dtype,
    device_map="auto"
    if torch.cuda.is_available() and torch.cuda.device_count() > 1
    else None,
    load_in_8bit=quantize == "bitsandbytes",
    trust_remote_code=trust_remote_code,
)
This basically uses all of the GPU and completely ignores any other devices. Since I have a fairly small GPU (8 GB VRAM), only small models work, even though I'd be totally fine offloading to the CPU for the small overhead.
HF does have a mechanism in place to use a custom device_map: https://huggingface.co/docs/accelerate/usage_guides/big_modeling
I propose adding an environment variable DEVICE_MAP holding a JSON string. If set, it would be parsed and passed as device_map; otherwise, the behavior would stay auto.
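A rough sketch of what this could look like (DEVICE_MAP is the proposed variable, not something the server supports today, and the model id is just a placeholder):

import json
import os

import torch
from transformers import AutoModelForCausalLM

# Proposed: read an optional JSON device map from the environment,
# e.g. DEVICE_MAP='{"transformer.word_embeddings": 0, "transformer.h.0": "cpu"}'
raw_map = os.environ.get("DEVICE_MAP")
device_map = json.loads(raw_map) if raw_map else "auto"

model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-560m",   # placeholder model id
    torch_dtype=torch.float16,
    device_map=device_map,     # a custom map can offload some layers to CPU
)

This would mirror the custom device_map mechanism described in the accelerate guide linked above.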
I tried setting CUDA_VISIBLE_DEVICES= but it failed:
docker run --shm-size 1g --net=host -p 8080:80 -v $PWD/Llama-2-7b-hf:/data -e HUGGING_FACE_HUB_TOKEN=$token -e HF_HUB_ENABLE_HF_TRANSFER=0 -e CUDA_VISIBLE_DEVICES= ghcr.io/huggingface/text-generation-inference:latest --model-id NousResearch/Llama-2-7b-hf
- logs
root@LLM-VM1:/docker# docker run --shm-size 1g --net=host -p 8080:80 -v $PWD/Llama-2-7b-hf:/data -e HUGGING_FACE_HUB_TOKEN=$token -e HF_HUB_ENABLE_HF_TRANSFER=0 -e CUDA_VISIBLE_DEVICES= ghcr.io/huggingface/text-generation-inference:latest --model-id NousResearch/Llama-2-7b-hf
WARNING: Published ports are discarded when using host network mode
2023-07-27T03:59:31.356541Z INFO text_generation_launcher: Args { model_id: "NousResearch/Llama-2-7b-hf", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, hostname: "LLM-VM1", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false }
2023-07-27T03:59:31.356649Z INFO download: text_generation_launcher: Starting download process.
2023-07-27T03:59:33.834368Z INFO text_generation_launcher: Files are already present on the host. Skipping download.
2023-07-27T03:59:34.260064Z INFO download: text_generation_launcher: Successfully downloaded weights.
2023-07-27T03:59:34.260293Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2023-07-27T03:59:36.326232Z WARN text_generation_launcher: Could not import Flash Attention enabled models: CUDA is not available
2023-07-27T03:59:44.269890Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2023-07-27T03:59:54.278654Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2023-07-27T03:59:56.910803Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:
/opt/conda/lib/python3.9/site-packages/bitsandbytes/cextension.py:33: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
warn("The installed version of bitsandbytes was compiled without GPU support. " rank=0
2023-07-27T03:59:56.910832Z ERROR shard-manager: text_generation_launcher: Shard process was signaled to shutdown with signal 9 rank=0
Error: ShardCannotStart
2023-07-27T03:59:56.991007Z ERROR text_generation_launcher: Shard 0 failed to start
2023-07-27T03:59:56.991025Z INFO text_generation_launcher: Shutting down shards
Hi,
Disclaimer: CPU support is 'best-effort' only. The reason is that it's a very different problem space from what we're trying to solve here; things like offloading come to mind, which we will not add in order to keep things consistent and manageable.
My guess is that something is OOMing or panicking somehow, which would be why the actual error is not reported here.
Could you try launching text-generation-server serve XXXX to see if that shows a better error message?
"Shard process was signaled to shutdown with signal 9": signal 9 = KILLED. There is a high probability that you don't have enough RAM.
@OlivierDehaene Thanks for the suggestion. I was doing this on AWS with a g5.xlarge instance type and getting OOM; I bumped up the instance type to g5.2xlarge and was able to get it running.