text-generation-inference
Option to use CPU instead
Feature request
I'd like to run this on CPU
Motivation
Proof of concept
Your contribution
Not sure if I'm doing something wrong or if the codebase is intended to be runnable only on GPU. I'm seeing a lot of GPU references like
https://github.com/huggingface/text-generation-inference/blob/main/launcher/src/main.rs#L445
I'm getting
{"error":"Request failed during generation: Server error: attention_scores_2d must be a CUDA tensor","error_type":"generation"}
when running this on GCP:
gcloud run deploy text-generation-inference \
--allow-unauthenticated \
--project=${PROJECT_ID} \
--image=${LOCATION}-docker.pkg.dev/${PROJECT_ID}/${REPOSITORY}/text-generation-inference:latest \
--platform=managed \
--region=us-central1 \
--cpu=4 \
--memory=8Gi \
--set-env-vars=TRANSFORMERS_CACHE=/tmp \
--set-env-vars=HF_MODEL_ID=bigscience/bloom
URL=$(gcloud run services describe text-generation-inference --region=us-central1 --format="value(status.url)")
curl ${URL}/generate \
-X POST \
-d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":17}}' \
-H 'Content-Type: application/json'
If I set CUDA_VISIBLE_DEVICES= (empty), it works on CPU.
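For illustration, a minimal sketch of why the empty value forces the CPU path (assuming the variable is set before torch initializes CUDA; the env var can equally be set on the docker run command line as above):

import os

# Hide all GPUs from CUDA before torch touches the driver
os.environ["CUDA_VISIBLE_DEVICES"] = ""

import torch

print(torch.cuda.is_available())   # False: no visible CUDA devices
device = torch.device("cpu")       # so model loading falls back to CPU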
If I may I'd like to add to this issue. I do think there needs to be a little bit more control over devices.
I like to run everything locally on my own GPU. Currently the device mapping appears to be auto:

import torch
from transformers import AutoModelForCausalLM

# device_map is either "auto" (when more than one GPU is available) or None;
# there is no way to pass a custom map from the outside
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    revision=revision,
    torch_dtype=dtype,
    device_map="auto"
    if torch.cuda.is_available() and torch.cuda.device_count() > 1
    else None,
    load_in_8bit=quantize == "bitsandbytes",
    trust_remote_code=trust_remote_code,
)
This basically uses all of the GPU and completely ignores any other devices. Since I have a fairly small GPU (8 GB VRAM), only small models work, even though I'd be totally fine offloading to the CPU for the small overhead.
HF does have a mechanism in place to use a custom device_map: https://huggingface.co/docs/accelerate/usage_guides/big_modeling
I propose adding an environment variable DEVICE_MAP holding a JSON string. If set, it would be parsed and passed as device_map; otherwise, the behavior would stay auto.
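A rough sketch of what this could look like (DEVICE_MAP is the proposed variable, not something the server supports today, and the model id is just a placeholder):

import json
import os

import torch
from transformers import AutoModelForCausalLM

# Proposed: read an optional JSON device map from the environment,
# e.g. DEVICE_MAP='{"transformer.word_embeddings": 0, "transformer.h.0": "cpu"}'
raw_map = os.environ.get("DEVICE_MAP")
device_map = json.loads(raw_map) if raw_map else "auto"

model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-560m",   # placeholder model id
    torch_dtype=torch.float16,
    device_map=device_map,     # a custom map can offload some layers to CPU
)

This would mirror the custom device_map mechanism described in the accelerate guide linked above.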
I tried setting CUDA_VISIBLE_DEVICES= but it failed:
docker run --shm-size 1g --net=host -p 8080:80 -v $PWD/Llama-2-7b-hf:/data -e HUGGING_FACE_HUB_TOKEN=$token -e HF_HUB_ENABLE_HF_TRANSFER=0 -e CUDA_VISIBLE_DEVICES= ghcr.io/huggingface/text-generation-inference:latest --model-id NousResearch/Llama-2-7b-hf
- logs
root@LLM-VM1:/docker# docker run --shm-size 1g --net=host -p 8080:80 -v $PWD/Llama-2-7b-hf:/data -e HUGGING_FACE_HUB_TOKEN=$token -e HF_HUB_ENABLE_HF_TRANSFER=0 -e CUDA_VISIBLE_DEVICES= ghcr.io/huggingface/text-generation-inference:latest --model-id NousResearch/Llama-2-7b-hf
WARNING: Published ports are discarded when using host network mode
2023-07-27T03:59:31.356541Z INFO text_generation_launcher: Args { model_id: "NousResearch/Llama-2-7b-hf", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, hostname: "LLM-VM1", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false }
2023-07-27T03:59:31.356649Z INFO download: text_generation_launcher: Starting download process.
2023-07-27T03:59:33.834368Z INFO text_generation_launcher: Files are already present on the host. Skipping download.
2023-07-27T03:59:34.260064Z INFO download: text_generation_launcher: Successfully downloaded weights.
2023-07-27T03:59:34.260293Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2023-07-27T03:59:36.326232Z WARN text_generation_launcher: Could not import Flash Attention enabled models: CUDA is not available
2023-07-27T03:59:44.269890Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2023-07-27T03:59:54.278654Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2023-07-27T03:59:56.910803Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:
/opt/conda/lib/python3.9/site-packages/bitsandbytes/cextension.py:33: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
warn("The installed version of bitsandbytes was compiled without GPU support. " rank=0
2023-07-27T03:59:56.910832Z ERROR shard-manager: text_generation_launcher: Shard process was signaled to shutdown with signal 9 rank=0
Error: ShardCannotStart
2023-07-27T03:59:56.991007Z ERROR text_generation_launcher: Shard 0 failed to start
2023-07-27T03:59:56.991025Z INFO text_generation_launcher: Shutting down shards
Hi,
Disclaimer: CPU support is 'best-effort' only. The reason is that it's a very different problem space from what we're trying to solve here; things like offloading come to mind, which we will not add in order to keep things consistent and manageable.
My guess is that something is OOMing or panicking somehow, which would be why the actual error is not reported here.
Could you try launching text-generation-server serve XXXX to see if that shows a better error message?
"Shard process was signaled to shutdown with signal 9": signal 9 = KILLED. There is a high probability that you don't have enough RAM.
@OlivierDehaene Thanks for the suggestion. I was doing this on AWS with a g5.xlarge instance type and getting OOM; I bumped up the instance type to g5.2xlarge and was able to get it running.