
inference-worker exits when trying models other than distilgpt2 (on a non-GPU system)

stelterlab opened this issue 1 year ago • 2 comments

Is it possible to run the worker with models other than distilgpt2 on a non-GPU system?

After successfully launching the services (profiles ci + inference) with the distilgpt2 model, I tried to start it with other models (e.g. OA_SFT_Pythia_12B_4), but the inference-worker container fails after waiting for the inference server to be ready.

The inference-server reports that it has started:

2023-04-30 15:19:04.225 | WARNING  | oasst_inference_server.routes.workers:clear_worker_sessions:288 - Clearing worker sessions
2023-04-30 15:19:04.227 | WARNING  | oasst_inference_server.routes.workers:clear_worker_sessions:291 - Successfully cleared worker sessions
2023-04-30 15:19:04.227 | WARNING  | main:welcome_message:119 - Inference server started
2023-04-30 15:19:04.227 | WARNING  | main:welcome_message:120 - To stop the server, press Ctrl+C

but the inference-worker stops after a minute of waiting:

2023-04-30T15:22:39.170299Z  INFO text_generation_launcher: Starting shard 0
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
2023-04-30 15:22:40.215 | INFO     | __main__:main:25 - Inference protocol version: 1
2023-04-30 15:22:40.215 | WARNING  | __main__:main:28 - Model config: model_id='OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5' max_input_length=1024 max_total_length=2048 quantized=False
2023-04-30 15:22:40.756 | WARNING  | __main__:main:37 - Tokenizer OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5 vocab size: 50254
2023-04-30 15:22:40.759 | WARNING  | utils:wait_for_inference_server:71 - Inference server not ready. Retrying in 6.22 seconds
2023-04-30 15:22:46.991 | WARNING  | utils:wait_for_inference_server:71 - Inference server not ready. Retrying in 1.95 seconds
2023-04-30 15:22:48.947 | WARNING  | utils:wait_for_inference_server:71 - Inference server not ready. Retrying in 5.09 seconds
2023-04-30T15:22:49.194599Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-04-30 15:22:54.040 | WARNING  | utils:wait_for_inference_server:71 - Inference server not ready. Retrying in 7.65 seconds
2023-04-30T15:22:59.210490Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-04-30 15:23:01.699 | WARNING  | utils:wait_for_inference_server:71 - Inference server not ready. Retrying in 2.74 seconds
2023-04-30 15:23:04.442 | WARNING  | utils:wait_for_inference_server:71 - Inference server not ready. Retrying in 7.90 seconds
2023-04-30T15:23:09.226492Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-04-30 15:23:12.356 | WARNING  | utils:wait_for_inference_server:71 - Inference server not ready. Retrying in 4.10 seconds
2023-04-30 15:23:16.460 | WARNING  | utils:wait_for_inference_server:71 - Inference server not ready. Retrying in 3.25 seconds
2023-04-30T15:23:19.238026Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-04-30 15:23:19.718 | WARNING  | utils:wait_for_inference_server:71 - Inference server not ready. Retrying in 3.76 seconds
2023-04-30 15:23:23.479 | WARNING  | utils:wait_for_inference_server:71 - Inference server not ready. Retrying in 0.70 seconds
2023-04-30 15:23:24.182 | WARNING  | utils:wait_for_inference_server:71 - Inference server not ready. Retrying in 3.74 seconds
2023-04-30 15:23:27.929 | WARNING  | utils:wait_for_inference_server:71 - Inference server not ready. Retrying in 3.04 seconds
2023-04-30T15:23:29.248026Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-04-30 15:23:30.976 | WARNING  | utils:wait_for_inference_server:71 - Inference server not ready. Retrying in 6.88 seconds
2023-04-30 15:23:37.864 | WARNING  | utils:wait_for_inference_server:71 - Inference server not ready. Retrying in 2.89 seconds
2023-04-30T15:23:39.259110Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-04-30 15:23:40.757 | WARNING  | utils:wait_for_inference_server:71 - Inference server not ready. Retrying in 5.52 seconds
2023-04-30 15:23:46.287 | WARNING  | utils:wait_for_inference_server:71 - Inference server not ready. Retrying in 7.90 seconds
2023-04-30T15:23:49.288384Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-04-30T15:23:51.887480Z ERROR text_generation_launcher: Shard 0 failed to start:
/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/bitsandbytes/cextension.py:127: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers and GPU quantization are unavailable.
  warn("The installed version of bitsandbytes was compiled without GPU support. "
We're not using custom kernels.

2023-04-30T15:23:51.887567Z  INFO text_generation_launcher: Shutting down shards
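
Judging from the log, the worker polls the inference server's readiness and sleeps a randomized interval between attempts. A minimal sketch of what such a loop might look like (my own reconstruction from the log only; the function signature, URL handling, jitter range, and timeout below are assumptions, not the project's actual utils.wait_for_inference_server):

import random
import time

import requests

def wait_for_inference_server(url: str, timeout: float = 60.0) -> None:
    # Poll the server until it responds, sleeping a jittered interval
    # between attempts, as the "Retrying in X seconds" lines suggest.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if requests.get(url).ok:
                return
        except requests.exceptions.ConnectionError:
            pass
        delay = random.uniform(0.5, 8.0)  # assumed jitter range
        print(f"Inference server not ready. Retrying in {delay:.2f} seconds")
        time.sleep(delay)
    raise TimeoutError("inference server did not become ready in time")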

The system running the container is an OpenStack instance with 8 vCPUs and 32 GB of RAM, running Ubuntu 22.04. I have plenty of vCPUs and RAM, but sadly no GPU yet to run tests with.

Before running docker compose up, I just set MODEL_CONFIG_NAME to OA_SFT_Pythia_12B_4 as an environment variable.
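
For reference, the full invocation looked roughly like this (a sketch of my setup; the profile names are the ones mentioned above):

# run from the repository root
export MODEL_CONFIG_NAME=OA_SFT_Pythia_12B_4
docker compose --profile ci --profile inference up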

This message baffles me: "None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used." Shouldn't at least one of them be installed? (I assume they are pulled in by the huggingface/transformers requirement.)
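
In case it helps with debugging: that warning comes from transformers itself, and which backend it actually sees can be checked directly inside the container (a quick diagnostic sketch, nothing project-specific):

# run in a Python shell inside the inference-worker container
from transformers import is_flax_available, is_tf_available, is_torch_available

print("torch:", is_torch_available())  # expected True for text generation
print("tf:   ", is_tf_available())
print("flax: ", is_flax_available())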

Thanks in advance!

stelterlab · Apr 30 '23 16:04

Hi,

Did you manage to make it work? I have the same situation.

darfire · May 09 '23 13:05

No. I hoped for some feedback on whether it is possible at all, as in other solutions like FastChat (which does have an option for CPU-only inference). But after some testing with FastChat I learned that larger models (7B/13B) need a GPU to get decent response times.

So I got my hands on an instance from a cloud GPU service provider with reasonable prices (like Lambda Labs) and tested there (not Open Assistant yet). Now that I know what the memory usage of some models is (7B ~8 GB, 13B ~28 GB), I'm thinking of an affordable desktop GPU with 10/12 GB to play with smaller models (while dreaming of an NVIDIA A100 ;-).
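
For a rough sanity check on those numbers: the weights alone take about params × bytes-per-param, ignoring activations and KV cache (a back-of-the-envelope sketch, not a measurement):

def weight_gib(params_billion: float, bytes_per_param: float) -> float:
    # weights only; runtime adds activations, KV cache, and framework overhead
    return params_billion * 1e9 * bytes_per_param / 1024**3

print(f"{weight_gib(7, 2):.1f}")   # ~13.0 GiB for a 7B model in fp16
print(f"{weight_gib(13, 2):.1f}")  # ~24.2 GiB for a 13B model in fp16
print(f"{weight_gib(7, 1):.1f}")   # ~6.5 GiB for 7B quantized to int8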

https://cloud-gpus.com/ - for an overview of providers

stelterlab · May 09 '23 14:05