text-generation-webui Implement ZeRO inference

It seems like ZeRO inference could improve the performance of offloading to RAM/NVMe. I don't know if Hugging Face's accelerate is already using it, but if not, it would be a great feature to add.

Feb 01 '23 02:02 Silver267

This is definitely worth looking into. There is also ONNX, which has a similar promise of improving inference speeds.

I have never really understood how to use those engines and how production-ready they are compared to transformers.

Feb 01 '23 02:02 oobabooga

Here's a draft with some ideas: #43

There would be 3 new arguments for server.py:

--deepspeed enables DeepSpeed ZeRO-3 inference with CPU offloading
--nvme-offload-dir optionally points to an offload directory that should be on an NVME drive
--bf16 for bfloat16 precision if you have an Ampere GPU, otherwise omit it and float16 is used
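
As a rough illustration of the three flags above, here is a minimal argparse sketch. The help strings and the `build_parser` name are assumptions for this example and not necessarily what server.py contains; `--local_rank` is included because the DeepSpeed launcher passes it to the script.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser covering only the three flags described above."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--deepspeed", action="store_true",
                        help="Enable DeepSpeed ZeRO-3 inference with CPU offloading.")
    parser.add_argument("--nvme-offload-dir", type=str, default=None,
                        help="Optional offload directory, ideally on an NVME drive.")
    parser.add_argument("--bf16", action="store_true",
                        help="Use bfloat16 instead of float16 (Ampere or newer GPU).")
    # The DeepSpeed launcher appends --local_rank=<n> when it starts the script.
    parser.add_argument("--local_rank", type=int, default=0)
    return parser


args = build_parser().parse_args(
    ["--deepspeed", "--nvme-offload-dir", "/mnt/offload", "--local_rank=0"]
)
print(args.deepspeed, args.nvme_offload_dir, args.bf16)
```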

DeepSpeed must be installed: pip install deepspeed==0.8.0

Using --nvme-offload-dir requires installing libaio-dev. For example: apt install libaio-dev

Running should be done with the DeepSpeed launcher:

deepspeed --num_gpus=1 server.py --cai-chat --deepspeed

For NVME offloading:

deepspeed --num_gpus=1 server.py --cai-chat --deepspeed --nvme-offload-dir /mnt/offload

YMMV. While I haven't tested multi-GPU setups yet, in my tests the VRAM usage on a single card was greatly optimized.
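
For anyone curious what such a setup looks like on the DeepSpeed side, below is a minimal sketch of a ZeRO-3 config dict with parameter offloading. The function name and its arguments are made up for this example, and the values in the draft PR may differ; the keys themselves (`zero_optimization`, `stage`, `offload_param`, `device`, `nvme_path`, `pin_memory`) are standard DeepSpeed config keys.

```python
from typing import Any, Dict, Optional


def build_zero3_config(bf16: bool = False,
                       nvme_offload_dir: Optional[str] = None,
                       world_size: int = 1) -> Dict[str, Any]:
    """Return a ZeRO-3 inference config that offloads parameters to CPU or NVME."""
    if nvme_offload_dir is None:
        # Default: keep the partitioned parameters in pinned CPU memory.
        offload_param = {"device": "cpu", "pin_memory": True}
    else:
        # NVME offloading additionally requires libaio on the system.
        offload_param = {
            "device": "nvme",
            "nvme_path": nvme_offload_dir,
            "pin_memory": True,
        }
    return {
        # Exactly one of the two half-precision modes is enabled.
        "fp16": {"enabled": not bf16},
        "bf16": {"enabled": bf16},
        "zero_optimization": {"stage": 3, "offload_param": offload_param},
        # DeepSpeed expects batch size fields even when only doing inference.
        "train_micro_batch_size_per_gpu": 1,
        "train_batch_size": world_size,
    }


print(build_zero3_config(nvme_offload_dir="/mnt/offload"))
```

The Hugging Face docs describe passing a dict like this to `HfDeepSpeedConfig` before calling `from_pretrained`, so that the weights are partitioned while the model is being loaded.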

Feb 01 '23 13:02 ghost

That's very exciting. I will test and merge it later today.

Other than VRAM usage, did you see a noticeable improvement in the text generation speed?

Feb 01 '23 13:02 oobabooga

Other than VRAM usage, did you see a noticeable improvement in the text generation speed?

I've seen the opposite, probably due to the partitioning that happens under ZeRO-3. It seems like using this would only make sense if you have large models to load, or if you want to make use of multiple GPUs. The Hugging Face docs do warn about performance and give a few more tuning tips here:

It’s important to understand that ZeRO-3 enables a much higher scalability capacity at a cost of speed. https://huggingface.co/docs/transformers/main_classes/deepspeed#zero2-vs-zero3-performance

Another thing which is a bit confusing is that the ZeRO-Inference that's integrated into transformers (and tried above in that commit) is a different technology than the new DeepSpeed-Inference.

Here is a good rundown:

ZeRO-Inference is designed for inference applications that require GPU acceleration but lack sufficient GPU memory to host the model. Also, ZeRO-Inference is optimized for inference applications that are throughput-oriented and allow large batch sizes. Alternative techniques, such as Accelerate, DeepSpeed-Inference, and DeepSpeed-MII that fit the entire model into GPU memory, possibly using multiple GPUs, are more suitable for inference applications that are latency sensitive or have small batch sizes. https://www.deepspeed.ai/2022/09/09/zero-inference.html#when-to-use-zero-inference

And an interesting article: https://towardsdatascience.com/deepspeed-deep-dive-model-implementations-for-inference-mii-b02aa5d5e7f7

Now, I've also tried DeepSpeed-Inference briefly, but it has a number of bugs that are being worked on regarding split Hugging Face checkpoints (like Pygmalion 6B), bad output, and model incompatibility. Worth keeping an eye on, however.

Feb 01 '23 15:02 ghost

I have accepted the PR and have some observations:

* With Pygmalion-6b, the VRAM usage is decreased from 12GB to something like 5.5GB, and the performance seems to be better than what would be obtained with --auto-devices --gpu-memory 6.
* While trying to load a very large model without --nvme-offload-dir, my system froze due to lack of RAM.
* I couldn't use --nvme-offload-dir because it threw an error about me not having GLIBCXX_3.4.30 installed.

Feb 02 '23 13:02 oobabooga

I couldn't use --nvme-offload-dir because it threw an error about me not having GLIBCXX_3.4.30 installed.

Could be something to do with Conda, try:

$ conda install -c conda-forge gcc

While trying to load a very large model without --nvme-offload-dir, my system froze due to lack of RAM

I've seen that. Limiting the memory with cgroups can help:

$ systemd-run --user --scope -p MemoryHigh=15G -p MemoryMax=16G bash
$ conda activate textgen
$ deepspeed --num_gpus=1 server.py --model pygmalion-6b --cai-chat --deepspeed
(.....)
DeepSpeed ZeRO-3 is enabled: True
Loaded the model in 98.31 seconds.
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.

Dialogue tokenized to:


|  So how did you get into computer engineering?

Feb 02 '23 19:02 ghost

Installing the latest gcc version with conda worked, but then --nvme-offload-dir threw a generic error about jit software/hardware incompatibility when I tried to load a model.

As for limiting the maximum RAM with systemd-run, that caused deepspeed to become unresponsive and never load the model, even after several minutes.

Feb 02 '23 20:02 oobabooga

--nvme-offload-dir threw a generic error about jit software/hardware incompatibility

Strange, could you check what ds_report says?

As for limiting the maximum RAM with systemd-run, that caused deepspeed to become unresponsive and never load the model, even after several minutes.

Admittedly, the test above had 8GB of swap available, and loading was much, much slower (nearly 2 minutes).

I want to verify if using models that are sharded into smaller chunks really makes a difference for the initial RAM requirement.

Feb 02 '23 21:02 ghost

Is there any actual benefit in using bfloat16 if the card supports it (Ampere & Lovelace)? Better output? Better speed?

Feb 03 '23 01:02 Manimap

@Manimap, the docs claim it's faster. There's also a caveat for fp16:

enable bf16 if you own an Ampere or a newer GPU to make things faster. If you don’t have that hardware you may enable fp16 as long as you don’t use any model that was pre-trained in bf16 mixed precision (such as most t5 models). These usually overflow in fp16 and you will see garbage as output.
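
The overflow caveat can be illustrated without a GPU. The sketch below uses only the standard library: float16 is handled by `struct`'s `e` format, and bfloat16 is emulated by keeping the top 16 bits of the float32 encoding (truncation rather than proper rounding, which is a simplification). The helper names are made up for this example.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to float16; values outside the fp16 range become +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses to pack values beyond the largest finite fp16 (65504).
        return float("inf") if x > 0 else float("-inf")


def to_bf16(x: float) -> float:
    """Emulate bfloat16 by zeroing the low 16 bits of the float32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


value = 70000.0  # an activation magnitude slightly above the fp16 maximum
print("fp16:", to_fp16(value))  # overflows to inf
print("bf16:", to_bf16(value))  # stays finite, with reduced precision
```

bfloat16 keeps the 8-bit exponent of float32, so it covers the same range with fewer mantissa bits, while float16 has more mantissa bits but a much smaller range. That is why a model trained in bf16 can produce values that overflow in fp16.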

Feb 03 '23 07:02 ghost

Sharding appears to help.

For instance, trying to load the unsharded OPT-13B-Erebus model with 30GB of CPU RAM, 8GB of swap and NVME offloading led to OOM.

$ ls models/OPT-13B-Erebus
config.json  LICENSE.md  merges.txt  pytorch_model.bin  README.md  special_tokens_map.json  tokenizer_config.json  vocab.json

$ systemd-run --user --scope -p MemoryHigh=28G -p MemoryMax=30G bash

$ /usr/bin/time -f %M deepspeed --num_gpus=1 server.py --model OPT-13B-Erebus --notebook --deepspeed --nvme-offload-dir /mnt/offload/

[] [INFO] [launch.py:162:main] dist_world_size=1
[] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0
[] [INFO] [comm.py:657:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Loading OPT-13B-Erebus...
[] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 21388
[] [ERROR] [launch.py:324:sigkill_handler] ['/usr/bin/python3', '-u', 'server.py', '--local_rank=0', '--model', 'OPT-13B-Erebus', '--notebook', '--deepspeed', '--nvme-offload-dir', '/mnt/offload/'] exits with return code = -9
Command exited with non-zero status 247
30723720
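
For reference, the last line of that output is the peak resident set size that `/usr/bin/time -f %M` reports, in kilobytes. A quick conversion, assuming the 30G limit is interpreted as 30 GiB (systemd parses size suffixes in base 1024):

```python
def kib_to_gib(kib: int) -> float:
    """Convert kibibytes to gibibytes."""
    return kib / (1024 * 1024)


peak_kib = 30723720   # value printed by /usr/bin/time -f %M
limit_gib = 30        # MemoryMax=30G from the systemd-run command

peak_gib = kib_to_gib(peak_kib)
print(f"peak RSS: {peak_gib:.2f} GiB "
      f"({peak_gib / limit_gib:.1%} of the {limit_gib} GiB limit)")
```

That works out to roughly 29.3 GiB, which together with the return code of -9 (SIGKILL) is consistent with the process being killed at the memory limit. Peak RSS and cgroup accounting are not measured the same way, so the comparison is only approximate.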

OPT-13B-Erebus sharded into 1GB chunks on the other hand could be loaded and the peak RAM usage looked better.

$ ls models/OPT-13B-Erebus-sharded
config.json                       pytorch_model-00005-of-00028.bin  pytorch_model-00011-of-00028.bin  pytorch_model-00017-of-00028.bin  pytorch_model-00023-of-00028.bin  pytorch_model.bin.index.json
merges.txt                        pytorch_model-00006-of-00028.bin  pytorch_model-00012-of-00028.bin  pytorch_model-00018-of-00028.bin  pytorch_model-00024-of-00028.bin  special_tokens_map.json
pytorch_model-00001-of-00028.bin  pytorch_model-00007-of-00028.bin  pytorch_model-00013-of-00028.bin  pytorch_model-00019-of-00028.bin  pytorch_model-00025-of-00028.bin  tokenizer_config.json
pytorch_model-00002-of-00028.bin  pytorch_model-00008-of-00028.bin  pytorch_model-00014-of-00028.bin  pytorch_model-00020-of-00028.bin  pytorch_model-00026-of-00028.bin  tokenizer.json
pytorch_model-00003-of-00028.bin  pytorch_model-00009-of-00028.bin  pytorch_model-00015-of-00028.bin  pytorch_model-00021-of-00028.bin  pytorch_model-00027-of-00028.bin  vocab.json
pytorch_model-00004-of-00028.bin  pytorch_model-00010-of-00028.bin  pytorch_model-00016-of-00028.bin  pytorch_model-00022-of-00028.bin  pytorch_model-00028-of-00028.bin

$ systemd-run --user --scope -p MemoryHigh=28G -p MemoryMax=30G bash

$ deepspeed --num_gpus=1 server.py --model OPT-13B-Erebus-sharded --notebook --deepspeed --nvme-offload-dir /mnt/offload/

[] [INFO] [launch.py:162:main] dist_world_size=1
[] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0
[] [INFO] [comm.py:657:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Loading OPT-13B-Erebus-sharded...

[] [INFO] [partition_parameters.py:413:__exit__] finished initializing model with 13.11B parameters

DeepSpeed ZeRO-3 is enabled: True
Loaded the model in 86.68 seconds.
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
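
To make the intuition behind the sharding result concrete, here is a simplified sketch of greedy shard planning. It is an illustration rather than the exact algorithm transformers uses, and the tensor names and sizes are made up. The point is that a loader reading one shard at a time only needs about as much RAM as the largest shard, not the whole checkpoint.

```python
from typing import Dict, List


def plan_shards(tensor_sizes: Dict[str, int], max_shard_bytes: int) -> List[List[str]]:
    """Greedily pack tensors, in order, into shards of at most max_shard_bytes."""
    shards: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for name, size in tensor_sizes.items():
        # Start a new shard when adding this tensor would exceed the limit.
        if current and current_size + size > max_shard_bytes:
            shards.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        shards.append(current)
    return shards


def largest_shard_bytes(tensor_sizes: Dict[str, int], shards: List[List[str]]) -> int:
    """Size of the biggest shard, a rough proxy for peak RAM while loading."""
    return max(sum(tensor_sizes[name] for name in shard) for shard in shards)


# Hypothetical tensor sizes in bytes.
sizes = {"embed": 400, "layer.0": 400, "layer.1": 400, "lm_head": 900}
shards = plan_shards(sizes, max_shard_bytes=1000)
print(shards, largest_shard_bytes(sizes, shards))
```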

Feb 03 '23 12:02 ghost

This is useful to know @81300

Could this be used to instantiate models on Colab without a huge RAM usage? If so, it could be possible to initialize a new notebook by only installing the requirements.txt with pip, without the need to install conda and pytorch from scratch.

Feb 03 '23 12:02 oobabooga

@Manimap, the docs claim it's faster. There's also a caveat for fp16:

enable bf16 if you own an Ampere or a newer GPU to make things faster. If you don’t have that hardware you may enable fp16 as long as you don’t use any model that was pre-trained in bf16 mixed precision (such as most t5 models). These usually overflow in fp16 and you will see garbage as output.

Alright thanks, so faster for those who can run it, and maybe some pros of training models in this "mixed precision" in particular.

Feb 03 '23 17:02 Manimap

@oobabooga,

I tested Colab today. For larger models the ZeRO-3 CPU/NVME offloading makes heavy use of CPU RAM anyway. Google's safety mechanisms seem very sensitive, they will kill your process even if DeepSpeed would not actually run out of memory.

You can't use cgroups to throttle properly because the Colab runtime is within an unprivileged container. For that same reason you cannot create swap. The DeepSpeed config doesn't provide any knobs for max RAM to offload with (they have an open issue).

That said, you can of course disable offloading entirely and successfully instantiate a sharded Pygmalion 6B model onto the GPU with ZeRO-3. This requires very little CPU RAM - just the size of the biggest shard. But in this scenario the Nvidia T4 will run out of VRAM once you begin generating text. Maybe tuning allgather_bucket_size and reduce_bucket_size could reduce the VRAM footprint.

By the way, I discovered that a presharded Pygmalion 6B consisting of 2GB chunks instantiates just fine on the free Colab w/o DeepSpeed, 8-bit mode (https://github.com/oobabooga/text-generation-webui/issues/14#issuecomment-1402981335), auto-devices or Conda. Inference works. However the sharding must be done on a system with sufficient memory so I had to rehost the model (not ideal).
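
For anyone who wants to reshard a checkpoint themselves, a minimal sketch with transformers follows. The `reshard` helper and the directory names are hypothetical, and this is not necessarily the script used for the rehost; `max_shard_size` and `safe_serialization` are existing arguments of `save_pretrained`. As noted above, the machine doing this still needs enough RAM to hold the full model once.

```python
def reshard(src_dir: str, dst_dir: str,
            max_shard_size: str = "2GB",
            safe_serialization: bool = False) -> None:
    """Load a checkpoint once and write it back out in smaller shards."""
    # Imported lazily so that the module can be imported without torch installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # This step needs enough CPU RAM for the whole model in float16.
    model = AutoModelForCausalLM.from_pretrained(src_dir, torch_dtype=torch.float16)
    model.save_pretrained(
        dst_dir,
        max_shard_size=max_shard_size,          # upper bound per shard file
        safe_serialization=safe_serialization,  # True writes .safetensors files
    )
    # Copy the tokenizer files so that the output directory is self-contained.
    AutoTokenizer.from_pretrained(src_dir).save_pretrained(dst_dir)
```

It could be called as `reshard("models/pygmalion-6b", "models/pygmalion-6b-sharded")`, where both paths are placeholders.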


Can play with a test notebook here. As you suggested, it doesn't install Conda and therefore loads up much quicker!

Feb 03 '23 20:02 ghost

@81300 with your resharded+safetensors rehost, the Colab loading times for pygmalion-6b have been reduced from 12 minutes to 5 minutes. Amazing! Thank you so much for this.

Indeed, using a rehost is not as pretty as lazy loading the model from disk the way the Kobold client does, but at the same time this is allowed by the creativeml-openrail-m license of pygmalion.

Google's safety mechanisms seem very sensitive, they will kill your process even if DeepSpeed would not actually run out of memory.

Yes, I have also noticed that. It's very annoying.

ZeRO-3 was not necessary for Colab for now, but maybe it will be later. On your computer, are you using it as your default way of offloading layers (instead of --auto-devices)?

Feb 03 '23 22:02 oobabooga

Just saying, but I made a pytorch bin file to safetensor converter that runs locally based on this if anyone is interested: pytorch-to-safetensor-converter

Feb 04 '23 03:02 Silver267

@Silver267 I am interested, thank you for making this.

Feb 04 '23 03:02 oobabooga

On your computer, are you using it as your default way of offloading layers (instead of --auto-devices)?

Yes, I've been using it for CPU offloading mostly. In --cai-chat mode it made long character contexts more manageable w/o running out of VRAM.

@Silver267 - nice. In case it's useful for your project, I resharded Pygmalion using this.

Feb 04 '23 12:02 ghost

@81300 Thanks for the information! Though the code doesn't seem to support RAM offload (my VRAM is 8GB), it would still be a useful reference.

Feb 04 '23 21:02 Silver267

I have accepted the PR and have some observations:

* With Pygmalion-6b, the VRAM usage is decreased from 12GB to something like 5.5GB, and the performance seems to be better than what would be obtained with `--auto-devices --gpu-memory 6`.

* While trying to load a very large model without `--nvme-offload-dir`, my system froze due to lack of RAM.

* I couldn't use `--nvme-offload-dir` because it threw an error about me not having `GLIBCXX_3.4.30` installed.

I am also getting the same GLIBCXX_3.4.30 error. Then I got the same generic error about jit software/hardware incompatibility too. Are you on Arch btw?

I found that libaio has issues with DeepSpeed on Arch Linux.

Feb 07 '23 11:02 lolxdmainkaisemaanlu

For some reason, when I run with DeepSpeed I get:

[2023-02-21 02:51:30,191] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0
 Warning: chat mode currently becomes somewhat slower with text streaming on.
 Consider starting the web UI with the --no-stream option.

 [2023-02-21 02:51:33,044] [INFO] [comm.py:657:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
 Loading pygmalion-6b...
 [2023-02-21 02:51:35,197] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 60
 [2023-02-21 02:51:35,197] [ERROR] [launch.py:324:sigkill_handler] ['/usr/local/bin/python', '-u', 'server.py', '--local_rank=0', '--deepspeed', '--gpu-memory', '10', '--cai-chat', '--model=pygmalion-6b'] exits with return code = -11

I'm running inside Docker on WSL2. This happens on the line:

model = AutoModelForCausalLM.from_pretrained(Path(f"models/{model_name}"), torch_dtype=torch.bfloat16 if args.bf16 else torch.float16)

Feb 21 '23 03:02 ye7iaserag

Since ZeRO inference is implemented and seems to be working, closing this issue. Please open another issue if there are other problems.

Mar 31 '23 20:03 Silver267

This doesn't work in a multi-GPU setup, because each of the server.py processes started by the launcher tries to bind to the same web port, and all but one of them fail.
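
One possible direction, sketched below under assumptions, is to start the web UI only in the rank 0 process. The DeepSpeed launcher exports `RANK` and `LOCAL_RANK` to each process it starts; the `is_main_process` helper is hypothetical and not part of server.py. This would only avoid the port conflict: with ZeRO-3 the other ranks still have to take part in every forward pass, so the prompt would also need to be shared with them somehow.

```python
import os
from typing import Mapping, Optional


def is_main_process(environ: Optional[Mapping[str, str]] = None) -> bool:
    """Return True for the process that should own the web server (rank 0)."""
    env = os.environ if environ is None else environ
    # Fall back to LOCAL_RANK, then to 0 when not started by a launcher.
    rank = env.get("RANK", env.get("LOCAL_RANK", "0"))
    return int(rank) == 0


if is_main_process():
    print("rank 0: would launch the web UI here")
else:
    print("non-zero rank: would skip the web UI and only join generation")
```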

srun --nodes=1 --cpus-per-task 16  --gres=gpu:4 --pty ./run.sh 
The activation script must be sourced, otherwise the virtual environment will not work.
Setting vars
[2023-06-01 20:26:11,291] [WARNING] [runner.py:191:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
Detected CUDA_VISIBLE_DEVICES=0,1,2,3 but ignoring it because one or several of --include/--exclude/--num_gpus/--num_nodes cl args were used. If you want to use CUDA_VISIBLE_DEVICES don't pass any of these arguments to deepspeed.
[2023-06-01 20:26:11,409] [INFO] [runner.py:541:main] cmd = /p/haicluster/llama/text-generation-webui/sc_venv_template/venv/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None ./server.py --deepspeed --chat --threads 24 --listen-host 0.0.0.0 --listen-port 5000 --listen --xformers --sdp-attention --trust-remote-code
[2023-06-01 20:26:18,273] [INFO] [launch.py:222:main] 0 EBVERSIONNCCL=2.12.12
[2023-06-01 20:26:18,273] [INFO] [launch.py:222:main] 0 EBROOTNCCL=/easybuild/2020/software/NCCL/2.12.12-GCCcore-11.3.0-CUDA-11.7.0
[2023-06-01 20:26:18,273] [INFO] [launch.py:222:main] 0 EBDEVELNCCL=/easybuild/2020/software/NCCL/2.12.12-GCCcore-11.3.0-CUDA-11.7.0/easybuild/NCCL-2.12.12-GCCcore-11.3.0-CUDA-11.7.0-easybuild-devel
[2023-06-01 20:26:18,273] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
[2023-06-01 20:26:18,273] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=4, node_rank=0
[2023-06-01 20:26:18,273] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2023-06-01 20:26:18,273] [INFO] [launch.py:247:main] dist_world_size=4
[2023-06-01 20:26:18,273] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
WARNING:trust_remote_code is enabled. This is dangerous.
WARNING:trust_remote_code is enabled. This is dangerous.
WARNING:trust_remote_code is enabled. This is dangerous.
WARNING:trust_remote_code is enabled. This is dangerous.
[2023-06-01 20:26:31,737] [INFO] [comm.py:622:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
bin /p/haicluster/llama/text-generation-webui/sc_venv_template/venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so
bin /p/haicluster/llama/text-generation-webui/sc_venv_template/venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so
bin /p/haicluster/llama/text-generation-webui/sc_venv_template/venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so
bin /p/haicluster/llama/text-generation-webui/sc_venv_template/venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so
INFO:Loading settings from settings.json...
INFO:Loading settings from settings.json...
INFO:Loading settings from settings.json...
INFO:Loading settings from settings.json...
INFO:Loading the extension "gallery"...
INFO:Loading the extension "gallery"...
INFO:Loading the extension "gallery"...
INFO:Loading the extension "gallery"...
Running on local URL:  http://0.0.0.0:5000

To create a public link, set `share=True` in `launch()`.
ERROR:    Traceback (most recent call last):
  File "/p/haicluster/llama/text-generation-webui/sc_venv_template/venv/lib/python3.10/site-packages/uvicorn/server.py", line 161, in startup
    server = await loop.create_server(
  File "/easybuild/2020/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/asyncio/base_events.py", line 1505, in create_server
    raise OSError(err.errno, 'error while attempting '
OSError: [Errno 98] error while attempting to bind on address ('0.0.0.0', 5000): address already in use

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/easybuild/2020/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/easybuild/2020/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/asyncio/base_events.py", line 646, in run_until_complete
    return future.result()
  File "/p/haicluster/llama/text-generation-webui/sc_venv_template/venv/lib/python3.10/site-packages/uvicorn/server.py", line 78, in serve
    await self.startup(sockets=sockets)
  File "/p/haicluster/llama/text-generation-webui/sc_venv_template/venv/lib/python3.10/site-packages/uvicorn/server.py", line 169, in startup
    logger.error(exc)
  File "/easybuild/2020/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/logging/__init__.py", line 1506, in error
    self._log(ERROR, msg, args, **kwargs)
  File "/easybuild/2020/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/logging/__init__.py", line 1624, in _log
    self.handle(record)
  File "/easybuild/2020/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/logging/__init__.py", line 1634, in handle
    self.callHandlers(record)
  File "/easybuild/2020/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/logging/__init__.py", line 1696, in callHandlers
    hdlr.handle(record)
  File "/easybuild/2020/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/logging/__init__.py", line 968, in handle
    self.emit(record)
  File "/p/haicluster/llama/text-generation-webui/modules/logging_colors.py", line 96, in new
    args[1].msg = color + args[1].msg + '\x1b[0m'  # normal
TypeError: can only concatenate str (not "OSError") to str

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/p/haicluster/llama/text-generation-webui/sc_venv_template/venv/lib/python3.10/site-packages/starlette/routing.py", line 686, in lifespan
    await receive()
  File "/p/haicluster/llama/text-generation-webui/sc_venv_template/venv/lib/python3.10/site-packages/uvicorn/lifespan/on.py", line 139, in receive
    return await self.receive_queue.get()
  File "/easybuild/2020/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/asyncio/queues.py", line 159, in get
    await getter
asyncio.exceptions.CancelledError

Exception in thread Thread-1 (run):
Traceback (most recent call last):
  File "/p/haicluster/llama/text-generation-webui/sc_venv_template/venv/lib/python3.10/site-packages/uvicorn/server.py", line 161, in startup
    server = await loop.create_server(
  File "/easybuild/2020/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/asyncio/base_events.py", line 1505, in create_server
    raise OSError(err.errno, 'error while attempting '
OSError: [Errno 98] error while attempting to bind on address ('0.0.0.0', 5000): address already in use

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/easybuild/2020/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/threading.py", line 1009, in _bootstrap_inner
    self.run()
  File "/easybuild/2020/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/threading.py", line 946, in run
    server = await loop.create_server(
  File "/easybuild/2020/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/asyncio/base_events.py", line 1505, in create_server
    self._target(*self._args, **self._kwargs)
  File "/p/haicluster/llama/text-generation-webui/sc_venv_template/venv/lib/python3.10/site-packages/uvicorn/server.py", line 61, in run
    server = await loop.create_server(
  File "/easybuild/2020/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/asyncio/base_events.py", line 1505, in create_server
    raise OSError(err.errno, 'error while attempting '
OSError: [Errno 98] error while attempting to bind on address ('0.0.0.0', 5000): address already in use

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/easybuild/2020/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/threading.py", line 1009, in _bootstrap_inner
    raise OSError(err.errno, 'error while attempting '
OSError: [Errno 98] error while attempting to bind on address ('0.0.0.0', 5000): address already in use

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
      File "/easybuild/2020/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/threading.py", line 1009, in _bootstrap_inner
self.run()
  File "/easybuild/2020/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/threading.py", line 946, in run
        self.run()self._target(*self._args, **self._kwargs)

  File "/easybuild/2020/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/threading.py", line 946, in run
  File "/p/haicluster/llama/text-generation-webui/sc_venv_template/venv/lib/python3.10/site-packages/uvicorn/server.py", line 61, in run
    self._target(*self._args, **self._kwargs)
  File "/p/haicluster/llama/text-generation-webui/sc_venv_template/venv/lib/python3.10/site-packages/uvicorn/server.py", line 61, in run
    return asyncio.run(self.serve(sockets=sockets))
  File "/easybuild/2020/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/easybuild/2020/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/asyncio/base_events.py", line 646, in run_until_complete
    return future.result()
  File "/p/haicluster/llama/text-generation-webui/sc_venv_template/venv/lib/python3.10/site-packages/uvicorn/server.py", line 78, in serve
    return asyncio.run(self.serve(sockets=sockets))
  File "/easybuild/2020/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/easybuild/2020/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/asyncio/base_events.py", line 646, in run_until_complete
    return future.result()
  File "/p/haicluster/llama/text-generation-webui/sc_venv_template/venv/lib/python3.10/site-packages/uvicorn/server.py", line 78, in serve
    return asyncio.run(self.serve(sockets=sockets))
  File "/easybuild/2020/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/easybuild/2020/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/asyncio/base_events.py", line 646, in run_until_complete
    return future.result()
  File "/p/haicluster/llama/text-generation-webui/sc_venv_template/venv/lib/python3.10/site-packages/uvicorn/server.py", line 78, in serve
    await self.startup(sockets=sockets)
  File "/p/haicluster/llama/text-generation-webui/sc_venv_template/venv/lib/python3.10/site-packages/uvicorn/server.py", line 169, in startup
    await self.startup(sockets=sockets)
  File "/p/haicluster/llama/text-generation-webui/sc_venv_template/venv/lib/python3.10/site-packages/uvicorn/server.py", line 169, in startup
    await self.startup(sockets=sockets)
  File "/p/haicluster/llama/text-generation-webui/sc_venv_template/venv/lib/python3.10/site-packages/uvicorn/server.py", line 169, in startup
    logger.error(exc)
  File "/easybuild/2020/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/logging/__init__.py", line 1506, in error
    self._log(ERROR, msg, args, **kwargs)
  File "/easybuild/2020/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/logging/__init__.py", line 1624, in _log
    self.handle(record)
  File "/easybuild/2020/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/logging/__init__.py", line 1634, in handle
    self.callHandlers(record)
  File "/easybuild/2020/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/logging/__init__.py", line 1696, in callHandlers
    hdlr.handle(record)
  File "/easybuild/2020/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/logging/__init__.py", line 968, in handle
    self.emit(record)
  File "/p/haicluster/llama/text-generation-webui/modules/logging_colors.py", line 96, in new
    logger.error(exc)
  File "/easybuild/2020/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/logging/__init__.py", line 1506, in error
    self._log(ERROR, msg, args, **kwargs)
  File "/easybuild/2020/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/logging/__init__.py", line 1624, in _log
    self.handle(record)
  File "/easybuild/2020/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/logging/__init__.py", line 1634, in handle
    self.callHandlers(record)
  File "/easybuild/2020/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/logging/__init__.py", line 1696, in callHandlers
    hdlr.handle(record)
  File "/easybuild/2020/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/logging/__init__.py", line 968, in handle
    logger.error(exc)
  File "/easybuild/2020/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/logging/__init__.py", line 1506, in error
    args[1].msg = color + args[1].msg + '\x1b[0m'  # normal
TypeError: can only concatenate str (not "OSError") to str
    self.emit(record)
  File "/p/haicluster/llama/text-generation-webui/modules/logging_colors.py", line 96, in new
    self._log(ERROR, msg, args, **kwargs)
  File "/easybuild/2020/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/logging/__init__.py", line 1624, in _log
    self.handle(record)
  File "/easybuild/2020/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/logging/__init__.py", line 1634, in handle
    self.callHandlers(record)
  File "/easybuild/2020/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/logging/__init__.py", line 1696, in callHandlers
    hdlr.handle(record)
  File "/easybuild/2020/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/logging/__init__.py", line 968, in handle
    self.emit(record)
  File "/p/haicluster/llama/text-generation-webui/modules/logging_colors.py", line 96, in new
    args[1].msg = color + args[1].msg + '\x1b[0m'  # normal
TypeError: can only concatenate str (not "OSError") to str
    args[1].msg = color + args[1].msg + '\x1b[0m'  # normal
TypeError: can only concatenate str (not "OSError") to str
^C[2023-06-01 20:27:02,022] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 350800
Traceback (most recent call last):
Traceback (most recent call last):
  File "/p/haicluster/llama/text-generation-webui/./server.py", line 1109, in <module>
  File "/p/haicluster/llama/text-generation-webui/./server.py", line 1109, in <module>
Traceback (most recent call last):
  File "/p/haicluster/llama/text-generation-webui/./server.py", line 1109, in <module>
Traceback (most recent call last):
  File "/p/haicluster/llama/text-generation-webui/./server.py", line 1111, in <module>
[2023-06-01 20:27:02,024] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 350800
    create_interface()
  File "/p/haicluster/llama/text-generation-webui/./server.py", line 1014, in create_interface
    create_interface()
  File "/p/haicluster/llama/text-generation-webui/./server.py", line 1014, in create_interface
    create_interface()
  File "/p/haicluster/llama/text-generation-webui/./server.py", line 1014, in create_interface
    time.sleep(0.5)
KeyboardInterrupt
    shared.gradio['interface'].launch(prevent_thread_lock=True, share=shared.args.share, server_name=shared.args.listen_host or '0.0.0.0', server_port=shared.args.listen_port, inbrowser=shared.args.auto_launch, auth=auth)
  File "/p/haicluster/llama/text-generation-webui/sc_venv_template/venv/lib/python3.10/site-packages/gradio/blocks.py", line 1743, in launch
    shared.gradio['interface'].launch(prevent_thread_lock=True, share=shared.args.share, server_name=shared.args.listen_host or '0.0.0.0', server_port=shared.args.listen_port, inbrowser=shared.args.auto_launch, auth=auth)
  File "/p/haicluster/llama/text-generation-webui/sc_venv_template/venv/lib/python3.10/site-packages/gradio/blocks.py", line 1743, in launch
    shared.gradio['interface'].launch(prevent_thread_lock=True, share=shared.args.share, server_name=shared.args.listen_host or '0.0.0.0', server_port=shared.args.listen_port, inbrowser=shared.args.auto_launch, auth=auth)
  File "/p/haicluster/llama/text-generation-webui/sc_venv_template/venv/lib/python3.10/site-packages/gradio/blocks.py", line 1743, in launch
    server_name, server_port, local_url, app, server = networking.start_server(
  File "/p/haicluster/llama/text-generation-webui/sc_venv_template/venv/lib/python3.10/site-packages/gradio/networking.py", line 161, in start_server
    server_name, server_port, local_url, app, server = networking.start_server(
  File "/p/haicluster/llama/text-generation-webui/sc_venv_template/venv/lib/python3.10/site-packages/gradio/networking.py", line 161, in start_server
Traceback (most recent call last):
  File "/p/haicluster/llama/text-generation-webui/sc_venv_template/venv/bin/deepspeed", line 6, in <module>
[2023-06-01 20:27:02,123] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 350800
    main()
  File "/p/haicluster/llama/text-generation-webui/sc_venv_template/venv/lib/python3.10/site-packages/deepspeed/launcher/runner.py", line 556, in main
    result.wait()
  File "/easybuild/2020/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/subprocess.py", line 1204, in wait
    return self._wait(timeout=timeout)
  File "/easybuild/2020/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/subprocess.py", line 1938, in _wait
[2023-06-01 20:27:02,133] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 350801
    (pid, sts) = self._try_wait(0)
  File "/easybuild/2020/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/subprocess.py", line 1896, in _try_wait
    (pid, sts) = os.waitpid(self.pid, wait_flags)
  File "/p/haicluster/llama/text-generation-webui/sc_venv_template/venv/lib/python3.10/site-packages/deepspeed/launcher/runner.py", line 548, in sigkill_handler
    result_kill = subprocess.Popen(kill_cmd, env=env)
NameError: free variable 'kill_cmd' referenced before assignment in enclosing scope
    server.run_in_thread()
  File "/p/haicluster/llama/text-generation-webui/sc_venv_template/venv/lib/python3.10/site-packages/gradio/networking.py", line 39, in run_in_thread
    server.run_in_thread()
  File "/p/haicluster/llama/text-generation-webui/sc_venv_template/venv/lib/python3.10/site-packages/gradio/networking.py", line 39, in run_in_thread
    time.sleep(1e-3)
KeyboardInterrupt
    time.sleep(1e-3)
KeyboardInterrupt
[2023-06-01 20:27:02,195] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 350802
[2023-06-01 20:27:02,256] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 350803
[2023-06-01 20:27:02,315] [INFO] [launch.py:437:sigkill_handler] Main process received SIGTERM, exiting
srun: error: haicluster3: task 0: Exited with exit code 1
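
A note on the repeated `TypeError` in the log above: it is raised at the line `args[1].msg = color + args[1].msg + '\x1b[0m'` because `logger.error(exc)` passes an exception object (here an `OSError`) as the record message, and a `str` cannot be concatenated with it. This secondary error also hides the original `OSError` text. Below is a minimal sketch of one possible guard, assuming the wrapper in `modules/logging_colors.py` looks roughly like that line. The names `colorize_emit`, `ColorHandler` and `color_demo`, and the error text, are made up for illustration and are not taken from the project.

```python
import io
import logging

RESET = "\x1b[0m"
RED = "\x1b[31m"


def colorize_emit(emit):
    """Wrap a handler's emit() so the record message is wrapped in ANSI colors."""
    def new(handler, record):
        color = RED if record.levelno >= logging.ERROR else ""
        # str() guards against non-string messages such as logger.error(exc);
        # without it, str + OSError raises the TypeError seen in the log.
        record.msg = color + str(record.msg) + RESET
        return emit(handler, record)
    return new


class ColorHandler(logging.StreamHandler):
    # Hypothetical handler used only for this demonstration.
    emit = colorize_emit(logging.StreamHandler.emit)


# Demonstration: log an exception object directly, as uvicorn does in the trace.
stream = io.StringIO()
logger = logging.getLogger("color_demo")
logger.propagate = False
logger.addHandler(ColorHandler(stream))
logger.error(OSError("example failure"))  # placeholder text, not the real error
output = stream.getvalue()
```

With a guard like this the underlying `OSError` message would be logged instead of being replaced by the `TypeError`, which should make the actual startup failure easier to diagnose.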

Jun 01 '23 20:06 surak

Has anyone tested DeepSpeed on an Intel AMX-capable CPU (4th-gen Xeon, Sapphire Rapids)?

Sep 07 '23 14:09 levicki
