text-generation-webui
Implement ZeRO inference
Some information Seems like ZeRO inference could improve the performance of offloading to RAM/NVMe. I don't know if huggingface's accelerate is already using it, but if not, it would be a great feature to add.
This is definitely worth looking into. There is also ONNX, which has a similar promise of improving inference speeds.
I have never really understood how to use those engines and how production-ready they are compared to transformers.
Here's a draft with some ideas: #43
There would be 3 new arguments for server.py:
- --deepspeed enables DeepSpeed ZeRO-3 inference with CPU offloading
- --nvme-offload-dir optionally points to an offload directory that should be on an NVMe drive
- --bf16 for bfloat16 precision if you have an Ampere GPU; otherwise omit it and float16 is used
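A minimal sketch of how the three flags described above could be declared with argparse. The flag names come from the draft; the help strings, defaults, and the `build_parser` helper are assumptions for illustration and may not match what server.py actually does.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser holding only the three DeepSpeed-related flags."""
    parser = argparse.ArgumentParser(description="DeepSpeed flags (illustrative)")
    parser.add_argument(
        "--deepspeed",
        action="store_true",
        help="Enable DeepSpeed ZeRO-3 inference with CPU offloading.",
    )
    parser.add_argument(
        "--nvme-offload-dir",
        type=str,
        default=None,
        help="Optional offload directory, ideally on an NVMe drive.",
    )
    parser.add_argument(
        "--bf16",
        action="store_true",
        help="Use bfloat16 instead of float16 (Ampere or newer GPUs).",
    )
    return parser


if __name__ == "__main__":
    # argparse converts --nvme-offload-dir into the attribute nvme_offload_dir.
    args = build_parser().parse_args(["--deepspeed", "--nvme-offload-dir", "/mnt/offload"])
    print(args.deepspeed, args.nvme_offload_dir, args.bf16)
```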
DeepSpeed must be installed: pip install deepspeed==0.8.0
Using --nvme-offload-dir requires installing libaio-dev. For example: apt install libaio-dev
Running should be done with the DeepSpeed launcher:
deepspeed --num_gpus=1 server.py --cai-chat --deepspeed
For NVME offloading:
deepspeed --num_gpus=1 server.py --cai-chat --deepspeed --nvme-offload-dir /mnt/offload
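For reference, a rough sketch of the kind of DeepSpeed config dictionary these flags might translate into. The key names (`zero_optimization`, `stage`, `offload_param`, `nvme_path`, `pin_memory`, `fp16`, `bf16`, `train_micro_batch_size_per_gpu`) follow DeepSpeed's documented config schema; the `build_zero3_config` helper and the chosen values are assumptions, and the config in the draft may differ.

```python
from typing import Any, Dict, Optional


def build_zero3_config(bf16: bool = False,
                       nvme_offload_dir: Optional[str] = None) -> Dict[str, Any]:
    """Build an illustrative ZeRO-3 inference config (hypothetical helper)."""
    if nvme_offload_dir is not None:
        # Offload parameters to an NVMe-backed directory.
        offload_param = {
            "device": "nvme",
            "nvme_path": nvme_offload_dir,
            "pin_memory": True,
        }
    else:
        # Default: offload parameters to CPU RAM.
        offload_param = {"device": "cpu", "pin_memory": True}

    return {
        "fp16": {"enabled": not bf16},
        "bf16": {"enabled": bf16},
        "zero_optimization": {
            "stage": 3,
            "offload_param": offload_param,
        },
        # DeepSpeed expects a batch size entry even for inference; 1 is assumed here.
        "train_micro_batch_size_per_gpu": 1,
    }


if __name__ == "__main__":
    print(build_zero3_config(nvme_offload_dir="/mnt/offload"))
```

Exactly one of `fp16` and `bf16` is enabled here, mirroring the behavior described for --bf16.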
YMMV. While I haven't tested multi-GPU setups yet, in my tests the VRAM usage on a single card was greatly optimized.
That's very exciting. I will test and merge it later today.
Other than VRAM usage, did you see a noticeable improvement in the text generation speed?
I've seen the opposite, probably due to the partitioning that happens under ZeRO-3. It seems like using this would only make sense if you have large models to load, or if you want to make use of multiple GPUs. The Hugging Face docs do warn about performance and give a few more tuning tips here:
It’s important to understand that ZeRO-3 enables a much higher scalability capacity at a cost of speed. https://huggingface.co/docs/transformers/main_classes/deepspeed#zero2-vs-zero3-performance
Another thing which is a bit confusing is that the ZeRO-Inference that's integrated into transformers (and tried above in that commit) is a different technology than the new DeepSpeed-Inference.
Here is a good rundown:
ZeRO-Inference is designed for inference applications that require GPU acceleration but lack sufficient GPU memory to host the model. Also, ZeRO-Inference is optimized for inference applications that are throughput-oriented and allow large batch sizes. Alternative techniques, such as Accelerate, DeepSpeed-Inference, and DeepSpeed-MII that fit the entire model into GPU memory, possibly using multiple GPUs, are more suitable for inference applications that are latency sensitive or have small batch sizes. https://www.deepspeed.ai/2022/09/09/zero-inference.html#when-to-use-zero-inference
And an interesting article: https://towardsdatascience.com/deepspeed-deep-dive-model-implementations-for-inference-mii-b02aa5d5e7f7
Now, I've also tried DeepSpeed-Inference briefly, but it has a number of bugs that are being worked on regarding split Hugging Face checkpoints (like Pygmalion 6B), bad output, and model incompatibility. Worth keeping an eye on, however.
I have accepted the PR and have some observations:
- With Pygmalion-6b, the VRAM usage is decreased from 12GB to something like 5.5GB, and the performance seems to be better than what would be obtained with --auto-devices --gpu-memory 6.
- While trying to load a very large model without --nvme-offload-dir, my system froze due to lack of RAM.
- I couldn't use --nvme-offload-dir because it threw an error about me not having GLIBCXX_3.4.30 installed.
I couldn't use --nvme-offload-dir because it threw an error about me not having GLIBCXX_3.4.30 installed.
Could be something to do with Conda, try:
$ conda install -c conda-forge gcc
While trying to load a very large model without --nvme-offload-dir, my system froze due to lack of RAM
I've seen that. Limiting the memory with cgroups can help:
$ systemd-run --user --scope -p MemoryHigh=15G -p MemoryMax=16G bash
$ conda activate textgen
$ deepspeed --num_gpus=1 server.py --model pygmalion-6b --cai-chat --deepspeed
(.....)
DeepSpeed ZeRO-3 is enabled: True
Loaded the model in 98.31 seconds.
Running on local URL: http://127.0.0.1:7860
To create a public link, set `share=True` in `launch()`.
Dialogue tokenized to:
| So how did you get into computer engineering?
Installing the latest gcc version with conda worked, but then --nvme-offload-dir threw a generic error about jit software/hardware incompatibility when I tried to load a model.
As for limiting the maximum RAM with systemd-run, that caused deepspeed to become unresponsive and never load the model, even after several minutes.
--nvme-offload-dir threw a generic error about jit software/hardware incompatibility
Strange, could you check what ds_report says?
As for limiting the maximum RAM with systemd-run, that caused deepspeed to become unresponsive and never load the model, even after several minutes.
Admittedly the test above was with having 8GB of swap available and loading was much, much slower (nearly 2 minutes).
I want to verify if using models that are sharded into smaller chunks really makes a difference for the initial RAM requirement.
Is there any actual benefit to using bfloat16 if the card supports it (Ampere & Lovelace)? Better output? Better speed?
@Manimap, the docs claim it's faster. There's also a caveat for fp16:
enable bf16 if you own an Ampere or a newer GPU to make things faster. If you don’t have that hardware you may enable fp16 as long as you don’t use any model that was pre-trained in bf16 mixed precision (such as most t5 models). These usually overflow in fp16 and you will see garbage as output.
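To make the overflow caveat concrete, here is a small standard-library sketch. It converts a value to fp16 using the `struct` module's half-precision format, and emulates bf16 by keeping only the upper 16 bits of a float32 (simple truncation, which is an approximation of real bf16 rounding). The helper names are made up for this example.

```python
import struct


def to_fp16(x: float) -> float:
    """Round-trip x through IEEE half precision; overflow becomes inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses to pack values beyond the fp16 range (max 65504).
        return float("inf")


def to_bf16(x: float) -> float:
    """Emulate bfloat16 by truncating a float32 to its upper 16 bits."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]


if __name__ == "__main__":
    value = 70000.0
    print(to_fp16(value))  # inf: outside the fp16 range
    print(to_bf16(value))  # 69632.0: coarser precision, but no overflow
```

bf16 keeps float32's 8-bit exponent and gives up mantissa bits instead, which is why a value that overflows in fp16 stays finite in bf16.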
Sharding appears to help.
For instance, trying to load the unsharded OPT-13B-Erebus model with 30GB of CPU RAM, 8GB of swap and NVME offloading led to OOM.
$ ls models/OPT-13B-Erebus
config.json LICENSE.md merges.txt pytorch_model.bin README.md special_tokens_map.json tokenizer_config.json vocab.json
$ systemd-run --user --scope -p MemoryHigh=28G -p MemoryMax=30G bash
$ /usr/bin/time -f %M deepspeed --num_gpus=1 server.py --model OPT-13B-Erebus --notebook --deepspeed --nvme-offload-dir /mnt/offload/
[] [INFO] [launch.py:162:main] dist_world_size=1
[] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0
[] [INFO] [comm.py:657:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Loading OPT-13B-Erebus...
[] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 21388
[] [ERROR] [launch.py:324:sigkill_handler] ['/usr/bin/python3', '-u', 'server.py', '--local_rank=0', '--model', 'OPT-13B-Erebus', '--notebook', '--deepspeed', '--nvme-offload-dir', '/mnt/offload/'] exits with return code = -9
Command exited with non-zero status 247
30723720
OPT-13B-Erebus sharded into 1GB chunks on the other hand could be loaded and the peak RAM usage looked better.
$ ls models/OPT-13B-Erebus-sharded
config.json pytorch_model-00005-of-00028.bin pytorch_model-00011-of-00028.bin pytorch_model-00017-of-00028.bin pytorch_model-00023-of-00028.bin pytorch_model.bin.index.json
merges.txt pytorch_model-00006-of-00028.bin pytorch_model-00012-of-00028.bin pytorch_model-00018-of-00028.bin pytorch_model-00024-of-00028.bin special_tokens_map.json
pytorch_model-00001-of-00028.bin pytorch_model-00007-of-00028.bin pytorch_model-00013-of-00028.bin pytorch_model-00019-of-00028.bin pytorch_model-00025-of-00028.bin tokenizer_config.json
pytorch_model-00002-of-00028.bin pytorch_model-00008-of-00028.bin pytorch_model-00014-of-00028.bin pytorch_model-00020-of-00028.bin pytorch_model-00026-of-00028.bin tokenizer.json
pytorch_model-00003-of-00028.bin pytorch_model-00009-of-00028.bin pytorch_model-00015-of-00028.bin pytorch_model-00021-of-00028.bin pytorch_model-00027-of-00028.bin vocab.json
pytorch_model-00004-of-00028.bin pytorch_model-00010-of-00028.bin pytorch_model-00016-of-00028.bin pytorch_model-00022-of-00028.bin pytorch_model-00028-of-00028.bin
$ systemd-run --user --scope -p MemoryHigh=28G -p MemoryMax=30G bash
$ deepspeed --num_gpus=1 server.py --model OPT-13B-Erebus-sharded --notebook --deepspeed --nvme-offload-dir /mnt/offload/
[] [INFO] [launch.py:162:main] dist_world_size=1
[] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0
[] [INFO] [comm.py:657:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Loading OPT-13B-Erebus-sharded...
[] [INFO] [partition_parameters.py:413:__exit__] finished initializing model with 13.11B parameters
DeepSpeed ZeRO-3 is enabled: True
Loaded the model in 86.68 seconds.
Running on local URL: http://127.0.0.1:7860
To create a public link, set `share=True` in `launch()`.
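A back-of-the-envelope estimate helps explain the difference between the two runs. This sketch only counts the raw weights at 2 bytes per parameter (fp16 or bf16); it ignores loader overhead, temporary copies, and anything else in the process, so treat the number as a lower bound rather than a measurement.

```python
def weight_gib(n_params: float, bytes_per_param: int = 2) -> float:
    """Approximate size of the raw model weights in GiB."""
    return n_params * bytes_per_param / 2**30


if __name__ == "__main__":
    # 13.11B is the parameter count reported in the DeepSpeed log above.
    print(f"{weight_gib(13.11e9):.1f} GiB")  # 24.4 GiB
```

An unsharded pytorch_model.bin has to be read as a whole, so roughly this much RAM is needed at once, which is already close to the 30G limit. With 1GB shards, only one shard's worth of weights has to be resident at a time during loading.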
This is useful to know @81300
Could this be used to instantiate models on Colab without a huge RAM usage? If so, it could be possible to initialize a new notebook by only installing the requirements.txt with pip, without the need to install conda and pytorch from scratch.
Alright thanks, so faster for those who can run it, and maybe some pros of training models in this "mixed precision" in particular.
@oobabooga,
I tested Colab today. For larger models the ZeRO-3 CPU/NVME offloading makes heavy use of CPU RAM anyway. Google's safety mechanisms seem very sensitive, they will kill your process even if DeepSpeed would not actually run out of memory.
You can't use cgroups to throttle properly because the Colab runtime is within an unprivileged container. For that same reason you cannot create swap. The DeepSpeed config doesn't provide any knobs for max RAM to offload with (they have an open issue).
That said, you can of course disable offloading entirely and successfully instantiate a sharded Pygmalion 6B model onto the GPU with ZeRO-3. This requires very little CPU RAM - just the size of the biggest shard. But in this scenario the Nvidia T4 will run out of VRAM once you begin generating text. Maybe tuning allgather_bucket_size and reduce_bucket_size could reduce the VRAM footprint.
By the way, I discovered that a presharded Pygmalion 6B consisting of 2GB chunks instantiates just fine on the free Colab w/o DeepSpeed, 8-bit mode (https://github.com/oobabooga/text-generation-webui/issues/14#issuecomment-1402981335), auto-devices or Conda. Inference works. However the sharding must be done on a system with sufficient memory so I had to rehost the model (not ideal).
You can play with a test notebook here. As you suggested, it doesn't install Conda and therefore loads up much quicker!
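To illustrate what sharding a checkpoint amounts to, here is a simplified greedy planner that assigns tensors to shard files under a size cap and produces a name-to-file map similar in spirit to pytorch_model.bin.index.json. This is not the transformers implementation; the tensor names and sizes are hypothetical, and only the file naming pattern is taken from the directory listing earlier in the thread.

```python
from typing import Dict, List


def plan_shards(tensor_sizes: Dict[str, int], max_shard_bytes: int) -> Dict[str, str]:
    """Greedily pack tensors (in order) into shards of at most max_shard_bytes."""
    shards: List[List[str]] = [[]]
    current = 0
    for name, size in tensor_sizes.items():
        # Start a new shard when the next tensor would exceed the cap.
        # A single oversized tensor still gets a shard of its own.
        if shards[-1] and current + size > max_shard_bytes:
            shards.append([])
            current = 0
        shards[-1].append(name)
        current += size

    total = len(shards)
    weight_map: Dict[str, str] = {}
    for index, names in enumerate(shards, start=1):
        filename = f"pytorch_model-{index:05d}-of-{total:05d}.bin"
        for name in names:
            weight_map[name] = filename
    return weight_map


if __name__ == "__main__":
    # Hypothetical tensor names with sizes in bytes.
    sizes = {"wte": 3, "layer0": 3, "layer1": 3, "lm_head": 1}
    for tensor, shard in plan_shards(sizes, max_shard_bytes=6).items():
        print(tensor, "->", shard)
```

Because a loader can read one shard, place its tensors, and release it before moving on, peak CPU RAM during loading is bounded by roughly the largest shard rather than the whole model.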
@81300 with your resharded+safetensors rehost, the Colab loading times for pygmalion-6b have been reduced from 12 minutes to 5 minutes. Amazing! Thank you so much for this.
Indeed, using a rehost is not as pretty as lazy loading the model from disk the way the Kobold client does, but at the same time this is allowed by the creativeml-openrail-m license of pygmalion.
Google's safety mechanisms seem very sensitive, they will kill your process even if DeepSpeed would not actually run out of memory.
Yes, I have also noticed that. It's very annoying.
ZeRO-3 was not necessary for Colab for now, but maybe it will be later. On your computer, are you using it as your default way of offloading layers (instead of --auto-devices)?
Just saying, but I made a pytorch bin file to safetensor converter that runs locally based on this if anyone is interested: pytorch-to-safetensor-converter
@Silver267 I am interested, thank you for making this.
On your computer, are you using it as your default way of offloading layers (instead of --auto-devices)?
Yes, I've been using it for CPU offloading mostly. In --cai-chat mode it made long character contexts more manageable w/o running out of VRAM.
@Silver267 - nice. In case it's useful for your project, I resharded Pygmalion using this.
@81300 Thanks for the information! Though the code doesn't seem to support ram offload (my vram is 8gb), it would still be a useful reference.
I have accepted the PR and have some observations:
* With Pygmalion-6b, the VRAM usage is decreased from 12GB to something like 5.5GB, and the performance seems to be better than what would be obtained with `--auto-devices --gpu-memory 6`.
* While trying to load a very large model without `--nvme-offload-dir`, my system froze due to lack of RAM.
* I couldn't use `--nvme-offload-dir` because it threw an error about me not having `GLIBCXX_3.4.30` installed.
I am also getting the same GLIBCXX_3.4.30 error. Then I got the same generic error about jit software/hardware incompatibility too. Are you on Arch btw?
I found that libaio has issues with DeepSpeed on Arch Linux.
For some reason, when I run with deepspeed I get:
[2023-02-21 02:51:30,191] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0
Warning: chat mode currently becomes somewhat slower with text streaming on.
Consider starting the web UI with the --no-stream option.
[2023-02-21 02:51:33,044] [INFO] [comm.py:657:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Loading pygmalion-6b...
[2023-02-21 02:51:35,197] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 60
[2023-02-21 02:51:35,197] [ERROR] [launch.py:324:sigkill_handler] ['/usr/local/bin/python', '-u', 'server.py', '--local_rank=0', '--deepspeed', '--gpu-memory', '10', '--cai-chat', '--model=pygmalion-6b'] exits with return code = -11
I'm running inside Docker on WSL2. This happens on the line:
model = AutoModelForCausalLM.from_pretrained(Path(f"models/{model_name}"), torch_dtype=torch.bfloat16 if args.bf16 else torch.float16)
Since ZeRO inference is implemented and seems to be working, closing this issue. Please open another issue if there are other problems.
This doesn't work in a multi-gpu setup because all the multiple MPI instances of server.py try to bind to the web port and will fail.
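One possible direction, sketched under assumptions: the DeepSpeed launcher exports a LOCAL_RANK environment variable to each process it starts, so the web server could be started only on local rank 0. The `should_start_web_ui` helper is hypothetical and is not part of server.py.

```python
import os
from typing import Mapping, Optional


def should_start_web_ui(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True only for the process with local rank 0.

    Falls back to rank 0 when LOCAL_RANK is unset (plain single-process run).
    """
    env = os.environ if env is None else env
    return int(env.get("LOCAL_RANK", "0")) == 0


if __name__ == "__main__":
    if should_start_web_ui():
        print("rank 0: would bind the web port here")
    else:
        print("other rank: skipping the web server")
```

This only avoids the port conflict. With ZeRO-3 the parameters are partitioned across ranks, so every rank still has to take part in each forward pass; the non-zero ranks would need some way to receive the same inputs and run generation in lockstep, which this sketch does not cover.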
srun --nodes=1 --cpus-per-task 16 --gres=gpu:4 --pty ./run.sh
The activation script must be sourced, otherwise the virtual environment will not work.
Setting vars
[2023-06-01 20:26:11,291] [WARNING] [runner.py:191:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
Detected CUDA_VISIBLE_DEVICES=0,1,2,3 but ignoring it because one or several of --include/--exclude/--num_gpus/--num_nodes cl args were used. If you want to use CUDA_VISIBLE_DEVICES don't pass any of these arguments to deepspeed.
[2023-06-01 20:26:11,409] [INFO] [runner.py:541:main] cmd = /p/haicluster/llama/text-generation-webui/sc_venv_template/venv/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None ./server.py --deepspeed --chat --threads 24 --listen-host 0.0.0.0 --listen-port 5000 --listen --xformers --sdp-attention --trust-remote-code
[2023-06-01 20:26:18,273] [INFO] [launch.py:222:main] 0 EBVERSIONNCCL=2.12.12
[2023-06-01 20:26:18,273] [INFO] [launch.py:222:main] 0 EBROOTNCCL=/easybuild/2020/software/NCCL/2.12.12-GCCcore-11.3.0-CUDA-11.7.0
[2023-06-01 20:26:18,273] [INFO] [launch.py:222:main] 0 EBDEVELNCCL=/easybuild/2020/software/NCCL/2.12.12-GCCcore-11.3.0-CUDA-11.7.0/easybuild/NCCL-2.12.12-GCCcore-11.3.0-CUDA-11.7.0-easybuild-devel
[2023-06-01 20:26:18,273] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
[2023-06-01 20:26:18,273] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=4, node_rank=0
[2023-06-01 20:26:18,273] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2023-06-01 20:26:18,273] [INFO] [launch.py:247:main] dist_world_size=4
[2023-06-01 20:26:18,273] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
WARNING:trust_remote_code is enabled. This is dangerous.
WARNING:trust_remote_code is enabled. This is dangerous.
WARNING:trust_remote_code is enabled. This is dangerous.
WARNING:trust_remote_code is enabled. This is dangerous.
[2023-06-01 20:26:31,737] [INFO] [comm.py:622:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
bin /p/haicluster/llama/text-generation-webui/sc_venv_template/venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so
bin /p/haicluster/llama/text-generation-webui/sc_venv_template/venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so
bin /p/haicluster/llama/text-generation-webui/sc_venv_template/venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so
bin /p/haicluster/llama/text-generation-webui/sc_venv_template/venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so
INFO:Loading settings from settings.json...
INFO:Loading settings from settings.json...
INFO:Loading settings from settings.json...
INFO:Loading settings from settings.json...
INFO:Loading the extension "gallery"...
INFO:Loading the extension "gallery"...
INFO:Loading the extension "gallery"...
INFO:Loading the extension "gallery"...
Running on local URL: http://0.0.0.0:5000
To create a public link, set `share=True` in `launch()`.
ERROR: Traceback (most recent call last):
File "/p/haicluster/llama/text-generation-webui/sc_venv_template/venv/lib/python3.10/site-packages/uvicorn/server.py", line 161, in startup
server = await loop.create_server(
File "/easybuild/2020/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/asyncio/base_events.py", line 1505, in create_server
raise OSError(err.errno, 'error while attempting '
OSError: [Errno 98] error while attempting to bind on address ('0.0.0.0', 5000): address already in use
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/easybuild/2020/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/easybuild/2020/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/asyncio/base_events.py", line 646, in run_until_complete
return future.result()
File "/p/haicluster/llama/text-generation-webui/sc_venv_template/venv/lib/python3.10/site-packages/uvicorn/server.py", line 78, in serve
await self.startup(sockets=sockets)
File "/p/haicluster/llama/text-generation-webui/sc_venv_template/venv/lib/python3.10/site-packages/uvicorn/server.py", line 169, in startup
logger.error(exc)
File "/easybuild/2020/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/logging/__init__.py", line 1506, in error
self._log(ERROR, msg, args, **kwargs)
File "/easybuild/2020/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/logging/__init__.py", line 1624, in _log
self.handle(record)
File "/easybuild/2020/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/logging/__init__.py", line 1634, in handle
self.callHandlers(record)
File "/easybuild/2020/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/logging/__init__.py", line 1696, in callHandlers
hdlr.handle(record)
File "/easybuild/2020/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/logging/__init__.py", line 968, in handle
self.emit(record)
File "/p/haicluster/llama/text-generation-webui/modules/logging_colors.py", line 96, in new
args[1].msg = color + args[1].msg + '\x1b[0m' # normal
TypeError: can only concatenate str (not "OSError") to str
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/p/haicluster/llama/text-generation-webui/sc_venv_template/venv/lib/python3.10/site-packages/starlette/routing.py", line 686, in lifespan
await receive()
File "/p/haicluster/llama/text-generation-webui/sc_venv_template/venv/lib/python3.10/site-packages/uvicorn/lifespan/on.py", line 139, in receive
return await self.receive_queue.get()
File "/easybuild/2020/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/asyncio/queues.py", line 159, in get
await getter
asyncio.exceptions.CancelledError
Exception in thread Thread-1 (run):
Traceback (most recent call last):
File "/p/haicluster/llama/text-generation-webui/sc_venv_template/venv/lib/python3.10/site-packages/uvicorn/server.py", line 161, in startup
server = await loop.create_server(
File "/easybuild/2020/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/asyncio/base_events.py", line 1505, in create_server
raise OSError(err.errno, 'error while attempting '
OSError: [Errno 98] error while attempting to bind on address ('0.0.0.0', 5000): address already in use
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/easybuild/2020/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/threading.py", line 1009, in _bootstrap_inner
self.run()
File "/easybuild/2020/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/threading.py", line 946, in run
self._target(*self._args, **self._kwargs)
File "/p/haicluster/llama/text-generation-webui/sc_venv_template/venv/lib/python3.10/site-packages/uvicorn/server.py", line 61, in run
return asyncio.run(self.serve(sockets=sockets))
File "/easybuild/2020/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/easybuild/2020/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/asyncio/base_events.py", line 646, in run_until_complete
return future.result()
File "/p/haicluster/llama/text-generation-webui/sc_venv_template/venv/lib/python3.10/site-packages/uvicorn/server.py", line 78, in serve
await self.startup(sockets=sockets)
File "/p/haicluster/llama/text-generation-webui/sc_venv_template/venv/lib/python3.10/site-packages/uvicorn/server.py", line 169, in startup
logger.error(exc)
File "/easybuild/2020/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/logging/__init__.py", line 1506, in error
self._log(ERROR, msg, args, **kwargs)
File "/easybuild/2020/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/logging/__init__.py", line 1624, in _log
self.handle(record)
File "/easybuild/2020/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/logging/__init__.py", line 1634, in handle
self.callHandlers(record)
File "/easybuild/2020/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/logging/__init__.py", line 1696, in callHandlers
hdlr.handle(record)
File "/easybuild/2020/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/logging/__init__.py", line 968, in handle
self.emit(record)
File "/p/haicluster/llama/text-generation-webui/modules/logging_colors.py", line 96, in new
args[1].msg = color + args[1].msg + '\x1b[0m' # normal
TypeError: can only concatenate str (not "OSError") to str
^C[2023-06-01 20:27:02,022] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 350800
Traceback (most recent call last):
Traceback (most recent call last):
File "/p/haicluster/llama/text-generation-webui/./server.py", line 1109, in <module>
File "/p/haicluster/llama/text-generation-webui/./server.py", line 1109, in <module>
Traceback (most recent call last):
File "/p/haicluster/llama/text-generation-webui/./server.py", line 1109, in <module>
Traceback (most recent call last):
File "/p/haicluster/llama/text-generation-webui/./server.py", line 1111, in <module>
[2023-06-01 20:27:02,024] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 350800
create_interface()
File "/p/haicluster/llama/text-generation-webui/./server.py", line 1014, in create_interface
create_interface()
File "/p/haicluster/llama/text-generation-webui/./server.py", line 1014, in create_interface
create_interface()
File "/p/haicluster/llama/text-generation-webui/./server.py", line 1014, in create_interface
time.sleep(0.5)
KeyboardInterrupt
shared.gradio['interface'].launch(prevent_thread_lock=True, share=shared.args.share, server_name=shared.args.listen_host or '0.0.0.0', server_port=shared.args.listen_port, inbrowser=shared.args.auto_launch, auth=auth)
File "/p/haicluster/llama/text-generation-webui/sc_venv_template/venv/lib/python3.10/site-packages/gradio/blocks.py", line 1743, in launch
shared.gradio['interface'].launch(prevent_thread_lock=True, share=shared.args.share, server_name=shared.args.listen_host or '0.0.0.0', server_port=shared.args.listen_port, inbrowser=shared.args.auto_launch, auth=auth)
File "/p/haicluster/llama/text-generation-webui/sc_venv_template/venv/lib/python3.10/site-packages/gradio/blocks.py", line 1743, in launch
shared.gradio['interface'].launch(prevent_thread_lock=True, share=shared.args.share, server_name=shared.args.listen_host or '0.0.0.0', server_port=shared.args.listen_port, inbrowser=shared.args.auto_launch, auth=auth)
File "/p/haicluster/llama/text-generation-webui/sc_venv_template/venv/lib/python3.10/site-packages/gradio/blocks.py", line 1743, in launch
server_name, server_port, local_url, app, server = networking.start_server(
File "/p/haicluster/llama/text-generation-webui/sc_venv_template/venv/lib/python3.10/site-packages/gradio/networking.py", line 161, in start_server
server_name, server_port, local_url, app, server = networking.start_server(
File "/p/haicluster/llama/text-generation-webui/sc_venv_template/venv/lib/python3.10/site-packages/gradio/networking.py", line 161, in start_server
Traceback (most recent call last):
File "/p/haicluster/llama/text-generation-webui/sc_venv_template/venv/bin/deepspeed", line 6, in <module>
[2023-06-01 20:27:02,123] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 350800
main()
File "/p/haicluster/llama/text-generation-webui/sc_venv_template/venv/lib/python3.10/site-packages/deepspeed/launcher/runner.py", line 556, in main
result.wait()
File "/easybuild/2020/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/subprocess.py", line 1204, in wait
return self._wait(timeout=timeout)
File "/easybuild/2020/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/subprocess.py", line 1938, in _wait
[2023-06-01 20:27:02,133] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 350801
(pid, sts) = self._try_wait(0)
File "/easybuild/2020/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/subprocess.py", line 1896, in _try_wait
(pid, sts) = os.waitpid(self.pid, wait_flags)
File "/p/haicluster/llama/text-generation-webui/sc_venv_template/venv/lib/python3.10/site-packages/deepspeed/launcher/runner.py", line 548, in sigkill_handler
result_kill = subprocess.Popen(kill_cmd, env=env)
NameError: free variable 'kill_cmd' referenced before assignment in enclosing scope
server.run_in_thread()
File "/p/haicluster/llama/text-generation-webui/sc_venv_template/venv/lib/python3.10/site-packages/gradio/networking.py", line 39, in run_in_thread
server.run_in_thread()
File "/p/haicluster/llama/text-generation-webui/sc_venv_template/venv/lib/python3.10/site-packages/gradio/networking.py", line 39, in run_in_thread
time.sleep(1e-3)
KeyboardInterrupt
time.sleep(1e-3)
KeyboardInterrupt
[2023-06-01 20:27:02,195] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 350802
[2023-06-01 20:27:02,256] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 350803
[2023-06-01 20:27:02,315] [INFO] [launch.py:437:sigkill_handler] Main process received SIGTERM, exiting
srun: error: haicluster3: task 0: Exited with exit code 1
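Reading the traceback above, the repeated `TypeError` comes from `modules/logging_colors.py`, line 96. That line concatenates a color code with `args[1].msg`, but `logger.error(exc)` passes the exception object itself (here an `OSError`) as the message, so `msg` is not a `str`. Below is a minimal sketch of a guard, not the project's actual fix. The names `add_coloring_to_emit` and `ColorHandler`, the logger name, and the error text are hypothetical and exist only for illustration. The idea is simply to convert the message with `str()` before concatenating.

```python
import io
import logging

RED = "\x1b[31m"
RESET = "\x1b[0m"


def add_coloring_to_emit(fn):
    """Wrap a handler's emit() so the record message is wrapped in ANSI colors."""
    def new(handler, record):
        # Simplified color choice for this sketch: red for ERROR and above.
        color = RED if record.levelno >= logging.ERROR else RESET
        # record.msg can be any object, e.g. an exception passed to
        # logger.error(exc), so convert it to str before concatenating.
        record.msg = color + str(record.msg) + RESET
        return fn(handler, record)
    return new


class ColorHandler(logging.StreamHandler):
    # Hypothetical handler; subclassing avoids patching StreamHandler globally.
    emit = add_coloring_to_emit(logging.StreamHandler.emit)


# Reproduce the failing call pattern: an exception object used as the message.
stream = io.StringIO()
logger = logging.getLogger("color_demo")
logger.propagate = False
logger.setLevel(logging.INFO)
logger.addHandler(ColorHandler(stream))
logger.error(OSError("example failure"))
output = stream.getvalue()
```

With the `str()` conversion the record is emitted instead of raising inside the logging call, so the underlying `OSError` that uvicorn was trying to report becomes visible in the log.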
Has anyone tested DeepSpeed on an Intel AMX-capable CPU (4th-gen Xeon, Sapphire Rapids)?