
Fast Inference Solutions for BLOOM

24 transformers-bloom-inference issues, sorted by recently updated

I am trying to create a simple chatbot using the bloom-7b1 model (I may use bigger models later), based on bloom-ds-zero-inference.py. Here is my code:

```python
import json
import os
from pathlib import ...
```
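
For readers after the same thing, here is a minimal chat-loop sketch, not the issue author's script: it assumes plain `transformers` + `accelerate` loading (`device_map="auto"`) and skips the DeepSpeed ZeRO launch details that bloom-ds-zero-inference.py handles.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "bigscience/bloom-7b1"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name, torch_dtype=torch.bfloat16, device_map="auto"
)

history = ""
while True:
    history += f"User: {input('You: ')}\nBot:"
    inputs = tokenizer(history, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_p=0.9)
    # decode only the newly generated tokens and cut at the next "User:" turn
    reply = tokenizer.decode(
        out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True
    ).split("User:")[0].strip()
    print("Bot:", reply)
    history += f" {reply}\n"
```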

In the README, the following command is provided to install the required dependencies: `pip install flask flask_api gunicorn pydantic accelerate huggingface_hub>=0.9.0 deepspeed>=0.7.3 deepspeed-mii==0.0.2`. But when executing this command in a shell,...
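
The report is truncated, but one plausible cause (an assumption, not confirmed by the issue): an unquoted `>=` is parsed by the shell as output redirection, so the version-pinned packages need quoting, e.g. `pip install flask flask_api gunicorn pydantic accelerate "huggingface_hub>=0.9.0" "deepspeed>=0.7.3" deepspeed-mii==0.0.2`.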

```
(gh_transformers-bloom-inference) amd00@MZ32-00:~/llm_dev/transformers-bloom-inference$ python bloom-inference-scripts/bloom-accelerate-inference.py --name ~/hf_model/bloom --batch_size 1 --benchmark
Using 0 gpus
Loading model /home/amd00/hf_model/bloom
Traceback (most recent call last):
  File "/home/amd00/llm_dev/transformers-bloom-inference/bloom-inference-scripts/bloom-accelerate-inference.py", line 49, in <module>
    tokenizer = AutoTokenizer.from_pretrained(model_name)
  File "/home/amd00/anaconda3/envs/gh_transformers-bloom-inference/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py",...
```
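
The traceback is cut off, but since `AutoTokenizer.from_pretrained` is failing on a local path, a sanity check on the checkpoint directory can help; a sketch, assuming the problem is missing tokenizer files rather than something else (the `Using 0 gpus` line also suggests PyTorch cannot see the GPUs, which is worth checking separately):

```python
from pathlib import Path
from transformers import AutoTokenizer

model_path = Path.home() / "hf_model" / "bloom"  # the local checkpoint directory

# AutoTokenizer needs tokenizer files next to the weights; list what is missing
# before loading so the failure is easier to diagnose
expected = ["tokenizer.json", "tokenizer_config.json", "config.json"]
missing = [name for name in expected if not (model_path / name).exists()]
if missing:
    raise FileNotFoundError(f"{model_path} is missing {missing}")

tokenizer = AutoTokenizer.from_pretrained(str(model_path))
```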

I am using multiple GPUs to quantize the model and run inference with deepspeed==0.9.0, but it fails. Device: RTX-3090 x 8. Server Docker: [nvidia-pytorch-container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch/tags) with tag `22.07-py3`. Then git clone this codebase...
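
For reference, the kernel-injected int8 path looks roughly like the following; a simplified sketch of the pattern in bloom-inference-scripts/bloom-ds-inference.py, not a drop-in fix for the failure above (the repo's script additionally pairs int8 with the pre-quantized microsoft/bloom-deepspeed-inference-int8 checkpoint via a checkpoint JSON):

```python
# launch with: deepspeed --num_gpus 8 this_script.py
import os

import deepspeed
import torch
from transformers import AutoModelForCausalLM

world_size = int(os.getenv("WORLD_SIZE", "1"))

model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-7b1", torch_dtype=torch.bfloat16
)
model = deepspeed.init_inference(
    model,
    mp_size=world_size,              # tensor-parallel degree across the GPUs
    dtype=torch.int8,                # int8 kernels (see the caveat above)
    replace_with_kernel_inject=True, # swap in DeepSpeed's fused kernels
)
```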

When using Falcon-40B with `bloom-accelerate-inference.py`, the first error I get is "ValueError: The following model_kwargs are not used by the model: ['token_type_ids'] (note: typos in the generate arguments will...
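
A common workaround for this particular ValueError (a sketch, not a change from this repo): Falcon's tokenizer returns `token_type_ids`, which the Falcon causal-LM `generate()` does not accept, so drop the key before generating.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "tiiuae/falcon-40b"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tokenizer("Hello, Falcon", return_tensors="pt").to(model.device)
inputs.pop("token_type_ids", None)  # the kwarg the model complains about
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```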

Why is it said that only ds_zero currently runs world_size streams on world_size GPUs? Shouldn't accelerate and ds-inference be doing the same, since they also...

After executing `make gen-proto` and `make bloom-560m`, I observed that the generated text is just a continuation of the input text. Is it possible to modify it to have a conversational style?...
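
One hypothetical way to get chat-style output from a plain language-model server is to wrap each request in a running dialogue prompt; in the sketch below, the endpoint URL and the payload/response keys are assumptions about the repo's Flask server, so adjust them to match your deployment:

```python
import requests

history: list[str] = []

def chat(user_turn: str) -> str:
    # assemble the whole conversation so far into one prompt
    history.append(f"User: {user_turn}")
    prompt = "\n".join(history) + "\nAssistant:"
    resp = requests.post(
        "http://127.0.0.1:5000/generate/",          # assumed endpoint
        json={"text": [prompt], "max_new_tokens": 64},
        timeout=120,
    )
    # keep only the assistant's turn, cutting at the next "User:" marker
    reply = resp.json()["text"][0].split("User:")[0].strip()
    history.append(f"Assistant: {reply}")
    return reply

print(chat("Hi, who are you?"))
```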

The command `make bloom-560m` executed successfully on CentOS. However, after entering text in the browser and clicking the submit button, the page kept displaying the message "Processing"...
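
To tell a stuck UI apart from a stuck backend, one option is to call the generation endpoint directly and time it; as above, the URL and payload are assumptions based on this repo's Flask server:

```python
import time

import requests

start = time.time()
resp = requests.post(
    "http://127.0.0.1:5000/generate/",        # assumed endpoint
    json={"text": ["hello"], "max_new_tokens": 16},
    timeout=300,
)
# a slow-but-successful response points at the UI; a hang or non-200 status
# points at the server (or a firewall between browser and host)
print(f"{resp.status_code} in {time.time() - start:.1f}s: {resp.text[:200]}")
```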

I was testing DeepSpeed BLOOM inference with bloom-7b1 using 3 GPUs; this error did not occur when I ran the accelerate inference. Command line:

```
deepspeed --include localhost:2,5,6 --module inference_server.benchmark...
```
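
The truncated log does not show the error itself, but a likely suspect with 3 ranks (an assumption about this report): DeepSpeed's tensor parallelism generally needs the attention-head count to divide evenly across GPUs, and bloom-7b1 has 32 heads, which 3 GPUs cannot split. A quick check:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("bigscience/bloom-7b1")
world_size = 3  # --include localhost:2,5,6
assert config.n_head % world_size == 0, (
    f"{config.n_head} attention heads cannot be sharded across {world_size} GPUs; "
    "try 1, 2, 4, or 8 ranks instead"
)
```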

Where can I download bloom-7b? I noticed that int8 quantization is available, but is there an option for int4 quantization? What is the memory overhead for int4 and int8 when...
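
On the download question: the 7B-class checkpoint is published on the Hugging Face Hub as bigscience/bloom-7b1. On memory, a rough weights-only estimate (a sketch that ignores activations, the KV cache, and quantization overhead such as scales):

```python
# weights-only memory: bytes ≈ parameter_count × bits_per_weight / 8
def weight_memory_gib(params_billion: float, bits: int) -> float:
    return params_billion * 1e9 * bits / 8 / 2**30

for bits in (16, 8, 4):
    print(f"bloom-7b1 at {bits}-bit: ~{weight_memory_gib(7.1, bits):.1f} GiB")
```

For bloom-7b1 this works out to roughly 13.2 GiB at 16-bit, 6.6 GiB at int8, and 3.3 GiB at int4, before any runtime overhead.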