It runs. These are the notebook cells I used:
!git clone https://github.com/NetEase-FuXi/EETQ.git
%cd EETQ/
!git submodule update --init --recursive
!wget https://github.com/NetEase-FuXi/EETQ/releases/download/v1.0.1/EETQ-1.0.1-cp310-cp310-linux_x86_64.whl
!pip install /content/EETQ-1.0.1-cp310-cp310-linux_x86_64.whl
!huggingface-cli login --token xxxxxxxxxxxxx
from transformers import AutoModelForCausalLM, EetqConfig

model_name = "meta-llama/Llama-2-7b-chat-hf"
quantization_config = EetqConfig("int8")
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", quantization_config=quantization_config)

quant_path = "/content/a"
model.save_pretrained(quant_path)
model = AutoModelForCausalLM.from_pretrained(quant_path, device_map="auto")
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, EetqConfig

# Load the model and configuration from the quantized model path
quantization_config = EetqConfig("int8")
model = AutoModelForCausalLM.from_pretrained("/content/a", config=quantization_config, torch_dtype=torch.float16)

# Load the tokenizer using the original model name or path
model_name_or_path = "meta-llama/Llama-2-7b-chat-hf"  # Replace with your original model name/path
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

# Apply EET acceleration
from eetq.utils import eet_accelerator
eet_accelerator(model, quantize=True, fused_attn=True, dev="cuda:0")
model.to("cuda:0")

# Prepare your input text
text = "Who is Napoleon Bonaparte?"

# Tokenize the input text
input_ids = tokenizer(text, return_tensors="pt").input_ids.to("cuda:0")

# Generate text
res = model.generate(input_ids, max_length=128)

# Decode the generated output
print(tokenizer.decode(res[0], skip_special_tokens=True))
low_cpu_mem_usage was None, now default to True since model is quantized.
Loading checkpoint shards: 100% 2/2 [00:00<00:00, 2.49it/s]
[EET][INFO] attention fusion processiong...: 0it [00:00, ?it/s]
[EET][INFO] replace with eet weight quantize only linear...: 100%|██████████| 32/32 [00:00<00:00, 25176.84it/s]
Who is Napoleon Bonaparte?
Napoleon Bonaparte (1769-1821) was a French military and political leader who rose to prominence during the French Revolution and its associated wars. He was Emperor of the French from 1804 until 1815, when he was defeated in the Napoleonic Wars and exiled to the island of Saint Helena, where he died.
Napoleon was born on the island of Corsica and studied at the École Militaire in Paris. He quickly rose through the ranks of the French
text = "Write a Python code to print the letter X 100 times."

# Tokenize the input text
input_ids = tokenizer(text, return_tensors="pt").input_ids.to("cuda:0")

# Generate text
res = model.generate(input_ids, max_length=128)

# Decode the generated output
print(tokenizer.decode(res[0], skip_special_tokens=True))
Write a Python code to print the letter X 100 times.
Here is the solution:
print("X")
for i in range(1, 101):
    print("X")
Explanation:
The print() function is used to print the letter X. The range() function is used to create a sequence of numbers from 1 to 100. The for loop is used to iterate over the sequence and print the letter X for each number.
Alternatively, you can use a list comprehension to print
Does the library support 4-bit quantization?
No, we do not support 4-bit. Quantizing to 4 bits would cause a large degradation in model quality.
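For a rough feel of why fewer bits costs precision, here is a minimal sketch that compares the round-trip error of 8-bit and 4-bit quantization on random weights. It assumes plain symmetric per-row round-to-nearest quantization, which is only an illustration and not necessarily what the EETQ kernels do. The helper names (`quantize_dequantize`, `mean_abs_error`) and the Gaussian test weights are made up for this example, and weight error is only a proxy for end-to-end model quality.

```python
import random


def quantize_dequantize(row, bits):
    """Symmetric per-row round-to-nearest quantization, then dequantization.

    Hypothetical helper for illustration; not part of EETQ.
    """
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits, 7 for 4 bits
    scale = max(abs(w) for w in row) / qmax
    if scale == 0.0:
        return list(row)
    # Round each weight to the nearest integer level, clamp, and map back.
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in row]


def mean_abs_error(weights, bits):
    """Average absolute difference between original and dequantized weights."""
    total = 0.0
    count = 0
    for row in weights:
        deq = quantize_dequantize(row, bits)
        total += sum(abs(a - b) for a, b in zip(row, deq))
        count += len(row)
    return total / count


# Assumed test data: a 64 x 256 matrix of small Gaussian weights.
random.seed(0)
weights = [[random.gauss(0.0, 0.02) for _ in range(256)] for _ in range(64)]

err8 = mean_abs_error(weights, 8)
err4 = mean_abs_error(weights, 4)
print(f"int8 mean abs error: {err8:.6f}")
print(f"int4 mean abs error: {err4:.6f}")
print(f"ratio int4/int8: {err4 / err8:.1f}")
```

With only 15 levels per row instead of 255, the 4-bit step size is much coarser, so the reconstruction error comes out many times larger than for int8 under this simple scheme.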
https://github.com/werruww/EETQ-quantization/blob/main/suc_EETQ.ipynb