!git clone https://github.com/NetEase-FuXi/EETQ.git
%cd EETQ/
!git submodule update --init --recursive

!wget https://github.com/NetEase-FuXi/EETQ/releases/download/v1.0.1/EETQ-1.0.1-cp310-cp310-linux_x86_64.whl
!pip install /content/EETQ-1.0.1-cp310-cp310-linux_x86_64.whl

!huggingface-cli login --token xxxxxxxxxxxxx

from transformers import AutoModelForCausalLM, EetqConfig

model_name = "meta-llama/Llama-2-7b-chat-hf"
quantization_config = EetqConfig("int8")
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", quantization_config=quantization_config)

quant_path = "/content/a"
model.save_pretrained(quant_path)
model = AutoModelForCausalLM.from_pretrained(quant_path, device_map="auto")
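
One way to confirm that the saved directory really holds a quantized checkpoint is to look at its config.json, since transformers records the quantization settings there under a `quantization_config` key when a quantized model is saved. Below is a minimal sketch using only the standard library. The helper name `saved_quant_method` is hypothetical, and the exact fields stored in that entry may differ between transformers versions.

```python
import json
from pathlib import Path


def saved_quant_method(model_dir):
    """Return the quant_method recorded in model_dir/config.json, or None."""
    config_file = Path(model_dir) / "config.json"
    if not config_file.is_file():
        return None
    config = json.loads(config_file.read_text(encoding="utf-8"))
    # Quantized checkpoints carry a "quantization_config" dict; plain ones do not.
    quant = config.get("quantization_config") or {}
    return quant.get("quant_method")
```

For the directory saved above, `saved_quant_method("/content/a")` would be expected to report `eetq`. A `None` result suggests the weights were saved without quantization metadata.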

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, EetqConfig

# Load the model and configuration from the quantized model path
quantization_config = EetqConfig("int8")
model = AutoModelForCausalLM.from_pretrained("/content/a", config=quantization_config, torch_dtype=torch.float16)

# Load the tokenizer using the original model name or path
model_name_or_path = "meta-llama/Llama-2-7b-chat-hf"  # Replace with your original model name/path
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

# Apply EET acceleration
from eetq.utils import eet_accelerator
eet_accelerator(model, quantize=True, fused_attn=True, dev="cuda:0")
model.to("cuda:0")

# Prepare your input text
text = "Who is Napoleon Bonaparte?"

# Tokenize the input text
input_ids = tokenizer(text, return_tensors="pt").input_ids.to("cuda:0")

# Generate text
res = model.generate(input_ids, max_length=128)

# Decode the generated output
print(tokenizer.decode(res[0], skip_special_tokens=True))

low_cpu_mem_usage was None, now default to True since model is quantized.
Loading checkpoint shards: 100% 2/2 [00:00<00:00, 2.49it/s]
[EET][INFO] attention fusion processiong...: 0it [00:00, ?it/s]
[EET][INFO] replace with eet weight quantize only linear...: 100%|██████████| 32/32 [00:00<00:00, 25176.84it/s]
Who is Napoleon Bonaparte?

Napoleon Bonaparte (1769-1821) was a French military and political leader who rose to prominence during the French Revolution and its associated wars. He was Emperor of the French from 1804 until 1815, when he was defeated in the Napoleonic Wars and exiled to the island of Saint Helena, where he died.

Napoleon was born on the island of Corsica and studied at the École Militaire in Paris. He quickly rose through the ranks of the French
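
The reply above stops mid-sentence, most likely because `max_length=128` caps the prompt and the generated tokens together. Here is a sketch of a small helper that uses `max_new_tokens` instead, so the budget applies only to the generated text. `generate_reply` is a hypothetical name, and `model` and `tokenizer` are assumed to be the objects loaded earlier in this thread.

```python
def generate_reply(model, tokenizer, text, max_new_tokens=256, device="cuda:0"):
    """Tokenize text, generate a continuation, and return the decoded string."""
    # Move the prompt ids to the same device as the model.
    input_ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
    # max_new_tokens counts only generated tokens, unlike max_length,
    # which also counts the prompt.
    output_ids = model.generate(input_ids, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Usage would look like `print(generate_reply(model, tokenizer, "Who is Napoleon Bonaparte?"))`, which also avoids repeating the tokenize, generate, and decode steps for every prompt.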

Dec 18 '24 00:12 werruww

text = "Write a Python code to print the letter X 100 times."

# Tokenize the input text
input_ids = tokenizer(text, return_tensors="pt").input_ids.to("cuda:0")

# Generate text
res = model.generate(input_ids, max_length=128)

# Decode the generated output
print(tokenizer.decode(res[0], skip_special_tokens=True))

Write a Python code to print the letter X 100 times.

Here is the solution:

print("X")
for i in range(1, 101):
    print("X")

Explanation: The print() function is used to print the letter X. The range() function is used to create a sequence of numbers from 1 to 100. The for loop is used to iterate over the sequence and print the letter X for each number.

Alternatively, you can use a list comprehension to print
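
Note that the generated snippet calls `print("X")` once before the loop and then 100 more times inside it, so it prints 101 lines rather than 100. For comparison, a minimal version that prints exactly 100 lines could look like this:

```python
# Build exactly 100 lines, one "X" per line, and print them in a single call.
lines = ["X"] * 100
print("\n".join(lines))
```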

Dec 18 '24 01:12 werruww

Does the library support 4bit?

Dec 18 '24 01:12 werruww

> Does the library support 4bit?

No, we do not support 4-bit. 4-bit quantization would cause a significant accuracy degradation.
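
To illustrate why fewer bits cost accuracy, here is a toy sketch of symmetric round-to-nearest quantization applied to random weights at 8 and 4 bits. This is not EETQ's kernel: the function names, the Gaussian weights, and the single per-tensor scale are all assumptions made for illustration only.

```python
import random


def quantize_roundtrip(weights, bits):
    """Quantize to signed `bits`-bit integers with one scale, then dequantize."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits, 7 for 4 bits
    scale = max(abs(w) for w in weights) / qmax
    quantized = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return [q * scale for q in quantized]


def mean_abs_error(weights, bits):
    """Average absolute difference between original and round-tripped weights."""
    restored = quantize_roundtrip(weights, bits)
    return sum(abs(w - r) for w, r in zip(weights, restored)) / len(weights)


random.seed(0)
# Hypothetical weight tensor: 4096 values drawn from a narrow Gaussian.
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]
err8 = mean_abs_error(weights, 8)
err4 = mean_abs_error(weights, 4)
print(f"8-bit mean abs error: {err8:.6f}")
print(f"4-bit mean abs error: {err4:.6f}")
```

In this toy setup the 4-bit grid has only 7 levels per side instead of 127, so its step size and round-trip error are much larger. Practical 4-bit schemes typically use group-wise scales to narrow that gap, which this sketch does not attempt.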

Dec 18 '24 02:12 dtlzhuangz

https://github.com/werruww/EETQ-quantization/blob/main/suc_EETQ.ipynb

Feb 18 '25 02:02 werruww

It runs.
