sparseml
Error converting mistral to onnx
Describe the bug
Error converting mistral to onnx
Expected behavior
!pip install virtualenv
!virtualenv myenv
!source /content/myenv/bin/activate
!git clone https://github.com/neuralmagic/sparseml
#!pip install sparseml
!pip install -e "sparseml[transformers]"
#!pip uninstall transformers
#!pip install nm-transformers
!python sparseml/src/sparseml/transformers/sparsification/obcq/obcq.py OpenBuddy/openbuddy-mistral-7b-v13.1 open-platypus --recipe recipe.yaml --device cuda:0 --precision float16 --save True
Environment
Include all relevant environment information:
- OS [e.g. Ubuntu 18.04]: 22
- Python version [e.g. 3.7]: 3.11
- SparseML version or commit hash [e.g. 0.1.0, f7245c8]:
- ML framework version(s) [e.g. torch 1.7.1]:
- Other Python package versions [e.g. SparseZoo, DeepSparse, numpy, ONNX]:
- Other relevant environment information [e.g. hardware, CUDA version]:
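A quick, hedged way to gather the version information the template asks for (assumes the packages are installed in the active environment):

import platform
from importlib import metadata

# Print the versions of the packages listed in the issue template.
for pkg in ("sparseml", "sparsezoo", "deepsparse", "torch", "transformers", "numpy", "onnx"):
    try:
        print(f"{pkg}: {metadata.version(pkg)}")
    except metadata.PackageNotFoundError:
        print(f"{pkg}: not installed")
print(f"Python: {platform.python_version()}")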
To Reproduce
Exact steps to reproduce the behavior:

Errors
!python sparseml/src/sparseml/transformers/sparsification/obcq/export.py --task text-generation --model_path obcq_deployment
!cp deployment/model.onnx deployment/model-orig.onnx
Traceback (most recent call last):
File "/content/sparseml/src/sparseml/transformers/sparsification/obcq/export.py", line 542, in <module>
main()
File "/content/sparseml/src/sparseml/transformers/sparsification/obcq/export.py", line 529, in main
export(
File "/content/sparseml/src/sparseml/transformers/sparsification/obcq/export.py", line 507, in export
export_transformer_to_onnx(
File "/content/sparseml/src/sparseml/transformers/sparsification/obcq/export.py", line 345, in export_transformer_to_onnx
export_onnx(
File "/content/sparseml/src/sparseml/pytorch/utils/exporter.py", line 488, in export_onnx
out = tensors_module_forward(sample_batch, module, check_feat_lab_inp=False)
File "/content/sparseml/src/sparseml/pytorch/utils/helpers.py", line 414, in tensors_module_forward
return module(**tensors)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/mistral/modeling_mistral.py", line 1083, in forward
outputs = self.model(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/mistral/modeling_mistral.py", line 970, in forward
layer_outputs = decoder_layer(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/mistral/modeling_mistral.py", line 659, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/mistral/modeling_mistral.py", line 299, in forward
query_states = self.q_proj(hidden_states)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/ao/quantization/stubs.py", line 63, in forward
X = self.module(X)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1568, in _call_impl
result = forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/ao/nn/qat/modules/linear.py", line 41, in forward
return F.linear(input, self.weight_fake_quant(self.weight), self.bias)
RuntimeError: "addmm_impl_cpu_" not implemented for 'Half'
cp: cannot stat 'deployment/model.onnx': No such file or directory
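For context, a minimal sketch (not from the thread) of the root cause: depending on the PyTorch version, the CPU matmul kernel behind nn.Linear is not implemented for float16, so a model saved with --precision float16 fails as soon as a linear layer runs on CPU. Casting the module back to float32 before a CPU export is the usual workaround.

import torch

# A float16 linear layer on CPU, analogous to q_proj in the traceback above.
lin = torch.nn.Linear(8, 8).half()
x = torch.randn(1, 8, dtype=torch.float16)

try:
    lin(x)  # on older torch CPU builds: RuntimeError: "addmm_impl_cpu_" not implemented for 'Half'
except RuntimeError as err:
    print(err)

# Casting the weights back to float32 avoids the error on CPU.
lin.float()
print(lin(x.float()).dtype)  # torch.float32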
I also want to add that after quantization and optimization the model stays the same size, even though the recipe specifies 8-bit quantization.
Hey @meomeomeome
Regarding your export issue, please use the following entrypoint for export;
sparseml.export --task text-generation --model_path obcq_deployment
Regarding the model size issue, could you paste here an artifact that illustrates the comparison? Perhaps some stdout from du -sh * or tree ?
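If it helps, a small hedged sketch for producing such a comparison from Python (the directory name is an assumption; adjust it to your layout):

import os

# Print per-file sizes of the exported deployment directory to compare against the dense model.
root = "obcq_deployment"  # assumption: path to the OBCQ output / deployment folder
total = 0
for entry in sorted(os.scandir(root), key=lambda e: e.name):
    if entry.is_file():
        size = entry.stat().st_size
        total += size
        print(f"{size / 1024 ** 2:10.1f} MB  {entry.name}")
print(f"{total / 1024 ** 3:10.2f} GB  total")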
sparseml.export --task text-generation --model_path obcq_deployment
In your instructions at https://github.com/neuralmagic/sparseml/tree/main/src/sparseml/transformers/sparsification/obcq
Model preparation is done with this command
python sparseml/src/sparseml/transformers/sparsification/obcq/obcq.py HuggingFaceH4/zephyr-7b-beta open_platypus --recipe recipe.yaml --precision float16 --save True
That is, we load the model in float16 format.
Next comes the conversion script:
python sparseml/src/sparseml/transformers/sparsification/obcq/export.py --task text-generation --model_path obcq_deployment
which is unable to perform half-precision operations on the CPU.
I studied the Python files
sparseml/src/sparseml/transformers/sparsification/obcq/export.py
and
src/sparseml/pytorch/utils/exporter.py from the library; there the model is explicitly loaded onto the CPU.
Does your suggestion
sparseml.export --task text-generation --model_path obcq_deployment
solve the problem of exporting a float16 model to ONNX? Which library files are used for this, and how is the incompatibility of CPU operations with float16 resolved?
Regarding the model size: from my experience with TinyLlama, I realized that the final size reduction only happens after the complete conversion to ONNX in the deployment folder.
sparseml.export --task text-generation --model_path obcq_deployment
does not have a --model_path option.
Also, I can't run the conversion to ONNX to completion, because the process gets killed after consuming 83 GB of memory, even though the model's bin files are only 15 GB.
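A hedged sketch for documenting that blow-up: run the export in a subprocess and sample its resident memory with psutil (the exact sparseml.export arguments follow the maintainer's suggestion further down in this thread):

import subprocess
import time
import psutil

# Launch the export as a child process and track the peak RSS of it and its children.
cmd = ["sparseml.export", "obcq_deployment", "--task", "text-generation", "--sequence_length", "64"]
child = subprocess.Popen(cmd)
proc = psutil.Process(child.pid)
peak = 0
while child.poll() is None:
    try:
        rss = proc.memory_info().rss + sum(c.memory_info().rss for c in proc.children(recursive=True))
    except psutil.NoSuchProcess:
        break
    peak = max(peak, rss)
    time.sleep(1)
print(f"Peak RSS during export: {peak / 1024 ** 3:.1f} GiB")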
Let me take a look, will come back to you shortly
Hey @meomeomeome
Short update from my side: I tried to recreate your problem locally.
- I generated your obcq_deployment directory.
- Exported the model using sparseml.export obcq_deployment --trust_remote_code --sequence_length 64 --task text-generation. I confirm that the export takes a prohibitively large amount of CPU memory. However, by specifying the --sequence_length {int} argument, you can potentially reduce your peak memory consumption (see the rough arithmetic sketch after this list). Setting it to something smaller like 32 or 64 should work, but will naturally limit the capabilities of your model. This is a big issue and something that we are currently working on.
- I was also able to reproduce the export error in obcq/export.py (RuntimeError: "addmm_impl_cpu_" not implemented for 'Half'). Not sure why you see this using that pathway. While we are looking into the issues, please note that this pathway will over time get deprecated in favor of sparseml.export.
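A rough, hedged back-of-the-envelope for why --sequence_length drives peak memory: every hidden-state tensor traced through the model during export has shape [batch, seq_len, hidden_size], so the traced activations (and any cache tensors) grow linearly with the sequence length. The dimensions below assume a Mistral-7B-like config and a full-precision trace:

# Assumed Mistral-7B-like dimensions; 4 bytes per element for a float32 trace.
hidden_size, num_layers, bytes_per_el = 4096, 32, 4

def hidden_state_bytes(seq_len, batch=1):
    # One hidden-state tensor of shape [batch, seq_len, hidden_size]
    return batch * seq_len * hidden_size * bytes_per_el

for seq_len in (64, 512, 4096):
    per_tensor = hidden_state_bytes(seq_len)
    print(f"seq_len={seq_len:5d}: {per_tensor / 1024 ** 2:7.1f} MiB per hidden state, "
          f"{per_tensor * num_layers / 1024 ** 2:8.1f} MiB across {num_layers} layers")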
"Setting it to something smaller like 32 or 64 should work, but will naturally limit the capabilities of your model" - What do you mean? Will this slow down the export process, or will the model lose quality after exporting to ONNX? P.S. The base model's context window is 4096.
When running in our deepsparse pipeline you will not be able to generate more than e.g. 64 - num_tokens(prompt) tokens in a single inference. This will, however, reduce peak memory consumption as well as accelerate the export process.
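In other words (a tiny illustrative calculation, with an assumed prompt length):

# Token budget when the model was exported with --sequence_length 64.
sequence_length = 64   # value passed at export time
prompt_tokens = 20     # assumption: the prompt tokenizes to 20 tokens
max_new_tokens = sequence_length - prompt_tokens
print(max_new_tokens)  # at most 44 tokens can be generated in a single inference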
Does this apply only to the pipeline, or to other inference methods as well?
import psutil
import time

# `model` is the deepsparse TextGeneration pipeline loaded earlier (construction not shown here).

# Get memory and CPU information
memory_usage = psutil.virtual_memory()
cpu_frequency = psutil.cpu_freq()
print(f"Total Memory: {memory_usage.total / (1024 ** 3)} GB")
print(f"Memory Used: {memory_usage.used / (1024 ** 3)} GB")
# Get CPU information
cpu_frequency = psutil.cpu_freq(percpu=True)
cpu_count = psutil.cpu_count(logical=False)
cpu_logical_count = psutil.cpu_count(logical=True)
cpu_model = None
with open("/proc/cpuinfo", "r") as f:
    for line in f:
        if "model name" in line:
            cpu_model = line.strip().split(":")[1].strip()
            break
# Print the information
print(f"CPU Model: {cpu_model}")
print(f"Physical Cores: {cpu_count}")
print(f"Logical Cores (including hyperthreading): {cpu_logical_count}")
for i, freq in enumerate(cpu_frequency):
    print(f"Core {i}: {freq.current / 1000:.2f} GHz")
print(f"Total CPU Frequency: {psutil.cpu_freq().current / 1000:.2f} GHz")

prompt = "How to make banana bread?"
formatted_prompt = f"<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n"
# Measurement before inference
memory_before = psutil.virtual_memory().used
start_time = time.time()
output = model(formatted_prompt, max_new_tokens=500).generations[0].text
end_time = time.time()
# Measurement after inference
memory_after = psutil.virtual_memory().used
print(f"Inference Time: {end_time - start_time} seconds")
print(f"Memory Used During Inference: {(memory_after - memory_before) / (1024 ** 2)} MB")
Result and speed: model loaded via from deepsparse import TextGeneration; TinyLlama takes 1.19 GB in memory (converted with sequence_length 128) -- 19 seconds!!
Total Memory: 50.993690490722656 GB
Memory Used: 3.8117218017578125 GB
CPU Model: Intel(R) Xeon(R) CPU @ 2.20GHz
Physical Cores: 4
Logical Cores (including hyperthreading): 8
Core 0: 2.20 GHz
Core 1: 2.20 GHz
Core 2: 2.20 GHz
Core 3: 2.20 GHz
Core 4: 2.20 GHz
Core 5: 2.20 GHz
Core 6: 2.20 GHz
Core 7: 2.20 GHz
Total CPU Frequency: 2.20 GHz
Inference Time: 19.88378143310547 seconds
Memory Used During Inference: 2.390625 MB
Banana bread is a delicious and nutty bread that is easy to make. Here is a recipe for banana bread:
Ingredients:
1 1/2 cups flour
1/2 cup sugar
1/2 cup baking powder
1/2 cup whole milk
1/4 cup oil
1/4 cup eggs
1/4 cup raisins
1/4 cup raisin bread crumbs
1/4 cup pecans
Salt
Sugar
Bread
Flour
Water
Butter
Oil
eggs
milk
raisins
raisin bread crumbs
pecans
salt
baking powder
oil
eggs
milk
flour
sugar
bread
oil
eggs
milk
raisins
raisin bread crumbs
pecans
salt
baking powder
oil
eggs
milk
flour
sugar
bread
oil
eggs
milk
raisins
raisin bread crumbs
pecans
salt
baking powder
oil
eggs
milk
flour
sugar
bread
oil
eggs
milk
raisins
raisin bread crumbs
pecans
salt
baking powder
oil
eggs
milk
flour
sugar
bread
oil
eggs
milk
raisins
raisin bread crumbs
pecans
salt
baking powder
oil
eggs
milk
flour
sugar
bread
oil
eggs
milk
raisins
raisin bread crumbs
pecans
salt
baking powder
oil
eggs
milk
flour
sugar
bread
oil
eggs
milk
raisins
raisin bread crumbs
pecans
salt
And I have 2 questions:
Does this apply only to the pipeline or to any other inference methods? (seq_l) And which method gives the fastest inference? (I'm interested in loading the model from my disk and from memory.)
I do not understand the two questions, could you rephrase them, please?
I imagine that if you run the exported post-obcq ONNX model in the deepsparse pipeline (as you do above), setting a small sequence_length on the export may mess up some models. This is because the sequence_length set during export influences the size of the positional embeddings available for the exported model. As a result, you may get unexpected errors. I see that you are getting satisfying results for your model, so maybe that is not the case for this particular network.
@mgoin Could you take a look? Is my hypothesis more or less correct?
As I understand it, no one knows how to solve the export problem without limiting the context window with --sequence_length 64. If you leave it the same as in the base model, exporting a model whose original size is 15 GB consumes the entire 83 GB of memory. Does anyone know a methodology for solving this, e.g. via batch size or by processing the model in parts?
@meomeomeome This is a known issue: exporting requires a lot of memory, depending on the sequence_length. We'll be noting this as a known issue in the pending 1.7 product release.
Hello @meomeomeome. A heads up that 1.7 recently went out. We hope this can address the issue you faced. Thank you! Jeannie / Neural Magic