ipex-llm
Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, etc.) on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Ma...
What are the default values of max_generated_tokens, top_k, top_p, and temperature? If the user doesn't set all parameters in `generate_kwargs`, as in the example below, it should use default values. How do...
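A minimal sketch of how unset sampling parameters usually resolve. The values below mirror the Hugging Face `GenerationConfig` defaults (`temperature=1.0`, `top_k=50`, `top_p=1.0`, and a fallback output length of 20 tokens); the exact defaults a given integration applies may differ, so treat them as illustrative only, and `resolve_generate_kwargs` is a hypothetical helper, not an ipex-llm API:

```python
# Illustrative defaults, modeled on Hugging Face GenerationConfig.
DEFAULT_GENERATE_KWARGS = {
    "max_new_tokens": 20,   # HF falls back to max_length=20 when unset
    "temperature": 1.0,
    "top_k": 50,
    "top_p": 1.0,
}

def resolve_generate_kwargs(user_kwargs):
    """Merge user-supplied generate_kwargs over the defaults."""
    merged = dict(DEFAULT_GENERATE_KWARGS)
    merged.update(user_kwargs)
    return merged

print(resolve_generate_kwargs({"temperature": 0.7}))
# {'max_new_tokens': 20, 'temperature': 0.7, 'top_k': 50, 'top_p': 1.0}
```

Any key the user supplies overrides the default; everything else falls through unchanged.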
I am trying to transform a string into a llama2-specific or llama3-specific prompt in the function `completion_to_prompt()`. Is there a way to pass the parameter **model_option** as an input? Or else, I...
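One common way to thread an extra parameter into a one-argument callback is `functools.partial`. A sketch under assumptions: `model_option` is the hypothetical parameter from the question, and the templates follow the published Llama 2 (`<s>[INST] ... [/INST]`) and Llama 3 (header-token) chat formats:

```python
from functools import partial

def completion_to_prompt(completion, model_option="llama2"):
    # model_option is a hypothetical parameter added for illustration.
    if model_option == "llama3":
        return ("<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
                f"{completion}<|eot_id|>"
                "<|start_header_id|>assistant<|end_header_id|>\n\n")
    return f"<s>[INST] {completion} [/INST]"

# A framework that expects a one-argument callback can be handed a bound copy:
llama3_prompt = partial(completion_to_prompt, model_option="llama3")
```

`llama3_prompt("Hello")` then produces the Llama 3 formatting without the caller ever seeing `model_option`.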
Logs use 'bigdl-llm' while converting and loading models into q4 binary format; they should use `ipex-llm`.

```
bigdl-llm: loading model from ./bigdl_llm_llama_q4_0.bin
loading bigdl-llm model: format = ggjt v3 (latest)
loading...
```
## Description
Gemma shares the same RotaryEmbedding layer as phi3.
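For context on why the layer can be shared: rotary position embedding is model-agnostic; it only rotates consecutive pairs of head dimensions by a position-dependent angle. A dependency-free sketch of the standard formulation (not the actual ipex-llm kernel, which operates on batched tensors):

```python
import math

def rotary_embed(x, pos, base=10000.0):
    """Apply rotary position embedding to one head vector x (even length).

    Pair (x[i], x[i+1]) is rotated by theta = pos * base**(-i/d). Because
    the formula depends only on position and head size, one implementation
    can serve multiple architectures (e.g. Gemma and phi3).
    """
    d = len(x)
    out = list(x)
    for i in range(0, d, 2):
        theta = pos * (base ** (-i / d))
        c, s = math.cos(theta), math.sin(theta)
        out[i] = x[i] * c - x[i + 1] * s
        out[i + 1] = x[i] * s + x[i + 1] * c
    return out
```

At position 0 the rotation is the identity, and the vector norm is always preserved.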
I want to switch between the llama2-7b-chat and llama3-8b models, but it costs a lot of memory if I load both. How do I clear one when I am going to load...
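A minimal sketch of the usual swap pattern: drop every Python reference to the first model, force garbage collection, then load the second. `load_model` is a placeholder for the real `from_pretrained` call, and the XPU cache release (which requires PyTorch with IPEX) is left commented out so the sketch runs anywhere:

```python
import gc

def load_model(name):
    # Placeholder for AutoModelForCausalLM.from_pretrained(name, ...)
    return {"name": name}

model = load_model("llama2-7b-chat")
# ... run inference ...

# Drop every reference before loading the next model, then force collection.
del model
gc.collect()
# On an Intel GPU, also release cached device memory (requires IPEX):
# import torch; torch.xpu.empty_cache()

model = load_model("llama3-8b")
```

Note that `del` only removes the name; memory is freed only once no other reference (pipeline objects, notebook output cells, etc.) still points at the model.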
## Description
Initial patch function for inference.

### 2. User API changes
Support `llm_patch(train=False, device='xpu', load_in_low_bit='sym_int4')`. Only the following code needs to be added at the beginning to run Hugging Face inference...
## Description
Add continuous-batching-like partial prefilling to reduce the memory peak during prefilling.
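The idea above can be sketched in a few lines: instead of running the whole prompt through the model at once, feed it in fixed-size chunks and carry the KV cache forward, so peak activation memory scales with the chunk size rather than the prompt length. A hypothetical illustration of the technique, not the actual ipex-llm implementation:

```python
def chunked_prefill(tokens, chunk_size, forward):
    """Run prefill over the prompt in fixed-size chunks.

    'forward' consumes one chunk plus the KV cache built so far and
    returns the extended cache; only chunk_size tokens' worth of
    activations are live at any point.
    """
    kv_cache = []
    for start in range(0, len(tokens), chunk_size):
        kv_cache = forward(tokens[start:start + chunk_size], kv_cache)
    return kv_cache

# Toy 'forward' that just appends the chunk to the cache:
cache = chunked_prefill(list(range(10)), 4, lambda chunk, kv: kv + chunk)
print(cache)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

After the loop the cache is identical to a single-pass prefill; only the peak memory profile differs.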
## Description
Support `q4_0_rtn`.
Hi, I saved the LLaVA model in 4-bit using `generate.py` from https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/PyTorch-Models/Model/llava

```
model = optimize_model(model)
# Added these lines below in generate.py
if SAVE_PATH:
    model.save_low_bit(save_path_model)
    tokenizer.save_pretrained(save_path_model)
    print(f"Model and tokenizer...
```
The code below works when I use the Mixtral model from Ollama directly, but when I use the IPEX-LLM optimized Mixtral model, the tool does not work. This is an easy...