
An innovative library for efficient LLM inference via low-bit quantization

40 neural-speed issues

## Type of Change Stability AI open-sourced StableLM-2-12B, which has a different architecture from its 1.6B and 3B counterparts. This PR adds support for these models: stabilityai/stablelm-2-12b and stabilityai/stablelm-2-12b-chat. ##...

```python
from transformers import AutoTokenizer
from neural_speed import Model

model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = Model()
model.init(model_name, use_quant=True, weight_dtype="int4", compute_dtype="int8")
tokens = tokenizer("What's your favorite animal?", return_tensors='pt').input_ids
outputs = model.generate(tokens, num_beams=2, do_sample=False, max_new_tokens=10)
text = ...
```
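The `generate` call above requests beam search (`num_beams=2`) with sampling disabled (`do_sample=False`). As background on what those flags mean, here is a toy beam search over a fixed table of per-step log-probabilities (all values hypothetical, unrelated to any real model):

```python
import math

# Toy per-step log-probabilities over a 3-token vocabulary (hypothetical values,
# just to illustrate how num_beams=2 expands and then prunes candidates).
step_logprobs = [
    {"a": math.log(0.6), "b": math.log(0.3), "c": math.log(0.1)},
    {"a": math.log(0.2), "b": math.log(0.5), "c": math.log(0.3)},
]

def beam_search(steps, num_beams=2):
    # Each beam is (token_sequence, cumulative_logprob).
    beams = [((), 0.0)]
    for dist in steps:
        # Expand every beam by every vocabulary token...
        candidates = [
            (seq + (tok,), score + lp)
            for seq, score in beams
            for tok, lp in dist.items()
        ]
        # ...then keep only the num_beams highest-scoring candidates.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
    return beams

best_seq, best_score = beam_search(step_logprobs)[0]
print(best_seq)  # the highest-probability 2-token sequence
```

With `do_sample=False` no randomness is involved: the result is fully determined by the scores, which is why the snippet above produces reproducible output.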

Please add support for the phi-3-mini-128k (128k context length) model in neural-speed.

When I load the "meta-llama/Meta-Llama-3-8B-Instruct" model like this:

```python
from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"  # Hugging Face model_id or local model
tokenizer = AutoTokenizer.from_pretrained(model_name, ...
```

I'm not well versed with Python: where do I put the downloaded llama-2-7b-chat.Q4_0.gguf file? I can get llama.cpp working easily on my laptop, but I can't seem to...

## Type of Change: feature or bug fix or documentation or others; API changed or not. ## Description: detail description. Issues: xxx. ## Expected Behavior & Potential Risk: the expected...

I’ve discovered a performance gap between the Neural Speed Matmul operator and the Llama.cpp operator in the Neural-Speed repository. This issue was identified while running a benchmark with the ONNXRuntime-GenAI...
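When reporting a gap like this, a best-of-N wall-clock harness is the usual way to make the measurement reproducible. A pure-Python sketch (the naive matmul below is only a stand-in for whichever operator is actually being benchmarked):

```python
import random
import time

def naive_matmul(a, b):
    # a: m x k, b: k x n, both plain lists of lists.
    m, k, n = len(a), len(b), len(b[0])
    out = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for p in range(k):
            aip = a[i][p]
            row_b = b[p]
            row_o = out[i]
            for j in range(n):
                row_o[j] += aip * row_b[j]
    return out

def bench(fn, *args, repeats=5):
    # Best-of-repeats wall time, the usual micro-benchmark convention:
    # it filters out one-off scheduler noise and cold-cache effects.
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - t0)
    return best

random.seed(0)
a = [[random.random() for _ in range(64)] for _ in range(64)]
b = [[random.random() for _ in range(64)] for _ in range(64)]
print(f"naive matmul 64x64: {bench(naive_matmul, a, b):.4f}s")
```

Swapping the two operators under comparison into `bench` with identical inputs is what makes the reported numbers directly comparable.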

enhancement

## Type of Change: feature or bug fix or documentation or others; API changed: not. ## Description: Add a llama2 model accuracy UT for the PIQA task, llama2 RTN quant...

CI

I understand this is an Intel repo, but curious: will AMD work as well, or... what kind of architecture / Intel chipset is best used with this repo? About to...

An example of TP (tensor parallelism) is provided in the Neural Speed documentation:

```shell
mpirun -np 2 -bind-to=socket ./build/bin/main_gptj -m ne-q4_0.bin --seed 1234 -t 56 -c 68 -n 32 -p "Once upon a...
```
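The `mpirun -np 2` launch above splits the model across two ranks. Conceptually, each rank owns a shard of each weight matrix, computes a partial result, and the partials are gathered; a toy pure-Python sketch of that row-sharded layout (not Neural Speed's actual MPI implementation):

```python
def matvec(w, x):
    # w: n_out x n_in (each row produces one output element), x: length n_in.
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def row_shards(w, num_ranks):
    # Each rank owns a contiguous block of output rows.
    per = (len(w) + num_ranks - 1) // num_ranks
    return [w[i * per:(i + 1) * per] for i in range(num_ranks)]

w = [[1, 2], [3, 4], [5, 6], [7, 8]]
x = [1, 1]

shards = row_shards(w, 2)                      # what -np 2 distributes
partials = [matvec(s, x) for s in shards]      # each "rank" computes its rows
gathered = [y for part in partials for y in part]  # the all-gather step

assert gathered == matvec(w, x)                # same result as the unsharded matvec
```

The `-bind-to=socket` flag in the command pins each rank to one CPU socket so that each shard's memory traffic stays local to that socket.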