How to run Llama-3 or Phi with more than 4096 prompt tokens?
Could you please show me an example where a Llama-3 model is used (ideally a GGUF-quantized one) and the initial prompt is more than 4096 tokens long? Or, better still, 16-64K tokens long (for RAG). Currently everything I try ends with an error on this call:

let logits = model.forward(&input, 0); // input is > 4096 tokens

Error: narrow invalid args start + len > dim_len: [4096, 64], dim: 0, start: 0, len: 4240
Model used: https://huggingface.co/MaziyarPanahi/Llama-3-8B-Instruct-64k-GGUF
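
For reference, here is a minimal sketch of what I am running, assuming the standard candle quantized_llama loading path; the GGUF file name is a placeholder for one of the quant files from the repo above, and the token ids stand in for my real tokenized prompt:

use candle_core::quantized::gguf_file;
use candle_core::{Device, Tensor};
use candle_transformers::models::quantized_llama::ModelWeights;

fn main() -> anyhow::Result<()> {
    let device = Device::Cpu;

    // Placeholder path: any quant file from the Llama-3-8B-Instruct-64k-GGUF repo.
    let mut file = std::fs::File::open("Llama-3-8B-Instruct-64k.Q4_K_M.gguf")?;
    let content = gguf_file::Content::read(&mut file)?;
    let mut model = ModelWeights::from_gguf(content, &mut file, &device)?;

    // Stand-in for the tokenized RAG prompt; the real one is > 4096 ids long.
    let tokens: Vec<u32> = vec![1u32; 4240];
    let input = Tensor::new(tokens.as_slice(), &device)?.unsqueeze(0)?;

    // This is the call that fails as soon as the prompt exceeds 4096 tokens:
    // Error: narrow invalid args start + len > dim_len: [4096, 64], ...
    let logits = model.forward(&input, 0)?;
    println!("{:?}", logits.shape());
    Ok(())
}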
Thank you very much in advance!