🦙 Jlama: A modern LLM inference engine for Java
🚀 Features
Model Support:
- Gemma Models
- Llama & Llama2 & Llama3 Models
- Mistral & Mixtral Models
- GPT-2 Models
- BERT Models
- BPE Tokenizers
- WordPiece Tokenizers
Implements:
- Flash Attention
- Mixture of Experts
- Huggingface SafeTensors model and tokenizer format
- Support for F32, F16, BF16 types
- Support for Q8, Q4 model quantization
- Fast GEMM operations
- Distributed Inference!
Jlama requires Java 20 or later and utilizes the new Vector API for faster inference.
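Because the Vector API ships as an incubator module (and some of the memory APIs Jlama relies on were still preview features on earlier JDKs), you typically launch your application with extra JVM flags. The line below is only a sketch, so check the Jlama docs for your JDK version; your-app.jar is just a placeholder:

# your-app.jar is a placeholder for your own application jar
java --add-modules jdk.incubator.vector --enable-preview -jar your-app.jar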
⭐ Give us a star!
Like what you see? Please consider giving this a star (⭐)!
🤔 What is it used for?
Add LLM Inference directly to your Java application.
🎬 Demo
Jlama includes a simple UI if you just want to chat with an LLM:
./run-cli.sh download tjake/llama2-7b-chat-hf-jlama-Q4
./run-cli.sh restapi models/llama2-7b-chat-hf-jlama-Q4
Then open your browser to http://localhost:8080/
👨‍💻 How to use in your Java project
The simplest way to use Jlama is with the LangChain4j integration.
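As a rough sketch of that route (the dev.langchain4j:langchain4j-jlama artifact, the JlamaChatModel builder options, and the model name below are assumptions, so check the LangChain4j documentation for the exact coordinates and API):

import dev.langchain4j.model.jlama.JlamaChatModel;

public class JlamaLangChain4jExample {
    public static void main(String[] args) {
        // Build a chat model backed by Jlama (assumed builder API; the model is
        // downloaded from Hugging Face on first use)
        JlamaChatModel model = JlamaChatModel.builder()
                .modelName("tjake/TinyLlama-1.1B-Chat-v1.0-Jlama-Q4")
                .temperature(0.7f)
                .build();

        // Ask a question and print the answer
        System.out.println(model.generate("What is the best season to plant avocados?"));
    }
}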
Jlama also includes an OpenAI-compatible chat completion API that can be used with many tools in the AI ecosystem:
./run-cli.sh restapi tjake/llama2-7b-chat-hf-jlama-Q4
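Once the REST API is running, any OpenAI-style client can call it. Here is a minimal sketch using the JDK's built-in HttpClient; the /v1/chat/completions path and port 8080 are assumptions based on the OpenAI convention and the demo above, so adjust them to your deployment:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ChatCompletionExample {
    public static void main(String[] args) throws Exception {
        // Request body in the OpenAI chat completion format
        String body = """
                {
                  "messages": [
                    {"role": "user", "content": "Tell me a joke about cats."}
                  ],
                  "temperature": 0.7
                }
                """;

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8080/v1/chat/completions")) // assumed endpoint
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        // Send the request and print the raw JSON response
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}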
If you would like to embed Jlama directly, add the following maven dependencies to your project:
<dependency>
    <groupId>com.github.tjake</groupId>
    <artifactId>jlama-core</artifactId>
    <version>${jlama.version}</version>
</dependency>
<dependency>
    <groupId>com.github.tjake</groupId>
    <artifactId>jlama-native</artifactId>
    <!-- supports linux-x86_64, macos-x86_64/aarch_64, windows-x86_64
         Use https://github.com/trustin/os-maven-plugin to detect os and arch -->
    <classifier>${os.detected.name}-${os.detected.arch}</classifier>
    <version>${jlama.version}</version>
</dependency>
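The ${os.detected.name} and ${os.detected.arch} properties are not set by Maven itself; one way to have them resolved is to register os-maven-plugin as a build extension, roughly like this (the version number is only illustrative):

<build>
    <extensions>
        <extension>
            <groupId>kr.motd.maven</groupId>
            <artifactId>os-maven-plugin</artifactId>
            <version>1.7.1</version>
        </extension>
    </extensions>
</build>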
Then you can use the Model classes to run models:
public void sample() throws IOException {
    String model = "tjake/TinyLlama-1.1B-Chat-v1.0-Jlama-Q4";
    String workingDirectory = "./models";
    String prompt = "What is the best season to plant avocados?";

    // Downloads the model or just returns the local path if it's already downloaded
    File localModelPath = SafeTensorSupport.maybeDownloadModel(workingDirectory, model);

    // Loads the quantized model and specifies use of quantized memory
    AbstractModel m = ModelSupport.loadModel(localModelPath, DType.F32, DType.I8);

    // Checks if the model supports chat prompting and adds the prompt in the expected format for this model
    if (m.promptSupport().isPresent()) {
        prompt = m.promptSupport().get().newBuilder()
                .addSystemMessage("You are a helpful chatbot who writes short responses.")
                .addUserMessage(prompt)
                .build();
    }

    System.out.println("Prompt: " + prompt + "\n");

    // Generates a response to the prompt and prints it
    // The API allows for streaming or non-streaming responses
    // The response is generated with a temperature of 0.7 and a max token length of 256
    GenerateResponse r = m.generate(UUID.randomUUID(), prompt, 0.7f, 256, false, (s, f) -> System.out.print(s));
    System.out.println(r.toString());
}
🕵️‍♀️ How to use as a local client
Jlama includes a CLI tool for running models via the run-cli.sh command.
Before doing that, first download one or more models from Hugging Face with the ./run-cli.sh download command:
./run-cli.sh download gpt2-medium
./run-cli.sh download -t XXXXXXXX meta-llama/Llama-2-7b-chat-hf
./run-cli.sh download intfloat/e5-small-v2
Then run the cli tool to chat with the model or complete a prompt.
Quantization is supported with the -q flag, or you can use the pre-quantized models located in my Hugging Face repo.
./run-cli.sh complete -p "The best part of waking up is " -t 0.7 -tc 16 -q Q4 -wq I8 models/Llama-2-7b-chat-hf
./run-cli.sh chat -s "You are a professional comedian" models/llama2-7b-chat-hf-jlama-Q4
🧪 Examples
Llama 2 7B
You: Tell me a joke about cats. Include emojis.
Jlama: Sure, here's a joke for you:
Why did the cat join a band? 😸🐱
Because he wanted to be the purr-fect drummer! 😹🐾
I hope you found that purr-fectly amusing! 😸🐱
elapsed: 11s, prompt 38.0ms per token, gen 146.2ms per token
You: Another one
Jlama: Of course! Here's another one:
Why did the cat bring a ball of yarn to the party? 🧶
Because he wanted to have a paw-ty! 😹
I hope that one made you smile! 🐱
elapsed: 11s, prompt 26.0ms per token, gen 148.4ms per token
🗺️ Roadmap
- Support more and more models
- Add pure java tokenizers
- Support Quantization (e.g. k-quantization)
- Add LoRA support
- GraalVM support
- Add distributed inference
🏷️ License and Citation
The code is available under the Apache License.
If you find this project helpful in your research, please cite this work:
@misc{jlama2024,
    title = {Jlama: A modern Java inference engine for large language models},
    url = {https://github.com/tjake/jlama},
    author = {T Jake Luciani},
    month = {January},
    year = {2024}
}