
Support for Phi-3 models

Open criminact opened this issue 1 year ago • 65 comments

Microsoft recently released Phi-3 models in 3 variants (mini, small & medium). Can we add support for this new family of models?

criminact avatar Apr 23 '24 15:04 criminact

image

Model directly works 👍

GGUF link: https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/blob/main/Phi-3-mini-4k-instruct-q4.gguf

Command: main -m Phi-3-mini-4k-instruct-q4.gguf -p "<|system|>\nYou are a helpful AI assistant.<|end|>\n<|user|>\nHow to explain Internet for a medieval knight?<|end|>\n<|assistant|>"
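
A minimal sketch of the prompt layout that command implies (inferred from the special tokens used in this thread, not from an official template file; the phi3_prompt helper and the multi-turn continuation are my assumptions):

# Sketch of the Phi-3 instruct prompt layout used in the command above.
# Assumptions: the <|system|>/<|user|>/<|assistant|>/<|end|> markers shown in
# this thread; phi3_prompt is an illustrative helper, not a real API.
def phi3_prompt(system, turns):
    # turns: list of (user_msg, assistant_msg_or_None) pairs
    out = f"<|system|>\n{system}<|end|>\n"
    for user, assistant in turns:
        out += f"<|user|>\n{user}<|end|>\n<|assistant|>"
        if assistant is not None:
            out += f"\n{assistant}<|end|>\n"  # assumed continuation for multi-turn
    return out

print(phi3_prompt("You are a helpful AI assistant.",
                  [("How to explain Internet for a medieval knight?", None)]))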

criminact avatar Apr 23 '24 15:04 criminact

Have you tested compatibility with the server? There probably needs to be a new prompt template since it's not compatible with the current ones AFAIK. Happy to dig into this in the next couple of days.

K-Mistele avatar Apr 23 '24 15:04 K-Mistele

I believe llama.cpp does not support LongRoPE, which is used by the 128k variant.

sorasoras avatar Apr 23 '24 16:04 sorasoras

I believe llama.cpp does not support LongRoPE, which is used by the 128k variant.

Yeah, I tried to convert the 128K version with python convert.py .... It raises NotImplementedError: Unknown rope scaling type: longrope

LiuChaoXD avatar Apr 23 '24 16:04 LiuChaoXD

Also getting NotImplementedError: Architecture 'Phi3ForCausalLM' not supported! from convert-hf-to-gguf.py.

MoonRide303 avatar Apr 23 '24 16:04 MoonRide303

@MoonRide303 Same error with convert-hf-to-gguf.py.

apepkuss avatar Apr 23 '24 16:04 apepkuss

Model directly works 👍

Only partially. MS is using a new RoPE technique they're calling "longrope". As-is, llama.cpp will work OK for the first few generations but will then abruptly go insane. This new longrope is likely the culprit.

candre23 avatar Apr 23 '24 16:04 candre23

Ah yes - it looks like they published the paper in April. Details here, PDF here

K-Mistele avatar Apr 23 '24 16:04 K-Mistele

This model is insane for its size.

Dampfinchen avatar Apr 23 '24 17:04 Dampfinchen

Template for llama.cpp:

main.exe --model models/new3/Phi-3-mini-4k-instruct-fp16.gguf --color --threads 30 --keep -1 --n-predict -1 --repeat-penalty 1.1 --ctx-size 0 --interactive -ins -ngl 99 --simple-io --in-prefix "<|user|>\n" --in-suffix "<|end|>\n<|assistant|>" -p "<|system|>You are a helpful assistant.<|end|>\n "

mirek190 avatar Apr 23 '24 17:04 mirek190

I have to admit it seems to be pretty smart, even the smallest 3.8B - it looks worthy of full support. Many 7B and bigger models struggled with these simple tests, yet Phi-3 handled them pretty nicely: image image

MoonRide303 avatar Apr 23 '24 17:04 MoonRide303

Tested with llama.cpp, fp16 and Q8 versions.

Do you also have a problem with it generating tokens until you manually stop it?

I had to add -r "----" -r "---" -r "<|end|>>" -r "### Answer:"

mirek190 avatar Apr 23 '24 18:04 mirek190

@mirek190 The model doesn't stop generating, you mean? Yeah, I faced it too. This PR handles it: #6851
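
Until that fix is in your build, a rough workaround sketch with llama-cpp-python (the -r flags above do the same thing for main; the model path is just an example):

# Workaround sketch: pass explicit stop strings so generation halts at the
# Phi-3 turn markers instead of running on. The model path is an example.
from llama_cpp import Llama

llm = Llama(model_path="Phi-3-mini-4k-instruct-q4.gguf")
out = llm(
    "<|user|>\nHow to explain Internet for a medieval knight?<|end|>\n<|assistant|>",
    max_tokens=256,
    stop=["<|end|>", "<|user|>"],  # stop as soon as a turn marker is emitted
)
print(out["choices"][0]["text"])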

criminact avatar Apr 23 '24 18:04 criminact

Screenshot 2024-04-23 190716

Not too bad ... not at llama 8B level, but still

phi-3

A father and son are in a car accident where the father is killed. The ambulance brings the son to the hospital. He needs immediate surgery. In the operating room, the surgeon looks at the boy and says "I can't operate on him, he's my son!"
<|end|>\n<|assistant|>This scenario appears to be a hypothetical situation rather than an actual event. However, it is intended to evoke thought about ethical dilemmas in the medical profession.

In reality, such situations are highly unlikely due to professional and personal boundaries that doctors maintain, even if they happen to know the patient personally through other connections (e.g., being friends or acquaintances).

To address this hypothetical situation responsibly:

1. The surgeon would not be able to perform surgery on their biological son based solely on familial relations due to ethical standards that prevent personal relationships from interfering with professional duties and patient care.

2. If the boy's father, who is no longer alive in this scenario, had appointed a close family member or friend as his medical proxy for such circumstances, then it would be possible for them (a relative) to make decisions on behalf of their son.

3. In such cases where an immediate decision must be made about the child's treatment but no designated proxy is available, hospital ethics committees or legal authorities may need to step in to determine the best course of action for the patient's well-being.

It is important to remember that real-life situations involving surgeons and their family members are handled with utmost care for the welfare of both parties involved, always prioritizing medical ethics and professional obligations.

llama 3

A father and son are in a car accident where the father is killed. The ambulance brings the son to the hospital. He needs immediate surgery. In the operating room, the surgeon looks at the boy and says "I can't operate on him, he's my son!"
<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nThis is a classic lateral thinking puzzle!

The answer lies in the fact that the surgeon is not the biological father of the boy. The correct interpretation is that the surgeon is an adoptive parent or a stepfather.

If we consider other possible scenarios, it's also possible that the surgeon is a woman who has adopted the son or is his stepmother. In any case, the key point is that the surgeon is not biologically related to the boy as his father.

Llama 3 is on a totally different level compared to phi-3 ...

mirek190 avatar Apr 23 '24 18:04 mirek190

Doing my part by adding the chat template :) https://github.com/ggerganov/llama.cpp/pull/6857

tristandruyen avatar Apr 23 '24 23:04 tristandruyen

The model seems to be working fine on my end, it's just endlessly generating text. ../llama.cpp/main --model /home/denis/Downloads/phi-3-mini-4k-instruct.Q8_0.gguf --n-gpu-layers 35 -ins --interactive --keep -1 --n-predict -1 --simple-io -b 2048 --ctx_size 0 --temp 0.1 --top_k 10 -mg 0 --multiline-input --repeat_penalty 1.12 -t 4 -r "/n>" -p <|system|>Your name is Alice. You are kind, honest, logical, precise, good at writing and mathematics assistant. <|end|> image

dspasyuk avatar Apr 24 '24 01:04 dspasyuk

Closing this since PR: https://github.com/ggerganov/llama.cpp/pull/6857 was merged into master with support for Phi-3 4K context length.

criminact avatar Apr 24 '24 11:04 criminact

What about 128k context length variant?

s-kostyaev avatar Apr 24 '24 12:04 s-kostyaev

Support for 128K context length seems pretty important to me for "Phi-3" support to be considered "done", right? @criminact

lukestanley avatar Apr 24 '24 12:04 lukestanley

Status: Phi-3 4K models are supported in master after https://github.com/ggerganov/llama.cpp/pull/6857 merge

Phi-3 128K models aren't supported yet (as of 24th Apr 2024)

criminact avatar Apr 24 '24 13:04 criminact

Template for llama.cpp:

main.exe --model models/new3/Phi-3-mini-4k-instruct-fp16.gguf --color --threads 30 --keep -1 --n-predict -1 --repeat-penalty 1.1 --ctx-size 0 --interactive -ins -ngl 99 --simple-io --in-prefix "<|user|>\n" --in-suffix "<|end|>\n<|assistant|>" -p "<|system|>You are a helpful assistant.<|end|>\n "

Are templates different for 4K vs. 128K?

phalexo avatar Apr 25 '24 16:04 phalexo

Hi guys, what should I do about this error? unknown model architecture: 'phi3'

I fine-tuned my own phi-3 and converted it to gguf with this command: python llama.cpp/convert-hf-to-gguf.py midesk-private --outfile midesk-private-gguf-4k-v0.0.gguf

I get the error when I run

from llama_cpp import Llama
llm = Llama(
      model_path="./midesk-private-gguf-4k-v0.0.gguf"
)

I would be very thankful for any help or push in the right direction.
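
One likely cause (not confirmed here): the llama.cpp build bundled with the bindings predates Phi-3 support, so it doesn't know the 'phi3' architecture even though the conversion succeeded. A quick sanity check, assuming a newer llama-cpp-python ships a recent enough llama.cpp (exact minimum version not verified):

# Check which llama-cpp-python is installed; if it's old, upgrading may help.
import llama_cpp
print(llama_cpp.__version__)  # then: pip install -U llama-cpp-python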

jtomek avatar Apr 25 '24 16:04 jtomek

With a reduced context size of 60000 I can load a 128K model. The prompting is still messed up, though.

./main --model /opt/data/pjh64/Phi-3-mini-128K-Instruct.gguf/phi-3-mini-128K-Instruct_q8_0.gguf --color --threads 30 --keep -1 --n-predict -1 --repeat-penalty 1.1 --ctx-size 60000 --interactive -ins -ngl 33 --simple-io --in-prefix "<|user|>\n" --in-suffix "<|end|>\n<|assistant|>" -p "<|system|>You are a helpful assistant.<|end|>\n "

main: interactive mode on.
Reverse prompt: '### Instruction:

'
Input prefix: '<|user|>\n'
Input suffix: '<|end|>\n<|assistant|>'
sampling: repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 60000, n_batch = 2048, n_predict = -1, n_keep = 12

== Running in interactive mode. ==

  • Press Ctrl+C to interject at any time.
  • Press Return to return control to LLaMa.
  • To return control without starting a new line, end your input with '/'.
  • If you want to submit another line, end your input with '\'.

<|system|>You are a helpful assistant.<|end|>\n

<|user|>\nHello. Tell me a story. <|end|>\n<|assistant|>Once upon a time, in the serene land of Veridia, there was an enchanted forest known as Luminae Woods. The woodland sparkled under moonlight with each tree shimmering like living stars.

In this mystical place lived Elara, a beautiful young maiden blessed with iridescent hair and eyes that mirrored the depth of the cosmos.

Elara had one unique trait - she could converse with nature itself. She conversed with trees whispering secrets in rustling leaves, birds humming songs only they could understand.

One fateful day, a dark cloud loomed over Veridia. A malicious sorcerer named Malachar desired to steal the magical essence of Veridia for his own nefarious purposes.

Upon hearing this news, Elara decided she wouldn't let her homeland fall into despair. With bravery coursing through her veins and courage in her heart, she embarked on a perilous quest to stop the wicked sorcerer.

With each passing day, Elara encountered numerous trials that tested her courage, wisdom, and resilience. She journeyed across treacherous terrains, braved wild beasts and outsmarted magical illusions crafted by Malachar himself.

As Elara ventured deeper into the darkness of Maleficent's lair, she came face-to-face with the sorcerer. A battle of magic unfolded - a clash between good and evil, light against dark.

Despite feeling overwhelmed by Malachar's mightier spells, Elara held on to her heart's purity, believing in herself and her mission for Veridia's peace.

In the end, it was Elara who prevailed. With a final surge of magic she wielded from within, she vanquished Malachar, breaking his dark curse over Veridia.

Afterwards, with peace restored to Veridia and its inhabitants living in harmony once more, Elara became the beloved guardian of Luminae Woods, continuing her duty as the voice of nature itself.

Thus ends a tale about courage, goodness, and the power that resides within us all. It's a timeless story of how one person can make an immense difference in preserving peace and harmony.

And so, dear listener, let this legend inspire you to face your own battles with bravery and integrity - for it is these virtues which truly define the worthiness of any individual or character.<|end|>

<|user|>\n

phalexo avatar Apr 25 '24 16:04 phalexo

@phalexo You should use -e as an argument too, so escape sequences like \n in the prompt, prefix, and suffix are processed instead of being passed through literally.

ryao avatar Apr 25 '24 19:04 ryao

Status: Phi-3 4K models are supported in master after #6857 merge

Phi-3 128K models aren't supported yet (as of 24th Apr 2024)

Hi, any update on the 128k support?

nullnuller avatar Apr 26 '24 07:04 nullnuller

Any update on 128K?

smartjx avatar Apr 26 '24 07:04 smartjx

Any update on 128K? :)

mirek190 avatar Apr 26 '24 07:04 mirek190

For 128K you can help by summarizing and providing references for what needs to be implemented

ggerganov avatar Apr 26 '24 08:04 ggerganov

For 128K you can help by summarizing and providing references for what needs to be implemented

I believe all that's needed is LongRoPE; that's the only distinguishing factor between the 4k and 128k context variants. Phi-3 technical report: https://arxiv.org/pdf/2404.14219

"We also introduce a long context version via LongRope [DZZ+24] that extends the context length to 128K, called phi-3-mini-128K."

LongRoPE paper [DZZ+24]: https://arxiv.org/pdf/2402.13753
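
For anyone picking this up, a rough sketch of the idea as I read it from the paper and the HF Phi-3 config - not the llama.cpp implementation. The short_factor/long_factor names follow the HF config and may not map one-to-one onto what llama.cpp needs:

# LongRoPE-style rotary embeddings, assuming per-dimension rescale factors from
# the model config (HF names: rope_scaling.short_factor / long_factor).
import math
import numpy as np

def longrope_cos_sin(positions, head_dim, base, factors, max_pos, original_max_pos):
    # Standard RoPE inverse frequencies, each divided by a factor found by
    # LongRoPE's search (one factor per pair of dimensions).
    idx = np.arange(0, head_dim, 2, dtype=np.float64)
    inv_freq = 1.0 / (np.asarray(factors) * base ** (idx / head_dim))
    # Extra attention scaling when the context is stretched beyond the original
    # training length (as in the HF "su"/longrope code path).
    scale = max_pos / original_max_pos
    mscale = 1.0 if scale <= 1.0 else math.sqrt(
        1.0 + math.log(scale) / math.log(original_max_pos))
    angles = np.outer(np.asarray(positions, dtype=np.float64), inv_freq)
    return np.cos(angles) * mscale, np.sin(angles) * mscale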

maxrubin629 avatar Apr 26 '24 08:04 maxrubin629

I found a model of Phi-3-128k that works: https://huggingface.co/MoMonir/Phi-3-mini-128k-instruct-GGUF

The downside is that it only works with a maximum of 64k tokens set in the Model Initialization; if set higher, it just fails to load. Here is the error:

{
  "title": "Failed to load model",
  "cause": "",
  "errorData": {
    "n_ctx": 131072,
    "n_batch": 512,
    "n_gpu_layers": 33
  },
  "data": {
    "memory": {
      "ram_capacity": "13.81 GB",
      "ram_unused": "9.30 GB"
    },
    "gpu": {
      "type": "AmdROCm",
      "vram_recommended_capacity": "6.99 GB",
      "vram_unused": "6.85 GB"
    },
    "os": {
      "platform": "win32",
      "version": "10.0.22631",
      "supports_avx2": true
    },
    "app": {
      "version": "0.2.20",
      "downloadsDir": "C:\\Users\\lorenzo\\.cache\\lm-studio\\models"
    },
    "model": {}
  }
}

It seems to work because the arch is set to llama:

{
  "name": "phi3",
  "arch": "llama",
  "quant": "Q8_0",
  "context_length": 131072,
  "embedding_length": 3072,
  "num_layers": 32,
  "rope": {
    "freq_base": 10000,
    "dimension_count": 96
  },
  "head_count": 32,
  "head_count_kv": 32,
  "parameters": "7B"
}

I've tried 4 different models that all had arch set to phi3, and none of them worked. I don't have a real solution, but this one is working for me.
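
One plausible contributor to the failed 131072-token load (not confirmed by the empty "cause" above): the KV cache alone grows linearly with n_ctx. A rough estimate from the metadata shown, assuming an unquantized f16 cache and all 32 KV heads:

# Back-of-envelope KV cache size for the metadata above (assumption: f16 cache).
n_layer, n_embd, n_ctx = 32, 3072, 131072
bytes_per_element = 2                                        # f16
kv_bytes = 2 * n_layer * n_ctx * n_embd * bytes_per_element  # K and V
print(f"{kv_bytes / 2**30:.0f} GiB")                         # roughly 48 GiB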

LorenzoBiassio avatar Apr 26 '24 20:04 LorenzoBiassio