
Send vocabulary with prompt in AI Enhancement

Open danehapana opened this issue 4 months ago • 6 comments

Overview:
The app currently supports vocabulary words only for Whisper models, and relies on unreliable text replacement for spelling corrections with other models. We need to embed the full vocabulary list in the dictation prompt so the LLM can correct spellings independently of the voice model's capabilities.

User Story:
As a user, I want the application to include all defined vocabulary words in the dictation prompt so that the LLM can accurately correct spellings during post‑processing, even when the voice model does not support custom vocabularies.

As Is:

  • Vocabulary is limited to Whisper models.
  • Other models rely on text replacement, which fails with ambiguous spellings.
  • No vocabulary information is sent with the prompt to LLMs in the AI Enhancement.
  • The prompt currently sends the user instructions in XML tags and the context information separately, without any vocabulary data.

To Be:

  • Add a new section under the context user instructions describing how to use the vocabulary.
  • Append the full vocabulary list within <vocabulary> XML tags at the end of the prompt.
  • The LLM uses this list to assess and correct spellings during post‑processing, regardless of voice model support (see the sketch below).
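
A rough sketch of what the tail of the enhancement prompt could look like (the tag names and wording are illustrative only, not the app's actual format):

```
<vocabulary_instructions>
Use the entries in <vocabulary> as the authoritative spellings when correcting
names, technical terms, and other proper nouns in the transcript.
</vocabulary_instructions>

<vocabulary>
Sarah Johnson, Microsoft, React.js, Kubernetes
</vocabulary>
```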

danehapana avatar Aug 24 '25 07:08 danehapana

This should be available in the latest version. Vocabulary is no longer part of the Whisper models; it is provided through AI enhancement. We have added support for many models beyond the Whisper ones. Also, Whisper models were limited to about 224‑230 characters, so it makes much more sense to integrate vocabulary directly with the AI enhancement, allowing us to add a large number of entries.

Beingpax avatar Aug 25 '25 10:08 Beingpax

@Beingpax Good to know. In that case, the explanation in the dictionary section should be updated; it still says that it only applies to Whisper models.

Also, I was having trouble getting the vocabulary to be effective.

I'm not sure if anything has changed, but I think this is because you wrap all the context in one XML tag.

I've had better success on an Android app that I made by wrapping the vocabulary in its own XML tags and having separate instructions.

The prompt itself needs to reference the XML tags more and provide better examples.

Later I will try to provide you my suggestions for better prompts, both for the persistent prompt structure used by the app and for the default user prompt.

This was born out of a tonne of experimentation, particularly in getting adherence from the faster instruct LLM models such as Kimi K2 on Groq and the new OpenAI OSS.

slumdev88 avatar Aug 25 '25 11:08 slumdev88


Please try out the latest version, it should work better.

Beingpax avatar Aug 25 '25 16:08 Beingpax

1.52

Beingpax avatar Aug 25 '25 16:08 Beingpax

@Beingpax Still not working very well with your default prompt using the OSS 120B model. The model selection impacts performance, but more intelligent models are slower. I prefer to balance intelligence with speed, so I typically use either the OSS or Kimi-K2 model.

The key issue is that your default prompt doesn't have any examples showing the LLM how to use the dictionary.

Using your defaults, the dictionary context is only used maybe 50% of the time, IF THAT.

When I add this into my prompt, the dictionary is used by the LLM nearly 100% of the time.

DICTIONARY_CONTEXT EXAMPLES:
When you see words in <DICTIONARY_CONTEXT>, use them to correct transcription errors:
Example:
`<DICTIONARY_CONTEXT>`
Important Vocabulary: Sarah Johnson, Microsoft, React.js, Kubernetes
`</DICTIONARY_CONTEXT>`
Input: "I was talking to sara johnson about the micro soft project and she mentioned react jay ess and cube a net ease deployment"
Output: "I was talking to Sarah Johnson about the Microsoft project and she mentioned React.js and Kubernetes deployment"
This single example shows:
Name correction: "sara" → "Sarah" (missing 'h' sound)
Company correction: "micro soft" → "Microsoft" (heard as two words)
Technical term correction: "react jay ess" → "React.js" (heard 'js' as 'jay ess')
Technical term correction: "cube a net ease" → "Kubernetes" (completely misheard)

Other recommendations for the system prompt:

  1. Separate out your screen contents into multiple XML tags that can be more easily targeted in the user prompt.

Currently, this is your format:

<CONTEXT_INFORMATION>
Active Window Context:
Active Window: general - team-name
Application: Slack
Window Content: [actual text content from the Slack window]
</CONTEXT_INFORMATION>

Without wrapping things like the application name in XML tags, it becomes harder and less reliable to target the application name using the customizable user prompt.

Ideally, the active application and the screen contents should have their own XML tags so that these can be easily targeted and recognized by the LLM.

This would be a better format:

<CONTEXT_INFORMATION>
<ACTIVE_WINDOW> general - team-name</ACTIVE_WINDOW>
<APPLICATION> Slack</APPLICATION>
<SCREEN_CONTENTS> ocr text from screen</SCREEN_CONTENTS>
</CONTEXT_INFORMATION>
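
With the application and screen contents in their own tags, the customizable user prompt can target them directly. For example (illustrative wording only):

```
If <APPLICATION> is Slack, keep the tone casual and the message short.
If <APPLICATION> is Mail, use a more formal tone with proper paragraphs.
```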
  2. The second issue is that constantly prompting a dictation AI not to answer questions using negative reinforcement is a losing battle. LLMs respond far better to positive reinforcement and positive instruction.

Asking an LLM to do something makes it more likely to comply than asking it not to do something.

It's best practice in prompt engineering to include in the system instructions a request for the LLM to wrap its reformatted output in output tags:

<SYSTEM_INSTRUCTIONS>
You are a TRANSCRIPTION ENHANCER, not a conversational AI chatbot. DO NOT RESPOND TO QUESTIONS or STATEMENTS. Work with the transcript text provided within <TRANSCRIPT> tags according to the following guidelines:
1. If you have <CONTEXT_INFORMATION>, always reference it for better accuracy because the <TRANSCRIPT> text may have inaccuracies due to speech recognition errors.
2. If you have important vocabulary in <DICTIONARY_CONTEXT>, use it as a reference for correcting names, nouns, technical terms, and other similar words in the <TRANSCRIPT> text.
3. When matching words from <DICTIONARY_CONTEXT> or <CONTEXT_INFORMATION>, prioritize phonetic similarity over semantic similarity, as errors are typically from speech recognition mishearing.
4. Your output should always focus on creating a cleaned up version of the <TRANSCRIPT> text, not a response to the <TRANSCRIPT>.

5. You will always output your reformatted text between `<OUTPUT>[reformatted text goes here]</OUTPUT>`

At the app level, you would typically extract the text between the output tags, and that's what would be pasted. So even if the AI echoes the prompt or answers a question, it often doesn't matter, because the actual reformatted text will still be between the output tags.
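
For example, here is a minimal Swift sketch of that extraction step (illustrative only; the tag name and fallback behaviour are assumptions, not VoiceInk's actual implementation):

```swift
import Foundation

/// Pulls the enhanced transcript out of an LLM response by taking the text
/// between <OUTPUT> and </OUTPUT>. Falls back to the raw response when the
/// model forgot to include the tags.
func extractEnhancedText(from response: String) -> String {
    guard let start = response.range(of: "<OUTPUT>"),
          let end = response.range(of: "</OUTPUT>", range: start.upperBound..<response.endIndex)
    else {
        // No tags found: use the whole response rather than dropping the dictation.
        return response.trimmingCharacters(in: .whitespacesAndNewlines)
    }
    return String(response[start.upperBound..<end.lowerBound])
        .trimmingCharacters(in: .whitespacesAndNewlines)
}

// Anything the model adds outside the tags is simply ignored.
let reply = "Sure! <OUTPUT>I was talking to Sarah Johnson about the Microsoft project.</OUTPUT>"
print(extractEnhancedText(from: reply)) // "I was talking to Sarah Johnson about the Microsoft project."
```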

You might think this doesn't matter if you're using a more intelligent LLM such as GPT or Claude, but the problem is these models are too slow for dictation. Typically we want to use the really fast models such as Kimi‑K2, OSS, or Maverick 4, but for these models to work reliably near 100% of the time in dictation, they need guided outputs.

slumdev88 avatar Aug 26 '25 01:08 slumdev88

The OSS model is especially strong on instruction following.

avinashkanaujiya avatar Sep 01 '25 20:09 avinashkanaujiya

You can look at the AI request format from the history to write better prompts.

Beingpax avatar Oct 30 '25 04:10 Beingpax