Adding workflows with the "classifier/router pattern" and shell scripts
See https://www.anthropic.com/engineering/building-effective-agents
The idea would be that the "agent" mode (triggered when we start with "Hey") would transcribe the query, then send it through ChatGPT or another LLM for classification among the tasks that are defined (using a prompt in the settings); depending on the classification, it would then run another prompt and/or a shell script on the transcription
This could allow for custom workflows like "Hey, add this idea to my Obsidian todo", or "Hey, search google for 'How to clean terrazzo tiles'"
We could even embed a small n8n widget (https://n8n.io/workflows/) with a JavaScript engine to let users define their own complex workflows
Check this for instance, on how to communicate between a webview (embedding n8n or something similar) and the native app: https://www.swiftwithvincent.com/blog/how-to-run-native-code-from-a-wkwebview
A first "lighter" version (without including a workflow visualization and construction framework) would be to just define "tasks" (eg addObsidianTodo, searchGoogle) which have a few json parameters from the query, and query the LLM with "forced json output" to get the task name and parameters, then call a shell script with the JSON parameter for each task
BTW, it could be useful to add some more details on the Swift build:
```bash
git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp
# if needed, install required build tools, e.g.
brew install cmake
make build
./build-xcframework.sh
# Check that the build works
sh ./models/download-ggml-model.sh base.en
./build/bin/whisper-cli -f samples/jfk.wav
```
Then you need to drag and drop the framework into the right location in Xcode; check the instructions at https://github.com/ggerganov/whisper.cpp/tree/master/examples/whisper.swiftui
You should also drag and drop the framework (from whisper.cpp/build-macos/framework) into the project settings,
and drag and drop the .dylib files from whisper.cpp/build/src/**.dylib into "Build Phases"
(Of course we could go fancy and add an MCP server connection or something, but to be user-friendly for those who don't have all this, perhaps a minimal settings pane would suffice:)
- add a workflow with a "title", a "prompt", and a "jsonSchema" (to restrict the output)
- optionally, an extra "system prompt" to give more instructions to the classifier (in case there are subtle distinctions between the workflows that could confuse the LLM)
From this, generate a big prompt to feed to the LLM:

```
--- Task:
You are a classifier LLM. Your task is to read the transcript (provided at the end of the instructions in <transcript></transcript>) and return the id of the workflow to run for the task, with its parameters.

--- Additional user-provided info:   // <- only if systemPrompt is not empty
{{ systemPrompt }}

--- Description of workflows
- id: w1   // <- simple index-based ids for each workflow, so the LLM can refer to it easily
- description:
{{ first workflow description }}
- expected output:
{{ first workflow JSON schema }}
// ...repeat this for all workflows

--- Output format
You will return the classification as a JSON object:
{
  "workflow_id": "...",   // eg "w1"
  "workflow_args": ...,   // adhere to the corresponding workflow JSON schema
}

--- Transcription
<transcript>
{{ userTranscript }}
</transcript>
```
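Roughly, assembling that prompt in Swift from the user-defined workflows could look like this (a sketch only; the struct and field names are made up):

```swift
struct Workflow {
    let title: String        // shown in the settings UI
    let description: String  // the user-provided prompt describing when to use this workflow
    let jsonSchema: String   // JSON schema restricting workflow_args
}

func classifierPrompt(workflows: [Workflow], systemPrompt: String, transcript: String) -> String {
    var prompt = """
    --- Task:
    You are a classifier LLM. Read the transcript (provided at the end in <transcript></transcript>) and return the id of the workflow to run, with its parameters.

    """

    if !systemPrompt.isEmpty {
        prompt += "--- Additional user-provided info:\n\(systemPrompt)\n\n"
    }

    prompt += "--- Description of workflows\n"
    for (index, workflow) in workflows.enumerated() {
        prompt += """
        - id: w\(index + 1)
        - description:
        \(workflow.description)
        - expected output:
        \(workflow.jsonSchema)

        """
    }

    prompt += """
    --- Output format
    You will return the classification as a JSON object:
    { "workflow_id": "...", "workflow_args": ... }

    --- Transcription
    <transcript>
    \(transcript)
    </transcript>
    """
    return prompt
}
```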
I coded (with Claude, so it's dirty) a first implementation which seems to work, more or less (but I haven't written the code to plug it into actual shell scripts yet).
You can find it here: https://github.com/Beingpax/VoiceInk/pull/19
(This implementation is VERY basic.) It would be MUCH better to have some way to constrain the output of the LLM to be JSON, and to do a lot of checks to make sure everything is valid.
For now the idea is that when you detect a workflow, it will run the associated script with these environment variables set:

```
WORKFLOW_ARGS={{ json serialization of workflow_args }}
WORKFLOW_ARG_{{ key.toUpper() }}={{ json serialization of workflow_args.key }}   # for each key at the root of workflow_args
```
Example script:

```bash
#!/bin/bash

echo "Args: $WORKFLOW_ARGS"
echo "- engine: $WORKFLOW_ARG_ENGINE"
echo "- query: $WORKFLOW_ARG_QUERY"

# Write the args to a file for debugging
cat > /tmp/args.txt <<EOS
Args: $WORKFLOW_ARGS
- engine: $WORKFLOW_ARG_ENGINE
- query: $WORKFLOW_ARG_QUERY
EOS

if [ "$WORKFLOW_ARG_ENGINE" = "google" ]; then
  echo "Searching Google for $WORKFLOW_ARG_QUERY"
  # search google
  open "https://www.google.com/search?q=$WORKFLOW_ARG_QUERY"
elif [ "$WORKFLOW_ARG_ENGINE" = "duckduckgo" ]; then
  echo "Searching DuckDuckGo for $WORKFLOW_ARG_QUERY"
  # search duckduckgo
  open "https://duckduckgo.com/html/?q=$WORKFLOW_ARG_QUERY"
else
  echo "Invalid engine: $WORKFLOW_ARG_ENGINE"
  exit 1
fi
```
It works :)
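For reference, here's roughly what the app side could look like (a sketch with hypothetical names; it deviates slightly from the literal spec above by exporting plain string values unquoted, so that the shell comparisons in the example script behave as expected):

```swift
import Foundation

// Sketch: run a workflow's script with the WORKFLOW_* environment variables described above.
func runWorkflowScript(at scriptURL: URL, workflowArgs: [String: Any]) throws {
    var environment = ProcessInfo.processInfo.environment

    let argsData = try JSONSerialization.data(withJSONObject: workflowArgs)
    environment["WORKFLOW_ARGS"] = String(data: argsData, encoding: .utf8)

    for (key, value) in workflowArgs {
        let name = "WORKFLOW_ARG_\(key.uppercased())"
        if let string = value as? String {
            // Export plain strings without surrounding quotes so that
            // [ "$WORKFLOW_ARG_ENGINE" = "google" ] works in the script.
            environment[name] = string
        } else {
            let data = try JSONSerialization.data(withJSONObject: value, options: [.fragmentsAllowed])
            environment[name] = String(data: data, encoding: .utf8)
        }
    }

    let process = Process()
    process.executableURL = URL(fileURLWithPath: "/bin/bash")
    process.arguments = [scriptURL.path]
    process.environment = environment
    try process.run()
    process.waitUntilExit()
}
```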
I guess actually we could even implement a way to connect to any MCP server, if we want to have something even more powerful?
Hi @maelp
Thank you for your work on implementing this feature. I appreciate the effort and thought you've put into it.
While it's a really good and interesting concept that could be useful for some users, it doesn't align with the core direction I envision for VoiceInk at the moment (it might change in the future though 😁).
As mentioned in the contribution guidelines, we ask that contributors create an issue and discuss potential features before implementation. This ensures alignment and prevents situations like this.
You have a really good implementation, I encourage you to consider maintaining it as a separate fork.
Again, I appreciate your contribution and interest in improving VoiceInk.
I guess you already know how to remove the restrictions for the license verification, else I can assist you via e-mail on that. Mail me at [email protected]
No worries! Would you mind telling me more about why it doesn't fit the core direction?
Couldn't you otherwise envision a system of "plugins" so at least this could be added to the core without having to be maintained separately?
Currently, there are a lot of improvements that need to be done on the dictation feature alone.
So adding this would add more load to maintain separately. That is one of the reasons why I am a little hesitant right now.
When you say plugin system, what do you mean? Can you explain a little?
I totally understand, although I would be happy to help maintain it if you provide some guidance (I'm not a Swift expert)
I think the "basics" are almost already implemented, with not much left to add except some more robust error handling
> When you say plugin system, what do you mean? Can you explain a little?
If you allowed external people to develop a simple "plugin system" (perhaps using WASM or JavaScript for the code, and a simple Swift pane in the UI for settings), this would allow a community of people to extend VoiceInk without having to go through the whole "fork and rebuild locally" process.
They could provide .wasm / .js plugins that could be drag-and-dropped into the UI somehow; each plugin would just provide a simple "configuration pane" and run some JS code after each query.
Perhaps that could be enough? But then again, this would be some work for you.
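As a sketch of the JS route (purely hypothetical naming, using Apple's JavaScriptCore), running a .js plugin after each query could be as simple as:

```swift
import JavaScriptCore

// Load a user-provided .js plugin and call a conventionally-named function
// (here `handleTranscript`, a made-up convention) after each transcription.
func runJSPlugin(at url: URL, transcript: String) throws -> String? {
    let source = try String(contentsOf: url, encoding: .utf8)
    guard let context = JSContext() else { return nil }
    context.exceptionHandler = { _, exception in
        print("Plugin error: \(exception?.toString() ?? "unknown")")
    }
    context.evaluateScript(source)
    let handler = context.objectForKeyedSubscript("handleTranscript")
    return handler?.call(withArguments: [transcript])?.toString()
}
```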
The "cool" thing about my little workflow implementation is that it's "almost already" a very simple plugin system: just write a shell script for anything you'd like to run, and write a prompt to get the args, and you can have almost a "marketplace" of script to build your own Siri
in that way, your app become a bit like a mini "Alfred", except people create voice commands rather than command lines
> When you say plugin system, what do you mean? Can you explain a little?
Or even, without going as far as what I described: just having a code pattern like this one, https://www.swiftbysundell.com/articles/making-swift-code-extensible-through-plugins/,
would allow me to write a small "extension framework" that I could easily compile alongside VoiceInk even after you make big code changes, and it would make it easier to maintain community plugins.
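Concretely, something in the spirit of that article could be as small as this (a sketch, not actual VoiceInk code; names are made up):

```swift
// A plugin gets a chance to post-process (or react to) each transcription.
protocol TranscriptionPlugin {
    var id: String { get }
    /// Called after each transcription; return replacement text, or nil to leave it unchanged.
    func didTranscribe(_ text: String) -> String?
}

final class PluginRegistry {
    private(set) var plugins: [TranscriptionPlugin] = []

    func register(_ plugin: TranscriptionPlugin) {
        plugins.append(plugin)
    }

    // Run every registered plugin in order over the transcribed text.
    func process(_ text: String) -> String {
        plugins.reduce(text) { current, plugin in
            plugin.didTranscribe(current) ?? current
        }
    }
}
```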
Ok, I will look into this properly and update in the next releases.
@maelp Can you make a short YouTube video about this for new users? Like a beginner-friendly tutorial, a 3-4 minute video talking about what this is, how it will work, and how it can be used.
Definitely! I will tell you when it's done
@Beingpax okay it's here, totally unedited but it shows what is possible :) https://www.youtube.com/watch?v=GGYOzsknk-k
@Beingpax let me know if that video is what you had in mind?
Would be cool to use as a plugin (it could just be added manually as a file or something like that, for now)!
Honestly, it's not for everyone but it's interesting and promising.
If you guys can implement a plugin system that does not require a lot of core maintenance, then it would simply increase the general value of the app even more!
> okay it's here, totally unedited but it shows what is possible :) https://www.youtube.com/watch?v=GGYOzsknk-k
Really cool demo! This has so much potential, especially with how fast the Gemini model is (the local model is the bottleneck now it seems, right?)
Something that will be quite advanced to add and should definitely come later: Raycast is doing something similar with a great UI; it can be good inspiration for how to bring this feature to less technical people:
Demo video:
https://www.youtube.com/watch?v=sHIlFKKaq0A
@maelp Or can you implement this 🤓?
While Raycast is closed source of course, the extensions are not. @Beingpax So if you are able to support the Raycast extensions, you immediately have access to hundreds of apps. Every extension works the same way. The package.json lists all the possible commands.
Obsidian example:
https://github.com/raycast/extensions/blob/a067019eb06e6427846df30dedbb913a0bf1a7da/extensions/obsidian/package.json
Now, the fun part is with the extensions that are AI enabled. The package.json contains an ai object with instructions and tools.
https://github.com/raycast/extensions/blob/a57cbbae7b0c5656a2c565a5519da55939e64a9f/extensions/notion/package.json
Some other well known apps that are AI enabled: Apple Notes, Siri, Messages, Google Calendar, Spotify, Github, iTerm, Reminders, Vercel, TablePlus, Safari, ...
And those could be tied into the workflow functionality 😄! I'm the first to admit this will be a really big effort, but it would make workflows really easy to configure, and you would get access to so many services/apps without building the integrations yourself.
@marijnbent indeed! It shouldn't be very hard to implement this and be compatible with the actions, but it would require shipping Node.js (I guess) with VoiceInk in order to run/install the actions... so perhaps this would be better suited as a kind of "plugin"?
Even better than Raycast plugins, I think the community is going towards "MCP servers" for actions, so that you are not constrained by language
It would perhaps even be simpler: a setting in VoiceInk would allow you to define the MCP servers you want to connect to, and the AI would do all the function calling for you
@marijnbent @Beingpax check for instance:
- https://nshipster.com/model-context-protocol/ and https://github.com/loopwork-ai/mcp-swift-sdk (seems a really good implementation)
- https://github.com/1amageek/swift-context-protocol
- https://github.com/gsabran/mcp-swift-sdk
Could be nice to have both:
- a server in the app (which would then "replace" my workflow implementation, which is basically already a "mini MCP server" if you want)
- a client, which could connect to any other MCP server the user wants to install on their laptop; see e.g. https://mcp.so/
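On the client side, the settings pane might only need to store something like this per server (a purely hypothetical sketch, no real SDK usage; it just mirrors the command/args/env convention other MCP clients use to launch local servers):

```swift
import Foundation

// Hypothetical settings model for user-defined MCP servers.
struct MCPServerConfiguration: Codable, Identifiable {
    var id = UUID()
    var name: String                          // e.g. "obsidian"
    var command: String                       // e.g. "npx", or a path to a local binary
    var arguments: [String] = []              // e.g. ["-y", "some-mcp-server"]
    var environment: [String: String] = [:]   // API keys, etc.
    var isEnabled: Bool = true
}
```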
@marijnbent the easiest way for you to use custom Raycast actions would be to use my plugin: just create a single plugin called "send-action" which says "send the query to Raycast" and outputs a JSON of the form { query: "string" }; the bash script would then just send the raw query to Raycast for processing with its MCP server (I guess that's possible in their app through "internal URL" links)
@Beingpax to summarize the discussion:
1. I think my first PR is a good starting point (however, it can be improved by adding Swift libs for JSON Schema validation, replacing the JSON input with an optional JSON schema used to parse the answer, and, for models that support it, passing that JSON schema to constrain the output of the result)
2. to integrate with Raycast, I think the easiest approach is what I suggested above: a single "handle with Raycast" workflow which just forwards the request to Raycast so that Raycast can use its actions. You could say "Ask Raycast to check my email" and it would just run a handle-with-raycast action with the proper query and hand it over
3. the "long-term solution" would probably be to look at proper MCP client/server handling, with a small "in-process" MCP server where people can add their bash scripts (similar to what I did, but more robust) so it's easy and convenient for new users, and an MCP client which can query any other MCP server (on your laptop or on the web) for power users who want this
Points 1 and 2 are very easy to do from my PR, just a bit of cleaning up (tell me if I can help); point 3 is more involved (but also much more powerful, turning VoiceInk into a full-fledged transcription + AI assistant)
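Regarding point 1, here's roughly what "constraining the output" could look like against an OpenAI-style chat completions endpoint that supports a JSON-schema response_format (a sketch only; field names follow OpenAI's structured-output format as I understand it, and other providers differ):

```swift
import Foundation

// Ask the model to return JSON matching a schema for { workflow_id, workflow_args }.
func classificationRequest(prompt: String, apiKey: String) -> URLRequest {
    let responseSchema: [String: Any] = [
        "type": "object",
        "properties": [
            "workflow_id": ["type": "string"],
            "workflow_args": ["type": "object"]  // ideally the selected workflow's own schema
        ],
        "required": ["workflow_id", "workflow_args"]
    ]

    let body: [String: Any] = [
        "model": "gpt-4o-mini",  // whichever model the user configured in the settings
        "messages": [["role": "user", "content": prompt]],
        "response_format": [
            "type": "json_schema",
            "json_schema": ["name": "workflow_classification", "schema": responseSchema]
        ]
    ]

    var request = URLRequest(url: URL(string: "https://api.openai.com/v1/chat/completions")!)
    request.httpMethod = "POST"
    request.setValue("Bearer \(apiKey)", forHTTPHeaderField: "Authorization")
    request.setValue("application/json", forHTTPHeaderField: "Content-Type")
    request.httpBody = try? JSONSerialization.data(withJSONObject: body)
    return request
}
```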