
Adding workflows with the "classifier/router pattern" and shell scripts

Open maelp opened this issue 9 months ago • 50 comments

See https://www.anthropic.com/engineering/building-effective-agents

The idea would be that the "agent" mode (triggered when we start with "Hey") would transcribe the query, then send it through ChatGPT or another LLM for classification between tasks that are defined (using a prompt in the settings); then, depending on the classification, it would run another prompt and/or a shell script on the transcription

This could allow for custom workflows like "Hey, add this idea to my Obsidian todo", or "Hey, search google for 'How to clean terrazzo tiles'"

We could even embed a small n8n widget (https://n8n.io/workflows/) with a JavaScript engine to let the user define their own complex workflows

maelp avatar Mar 18 '25 09:03 maelp

Check this, for instance, on how to communicate between a webview (embedding n8n or something similar) and the native app: https://www.swiftwithvincent.com/blog/how-to-run-native-code-from-a-wkwebview
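
The gist of that approach (as I understand it) is a WKScriptMessageHandler bridge between the webview's JavaScript and native Swift. A minimal sketch; the "runWorkflow" message name and the bridge class are made up for illustration:

import WebKit

// Hypothetical bridge: JS inside the webview posts messages that native code receives.
final class WorkflowBridge: NSObject, WKScriptMessageHandler {
    func makeWebView() -> WKWebView {
        let configuration = WKWebViewConfiguration()
        // JS side would call: window.webkit.messageHandlers.runWorkflow.postMessage({...})
        configuration.userContentController.add(self, name: "runWorkflow")
        return WKWebView(frame: .zero, configuration: configuration)
    }

    func userContentController(_ userContentController: WKUserContentController,
                               didReceive message: WKScriptMessage) {
        guard message.name == "runWorkflow" else { return }
        print("Received from webview:", message.body)   // eg a workflow definition built in n8n
    }
}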

maelp avatar Mar 18 '25 09:03 maelp

A first "lighter" version (without including a workflow visualization and construction framework) would be to just define "tasks" (eg addObsidianTodo, searchGoogle) which have a few json parameters from the query, and query the LLM with "forced json output" to get the task name and parameters, then call a shell script with the JSON parameter for each task

maelp avatar Mar 18 '25 10:03 maelp

BTW it could be useful to add some more details on the Swift build:

git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp

# if needed, install required build tools, eg
brew install cmake

make build
./build-xcframework.sh

# Check that the build works
sh ./models/download-ggml-model.sh base.en
./build/bin/whisper-cli -f samples/jfk.wav

then you need to drag and drop the framework to the right location in Xcode; check the instructions at https://github.com/ggerganov/whisper.cpp/tree/master/examples/whisper.swiftui

maelp avatar Mar 18 '25 10:03 maelp

You should also drag and drop the framework (whisper.cpp/build-macos/framework) into the project settings,


and also drag and drop the .dylib files matching whisper.cpp/build/src/**.dylib into "Build Phases"


maelp avatar Mar 18 '25 10:03 maelp

(Of course we could go fancy and add an MCP server connection or something, but to stay user-friendly for those who don't have all this, perhaps a minimal settings pane would suffice)

  • add a workflow with "title", "prompt", "jsonSchema" (to restrict the output)
  • then possibly have an extra "system prompt" if needed to give more instructions to the classifier (in case there are subtle distinctions between the workflows that the LLM could be confused about)

From this generate a big prompt to feed to the LLM:

--- Task:
You are a classifier LLM. Your task is to read the transcript (provided at the end of the instructions in <transcript></transcript>) and return the id of the workflow to run for this task, along with its parameters
--- Additional user-provided infos: // <- only if systemPrompt is not empty
{{ systemPrompt }}
--- Description of workflows
- id: w1 // <- simple index based ids for each workflow, so the LLM can refer to it easily
- description:
{{ first workflow description }}
- expected output:
{{ first workflow JSON schema }}

// ...repeat this for all workflows
--- Output format
You will return the classification as a JSON object
{
  workflow_id: "...", // eg "w1"
  workflow_args: ..., // adhere to the corresponding workflow json schema
}
--- Transcription
<transcript>
{{ userTranscript }}
</transcript>
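
A rough sketch of how this prompt could be assembled from the user-defined workflows (the Workflow struct and its field names are illustrative, not an actual settings model):

struct Workflow {
    let title: String        // eg "searchGoogle"
    let description: String  // the user-written prompt describing the task
    let jsonSchema: String   // JSON Schema restricting the expected arguments
}

func buildClassifierPrompt(workflows: [Workflow], systemPrompt: String, transcript: String) -> String {
    var prompt = """
    --- Task:
    You are a classifier LLM. Your task is to read the transcript (provided at the end of the instructions in <transcript></transcript>) and return the id of the workflow to run, with its parameters.
    """
    if !systemPrompt.isEmpty {
        prompt += "\n--- Additional user-provided infos:\n" + systemPrompt
    }
    prompt += "\n--- Description of workflows"
    for (index, workflow) in workflows.enumerated() {
        prompt += "\n- id: w\(index + 1)"                      // simple index-based ids
        prompt += "\n- description:\n" + workflow.description
        prompt += "\n- expected output:\n" + workflow.jsonSchema
    }
    prompt += """

    --- Output format
    You will return the classification as a JSON object
    { "workflow_id": "...", "workflow_args": ... }
    --- Transcription
    <transcript>
    \(transcript)
    </transcript>
    """
    return prompt
}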

maelp avatar Mar 18 '25 11:03 maelp

I coded (with Claude, so it's dirty) a first implementation which seems to work somewhat (but I haven't written the code to plug it into actual shell scripts yet)


you can find it here: https://github.com/Beingpax/VoiceInk/pull/19

maelp avatar Mar 18 '25 11:03 maelp

(This implementation is VERY basic.) It would be MUCH better to have some way to constrain the output of the LLM to be JSON, and to do a lot of checks to make sure everything is good.

For now the idea is that when a workflow is detected, it will run the associated script with these environment variables set:

WORKFLOW_ARGS={{json serialization of workflow_args}}
WORKFLOW_ARG_{{key.toUpper()}}={{json serialization of workflow_args.key}} // for each key in the root workflow_args
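
On the app side, launching the script with those variables set could look roughly like this. This is a sketch using Foundation's Process; the helper name and its parameters are illustrative:

import Foundation

// Hypothetical helper: run a workflow's shell script with the classifier's args
// exported as environment variables, following the naming convention above.
func runWorkflowScript(at scriptPath: String, args: [String: String], argsJSON: String) throws {
    let process = Process()
    process.executableURL = URL(fileURLWithPath: "/bin/bash")
    process.arguments = [scriptPath]

    var environment = ProcessInfo.processInfo.environment
    environment["WORKFLOW_ARGS"] = argsJSON
    for (key, value) in args {
        environment["WORKFLOW_ARG_\(key.uppercased())"] = value
    }
    process.environment = environment

    try process.run()
    process.waitUntilExit()
}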

maelp avatar Mar 18 '25 11:03 maelp

Example script

#!/bin/bash
echo "Args: $WORKFLOW_ARGS"
echo "- engine: $WORKFLOW_ARG_ENGINE"
echo "- query: $WORKFLOW_ARG_QUERY"

# log the args to a temp file for debugging
cat > /tmp/args.txt <<EOS
Args: $WORKFLOW_ARGS
- engine: $WORKFLOW_ARG_ENGINE
- query: $WORKFLOW_ARG_QUERY
EOS

if [ "$WORKFLOW_ARG_ENGINE" = "google" ]; then
    echo "Searching Google for $WORKFLOW_ARG_QUERY"
    # search google
    open "https://www.google.com/search?q=$WORKFLOW_ARG_QUERY"
elif [ "$WORKFLOW_ARG_ENGINE" = "duckduckgo" ]; then
    echo "Searching DuckDuckGo for $WORKFLOW_ARG_QUERY"
    # search duckduckgo
    open "https://duckduckgo.com/html/?q=$WORKFLOW_ARG_QUERY"
else
    echo "Invalid engine: $WORKFLOW_ARG_ENGINE"
    exit 1
fi

maelp avatar Mar 18 '25 12:03 maelp

It works :)


maelp avatar Mar 18 '25 12:03 maelp

I guess actually we could even implement a way to connect to any MCP server, if we want to have something even more powerful?

maelp avatar Mar 18 '25 14:03 maelp

Hi @maelp

Thank you for your work on implementing this feature. I appreciate the effort and thought you've put into this feature.

While it's a really good and interesting concept that could be useful for some users, it doesn't align with the core direction I envision for VoiceInk at the moment (it might change in the future though 😁).

As mentioned in the contribution guidelines, we ask that contributors create an issue and discuss potential features before implementation. This ensures alignment and prevents situations like this.

You have a really good implementation, I encourage you to consider maintaining it as a separate fork.

Again, I appreciate your contribution and interest in improving VoiceInk.

I guess you already know how to remove the restrictions for the license verification, else I can assist you via e-mail on that. Mail me at [email protected]

Beingpax avatar Mar 18 '25 14:03 Beingpax

No worries! Would you mind telling me more about why it doesn't fit in the core direction?

Couldn't you otherwise envision a system of "plugins" so at least this could be added to the core without having to be maintained separately?

maelp avatar Mar 18 '25 14:03 maelp

Currently, there are a lot of improvements that need to be done with the dictation feature alone.

So, adding this would add more load for maintaining it separately. That is one of the reasons why I am a little hesitant right now.

Beingpax avatar Mar 18 '25 14:03 Beingpax

When you say plugin system, what do you mean? Can you explain a little?

Beingpax avatar Mar 18 '25 14:03 Beingpax

I totally understand, although I would be happy to help maintain it if you provide some guidance (I'm not a Swift expert)

I think the “basics” are almost already implemented, with not much to add, except some more robust error-handling

maelp avatar Mar 18 '25 14:03 maelp

When you say plugin system, what do you mean? Can you explain a little?

If you provided a simple "plugin system" that external people could develop for (perhaps using WASM or JavaScript for the code, and a simple Swift pane in the UI for settings), this would allow a community of people to create code to extend VoiceInk without having to go through the whole "fork and rebuild locally" process

so they could provide .wasm / .js plugins that could be drag-and-dropped into the UI somehow, and those would just provide a simple "configuration pane" and run some JS code after each query

perhaps that could be enough? but then again this would perhaps be some work for you

The "cool" thing about my little workflow implementation is that it's "almost already" a very simple plugin system: just write a shell script for anything you'd like to run, write a prompt to get the args, and you can have almost a "marketplace" of scripts to build your own Siri

in that way, your app becomes a bit like a mini "Alfred", except people create voice commands rather than command lines

maelp avatar Mar 18 '25 14:03 maelp

When you say plugin system, what do you mean? Can you explain a little?

Or even, without going so far as what I described, just having a code pattern like this one https://www.swiftbysundell.com/articles/making-swift-code-extensible-through-plugins/

would allow me to perhaps just write a small "extension framework" that I could easily compile alongside VoiceInk even after you make big code changes, and would make it easier to maintain community plugins
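
To make this concrete, a protocol-based extension point in the spirit of that article could be as small as the sketch below (the protocol and type names are made up for illustration, not something VoiceInk defines today):

// Hypothetical extension point: a plugin receives the transcript after each query
// and decides whether (and how) to handle it.
protocol VoiceInkPlugin {
    var id: String { get }
    // Returns true if the plugin handled the transcript, so default behavior can be skipped.
    func handle(transcript: String) -> Bool
}

final class PluginRegistry {
    private var plugins: [VoiceInkPlugin] = []

    func register(_ plugin: VoiceInkPlugin) {
        plugins.append(plugin)
    }

    // Called once per transcription; stops at the first plugin that claims the query.
    func dispatch(transcript: String) {
        for plugin in plugins where plugin.handle(transcript: transcript) {
            return
        }
    }
}

Community plugins would then only need to conform to VoiceInkPlugin, which is much easier to keep compiling across big code changes than maintaining a full fork.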

maelp avatar Mar 18 '25 15:03 maelp

Ok, I will look into this properly and update in the next releases.

Beingpax avatar Mar 18 '25 15:03 Beingpax

@maelp Can you make a short YouTube video about this for new users? Like a beginner-friendly tutorial, a 3-4 minute video talking about what this is, how it will work, and how it can be used.

Beingpax avatar Mar 18 '25 15:03 Beingpax

Definitely! I will tell you when it's done

maelp avatar Mar 18 '25 15:03 maelp

@Beingpax okay it's here, totally unedited but it shows what is possible :) https://www.youtube.com/watch?v=GGYOzsknk-k

maelp avatar Mar 18 '25 16:03 maelp

@Beingpax let me know if that video is what you had in mind?

maelp avatar Mar 18 '25 17:03 maelp

Would be cool to use as a plugin (could be manually added as a file or something like that, for now)!

marijnbent avatar Mar 18 '25 20:03 marijnbent

Honestly, it's not for everyone but it's interesting and promising.

If you guys can implement a plugin system that does not require a lot of core maintenance, then it would simply increase the general value of the app even more!

rmasata avatar Mar 18 '25 20:03 rmasata

okay it's here, totally unedited but it shows what is possible :) https://www.youtube.com/watch?v=GGYOzsknk-k

Really cool demo! This has so much potential, especially with how fast the Gemini model is (the local model is the bottleneck now it seems, right?)

Something that would be quite advanced to add, and should definitely be added later: Raycast is doing something similar with a great UI. It can be good inspiration on how to add this feature for less technical people:

Demo video:

https://www.youtube.com/watch?v=sHIlFKKaq0A

@maelp Or can you implement this 🤓?

While Raycast is closed source of course, the extensions are not. @Beingpax So if you are able to support the Raycast extensions, you immediately have access to hundreds of apps. Every extension works the same way. The package.json lists all the possible commands.

Obsidian example:

https://github.com/raycast/extensions/blob/a067019eb06e6427846df30dedbb913a0bf1a7da/extensions/obsidian/package.json

Now, the fun part is with the extensions that are AI enabled. The package.json contains an ai object with instructions and tools.

https://github.com/raycast/extensions/blob/a57cbbae7b0c5656a2c565a5519da55939e64a9f/extensions/notion/package.json

Some other well known apps that are AI enabled: Apple Notes, Siri, Messages, Google Calendar, Spotify, Github, iTerm, Reminders, Vercel, TablePlus, Safari, ...

And you could tie those into the workflow functionality 😄! I'm the first to admit this will be a really big effort, but this would make workflows really easy to configure and you would get access to so many services/apps without building the integrations yourself.

marijnbent avatar Mar 18 '25 21:03 marijnbent

@Beingpax let me know if that video is what you had in mind?

Yes @maelp. Thank You

Beingpax avatar Mar 19 '25 06:03 Beingpax

@marijnbent indeed! It shouldn't be very hard to implement this and be compatible with the actions, but it would require shipping nodejs (I guess) with VoiceInk in order to run/install the actions... so perhaps this would be better suited as a kind of "plugin"?

Even better than Raycast plugins, I think the community is going towards "MCP servers" for actions, so that you are not constrained by language

It would perhaps be even simpler: a setting in VoiceInk would allow you to define the MCP servers you want to connect to, and the AI would do all the function calling for you

maelp avatar Mar 19 '25 06:03 maelp

@marijnbent @Beingpax check for instance:

  • https://nshipster.com/model-context-protocol/ and https://github.com/loopwork-ai/mcp-swift-sdk (seems a really good implementation)
  • https://github.com/1amageek/swift-context-protocol
  • https://github.com/gsabran/mcp-swift-sdk

Could be nice to have both:

  • a server in the app (which would then "replace" my workflow system, which is mostly a "mini MCP server" if you want)
  • a client, which could connect to any other MCP server the user wants to install on their laptop; check eg https://mcp.so/

maelp avatar Mar 19 '25 07:03 maelp

@marijnbent the easiest way for you to use custom Raycast actions would be to use my plugin, and just create a single workflow called "send-action" which says "send the query to Raycast" and outputs just a JSON of the form { query: "string" }. The bash script would then send the raw query to Raycast for processing with its MCP server (I guess it's possible in their app through "internal url" links)

maelp avatar Mar 19 '25 07:03 maelp

@Beingpax to summarize the discussion:

  1. I think my first PR is a good starting point (however, it could be improved by adding Swift libs for JSON Schema validation, replacing the JSON input with an optional JSON schema used to parse the answer, and, for models that provide it, passing the JSON Schema to constrain the output of the result; see the sketch below)
  2. to integrate with Raycast, I think the easiest is what I suggested above: a single "handle with Raycast" workflow which just forwards the request to Raycast so that Raycast can use its actions. So you could say "Ask Raycast to check my email" and it would just run a handle-with-raycast action with the proper query and hand it off there
  3. the "long-term solution" would probably be to look at proper MCP client/server handling, with a small "in-process" MCP server where people can add their bash scripts (similar to what I did, but more robust) so it's easy and convenient for new users, and an MCP client which allows querying any other MCP server (on your laptop or on the web) for power-users who want this

1 and 2 are very easy to do from my PR, just a bit of cleaning up (tell me if I can help); 3 is more involved (but also much more powerful, turning VoiceInk into a full-fledged transcription + AI assistant)
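
For the validation part of point 1, even without a full JSON Schema library, a first pass could be as simple as checking the reply against the known workflows and their required argument keys. A sketch (the workflow ids and required keys are illustrative; a real version would plug in a proper JSON Schema validator):

import Foundation

// First-pass validation of the classifier's JSON reply: make sure it names a known
// workflow and that every required argument key for that workflow is present.
func validateReply(_ data: Data, knownWorkflows: [String: Set<String>]) -> (id: String, args: [String: Any])? {
    guard
        let object = (try? JSONSerialization.jsonObject(with: data)) as? [String: Any],
        let id = object["workflow_id"] as? String,
        let requiredKeys = knownWorkflows[id],
        let args = object["workflow_args"] as? [String: Any],
        requiredKeys.isSubset(of: Set(args.keys))
    else {
        return nil   // unknown workflow, malformed JSON, or missing arguments
    }
    return (id, args)
}

// Example: "w1" is the search workflow, which requires "engine" and "query"
let knownWorkflows = ["w1": Set(["engine", "query"])]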

maelp avatar Mar 19 '25 07:03 maelp