text-generation-webui
Continuous whisper
Description
A toggle which, if ticked, would transcribe what has been said once the user stops talking. Then, after the transcribed audio has been automatically submitted, it would begin listening again.
Additional Context
Ideally, when combined with TTS, this would allow for a continuous audio conversation.
Using the whisper_stt extension.
It would be great if I did not have to scroll down and push the record and stop buttons when I want to talk.
It would be preferable if the record/stop UI element could be right at the top with the other text options.
Ideally there would be no need to use the keyboard at all.
I'm working on some JavaScript to make it seamless, though I'm not the greatest coder. At the very least this should help a little bit.
Change the audio.change() function in /extensions/whisper_stt/script.py to
```python
audio.change(
    auto_transcribe,
    [audio, auto_submit],
    [shared.gradio['textbox'], audio]
).then(
    None, auto_submit, None,
    _js="(check) => { if (check) { document.getElementById('Generate').click(); setTimeout(function() { document.getElementsByClassName('record-icon svelte-1thnwz')[0].click(); }, 1000) } }"
)
```
It will save you one click by making it start recording again after the submit.
This JavaScript should work in Chrome; I haven't tested other browsers. In combination with the edit to the script.py file, it gives a continuous verbal conversation with no keyboard input. I'd love to say that it works perfectly, but after a few minutes of working properly it stops... or my instance of textgen crashes from using too many extensions, I don't know.
Copy this and put it in your console on the textgen page. How to package this better, I don't know; my attempts at messing with the Python were failing me.
```javascript
// get the microphone input stream
navigator.mediaDevices.getUserMedia({ audio: true })
  .then(stream => {
    const audioContext = new AudioContext();
    const source = audioContext.createMediaStreamSource(stream);
    const analyzer = audioContext.createAnalyser();
    analyzer.fftSize = 2048;
    source.connect(analyzer);

    const bufferLength = analyzer.frequencyBinCount;
    const dataArray = new Uint8Array(bufferLength);

    const MIN_DB = -70.0;               // adjust this to set the minimum dB level
    const MAX_DB = -0.0;                // adjust this to set the maximum dB level
    const AVERAGE_WINDOW_SIZE = 5000;   // adjust this to set the window size (ms) for the average calculation
    const ACTIVE_FRAMES_THRESHOLD = 80; // adjust this to set the number of consecutive active frames needed
    const AVERAGE_THRESHOLD_DB = -10.0; // adjust this to set the dB threshold for average noise

    let activeFrames = 0;
    let alertShown = false;
    let averageDb = null;
    let averageCount = 0;
    let averageSum = 0;
    let thresholdDb = MIN_DB;

    setInterval(() => {
      analyzer.getByteFrequencyData(dataArray);
      const rms = Math.sqrt(
        dataArray.reduce((acc, value) => acc + value * value, 0) / bufferLength
      );
      if (rms > 0) {
        const db = 20 * Math.log10(rms);
        const normalizedDb = (db - MIN_DB) / (MAX_DB - MIN_DB);

        // calculate the average microphone level over a certain window size
        averageCount++;
        averageSum += normalizedDb;
        if (averageCount === Math.floor(AVERAGE_WINDOW_SIZE / 100)) {
          if (averageCount > 0) {
            averageDb = averageSum / averageCount;
            thresholdDb = Math.max(averageDb * (MAX_DB - MIN_DB) + MIN_DB, AVERAGE_THRESHOLD_DB);
            console.log('Average microphone level:', averageDb);
          }
          averageCount = 0;
          averageSum = 0;
        }

        // check if the dB level is above the threshold
        const isActive = normalizedDb >= (thresholdDb - MIN_DB) / (MAX_DB - MIN_DB);
        if (isActive) {
          activeFrames++;
        } else {
          activeFrames = Math.max(activeFrames - 1, 0);
        }

        // determine if the microphone is active
        const isSpeaking = activeFrames > ACTIVE_FRAMES_THRESHOLD;
        console.log('Microphone is', isSpeaking ? 'active' : 'inactive');

        // when the microphone goes inactive, click the record/stop button once
        if (!isSpeaking && !alertShown && activeFrames > 0) {
          alertShown = true;
          document.getElementsByClassName('record-icon svelte-1thnwz')[0].click();
        } else if (isSpeaking) {
          alertShown = false;
        }
      }
    }, 100);
  })
  .catch(error => {
    console.error('Error accessing microphone:', error);
  });
```
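Regarding the "how to package this better" question above: one option might be the web UI's extension hook for injecting JavaScript. As far as I can tell, an extension's script.py can define a custom_js() function whose returned string is injected when the page loads. A rough sketch, assuming the console snippet above is saved as mic_monitor.js next to the extension's script.py (the filename is made up):

```python
# Hypothetical addition to extensions/whisper_stt/script.py.
# Assumes the console snippet above has been saved as mic_monitor.js in the same folder.
from pathlib import Path

def custom_js():
    """Return JavaScript for the web UI to inject when the page loads."""
    return (Path(__file__).parent / "mic_monitor.js").read_text()
```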
This code works much, much better. You still need to edit the script.py in the whisper_stt extension folder to click the button after submitting, though.
```javascript
// check if the browser supports the Web Speech API
if ('webkitSpeechRecognition' in window) {
  // create a new instance of the speech recognition object
  const recognition = new webkitSpeechRecognition();

  // set the properties for the recognition object
  recognition.continuous = true;
  recognition.interimResults = false;

  // when the user starts speaking
  recognition.onstart = () => {
    console.log('Speech recognition started');
  };

  // when the user stops speaking
  recognition.onresult = (event) => {
    // get the last transcript
    const lastTranscript = event.results[event.results.length - 1][0].transcript;
    console.log('Last transcript: ' + lastTranscript);
    // click the record button so the whisper_stt recording stops and gets submitted
    document.getElementsByClassName('record-icon svelte-1thnwz')[0].click();
  };

  // when an error occurs
  recognition.onerror = (event) => {
    console.error('Speech recognition error:', event.error);
    // restart the recognition process
    recognition.stop();
    recognition.start();
  };

  // when the recognition process ends
  recognition.onend = () => {
    console.log('Speech recognition ended');
    // restart the recognition process
    recognition.start();
  };

  // start the recognition process
  recognition.start();
} else {
  console.log('Web Speech API is not supported');
}
```
Not sure how to gracefully add this to the code, though, as it will constantly try to press the record button each time you stop speaking once you run it in your console. The whisper_stt auto-submit checkbox lacks an id and shares class names with all the other checkboxes, so it's not worth finding it by array index, since the order could differ between people's setups.
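One possible workaround, sketched below, is to find the checkbox by the text of its label rather than by class name or array position. The label fragment used here is only a guess at how the whisper_stt auto-submit checkbox is captioned, so treat it as an assumption and adjust it to whatever your UI actually shows:

```javascript
// Sketch: locate a Gradio checkbox by (part of) its visible label text.
// NOTE: the label fragment below is an assumption -- check the actual caption in your whisper_stt UI.
function findCheckboxByLabel(labelFragment) {
  for (const label of document.querySelectorAll('label')) {
    if (label.textContent.toLowerCase().includes(labelFragment.toLowerCase())) {
      return label.querySelector('input[type="checkbox"]');
    }
  }
  return null;
}

const autoSubmitCheckbox = findCheckboxByLabel('transcribed audio automatically');
if (autoSubmitCheckbox) {
  console.log('Auto-submit checkbox is', autoSubmitCheckbox.checked ? 'ticked' : 'unticked');
} else {
  console.log('Could not find the auto-submit checkbox; adjust the label fragment.');
}
```

This avoids depending on the order of the checkboxes, which, as noted above, can differ between setups.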
I came here to post my interest in a hands-free solution. Hopefully something like this gets incorporated so we can autosubmit after a defined length of silence or on a command phrase.
This issue has been closed due to inactivity for 6 weeks. If you believe it is still relevant, please leave a comment below. You can tag a developer in your comment.
I am also here saying wtf lets goooo
I have no idea what I'm doing, but if it helps, this is the advice I got from ChatGPT-4 after some research prompts.
Again, keep in mind I have no idea what I am talking about, lol. What a time to be alive with AI.
Starting such a project involves several components, each of which can be quite complex. Below is a high-level guide to help you get started.
1. Define Requirements and Design System Architecture: Determine the specific functionalities you want to include. Decide on the languages and frameworks you'll use for the server and client sides. Plan out the flow of data from the client's microphone to the server and back to the speaker.
2. Set Up Development Environment: Install the necessary languages (Python, JavaScript, etc.). Set up a version control system like Git. Decide on an IDE or code editor (e.g., VS Code).
3. Create a Basic Web Interface: Use HTML/CSS/JavaScript to create a simple interface with a record button. Implement WebRTC to capture microphone input in real time (you may need to look into libraries like RecordRTC or the MediaStream Recording API).
4. Establish a Backend Service: Use Flask or FastAPI to create a basic API. Write endpoints that can receive audio streams and return text or audio responses.
5. Integrate Voice Activity Detection (VAD): Integrate py-webrtcvad or similar in your backend to detect when someone is speaking (a sketch follows this list).
6. Integrate Speech-to-Text (STT): Choose an STT service like Whisper, and integrate it to transcribe speech to text.
7. Implement Activation Keyword Detection: Use a tool like pocketsphinx for keyword spotting to initiate the recording.
8. Connect to Text Generation API: Take the transcribed text and send it to your text generation API (like OpenAI's GPT).
9. Integrate Text-to-Speech (TTS): Choose a TTS tool and implement it to convert the generated text responses into speech.
10. Real-time Communication: Implement WebSockets for real-time communication between your web interface and the server.
11. Testing: Test each component individually and then test the system as a whole. Ensure that you handle errors and edge cases.
12. Security and Privacy: Implement SSL to secure your API. Ensure you comply with data protection regulations.
13. Deployment: Deploy your application on a server or cloud platform. Set up domain names and ensure your API can handle the expected load.
14. Documentation and Maintenance: Document your API endpoints and the overall system architecture. Set up logging and monitoring for your application.
15. Iteration and Feedback: Gather user feedback. Iterate on the product based on feedback and performance.

Each step involves learning and implementation challenges, so start with what you know and gradually tackle each part.
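To make step 5 a bit more concrete, here is a minimal sketch of what the py-webrtcvad part could look like. It assumes 16 kHz, 16-bit mono PCM audio and a hypothetical input.wav file; the frame size and the "submit after ~1 second of silence" rule are illustrative choices, not anything the extension prescribes:

```python
# Minimal VAD sketch (assumptions: 16 kHz, 16-bit mono PCM; "input.wav" is hypothetical)
import wave

import webrtcvad

SAMPLE_RATE = 16000                              # webrtcvad supports 8000/16000/32000/48000 Hz
FRAME_MS = 30                                    # webrtcvad accepts 10, 20 or 30 ms frames
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2 # 16-bit samples -> 2 bytes each
SILENCE_FRAMES_TO_SUBMIT = 1000 // FRAME_MS      # roughly 1 second of silence

vad = webrtcvad.Vad(2)                           # aggressiveness: 0 (least) to 3 (most)

with wave.open("input.wav", "rb") as wav:
    pcm = wav.readframes(wav.getnframes())

silent_frames = 0
for offset in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES):
    frame = pcm[offset:offset + FRAME_BYTES]
    if vad.is_speech(frame, SAMPLE_RATE):
        silent_frames = 0
    else:
        silent_frames += 1
        if silent_frames == SILENCE_FRAMES_TO_SUBMIT:
            print("~1 s of silence detected -> this is where you would auto-submit")
```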
For example, you could begin by setting up a basic Flask server and a simple HTML page that captures audio. Then you'd move on to sending this audio to the Flask server, and so on, building up complexity as you go.
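As a rough illustration of that starting point, here is a sketch of a tiny Flask endpoint that accepts an uploaded audio file and returns a Whisper transcription. The route name, the "base" model choice, and the use of the openai-whisper package are assumptions made for the example, not anything text-generation-webui ships with:

```python
# Sketch of a minimal transcription endpoint (route name and model size are arbitrary choices)
import tempfile

import whisper                        # pip install openai-whisper
from flask import Flask, jsonify, request

app = Flask(__name__)
model = whisper.load_model("base")    # assumption: the small "base" model is enough for a demo

@app.route("/transcribe", methods=["POST"])
def transcribe():
    # expects a multipart form upload with an "audio" file field
    uploaded = request.files.get("audio")
    if uploaded is None:
        return jsonify({"error": "no audio file provided"}), 400

    # Whisper's transcribe() takes a file path, so save the upload to a temp file first
    with tempfile.NamedTemporaryFile(suffix=".wav") as tmp:
        uploaded.save(tmp.name)
        result = model.transcribe(tmp.name)

    return jsonify({"text": result["text"]})

if __name__ == "__main__":
    app.run(port=5000)
```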
Would you like to go into detail on any of these steps, or do you need information on resources or tutorials for any of the technologies mentioned?
I would like to have this feature as well.
Yes
So I spent about 4 hours trying to modify the "Record from microphone" button toggle behavior in the whisper_stt extension, which apparently has limitations because of how Gradio works.
I even tried to have it break the audio into segments after a 4-second pause detected in real time after pressing the button once, and then upload each segment to the chat and send it automatically, but it didn't work. So GPT-4 finally gave an idea that might work, below, but I don't know for sure.
Its idea made me think to try Dragon NaturallySpeaking, which actually worked. I talk, it fills in the chat box and auto-sends it. I set up a custom command in Dragon to just press Enter when I say the word "type".
I'm trying to figure out a way to do this without having to say "type" at all, like having Dragon detect 4 seconds of silence after speaking and then press Enter or something. I feel like I'm getting closer. It's not too bad, though; there's just a Dragon bar at the top and you just arm the mic.
Here was the original GPT-4 idea, which I thought could be interesting: somehow getting past the Gradio "Record from microphone" button toggle restrictions by setting Whisper up to be used globally. But it could also be hallucinating, and a bad idea. It was really frustrating when it would forget the goal during the long back and forth I had with it, and would make code changes I never requested or leave stuff out, but this was still somewhat useful:
If the submission to the 'Send a message' box is already working and you wish to integrate it with ui_chat.py, considering earlier discussions on using Whisper globally, here's a potential approach:

Global Whisper Instance: You could initiate a global instance of the Whisper model that's always listening or activated on command. This would involve creating a separate thread that handles audio input and calls the Whisper model.

Integration with ui_chat.py: To integrate this global Whisper instance into ui_chat.py, you would need a way to pass the transcribed text from the global Whisper thread to the Gradio interface. One approach is to use shared state variables or queues that both the Whisper thread and the Gradio interface can access.

Auto-fill and Submit: Once the Whisper model transcribes the speech, you can have the shared state updated with the transcribed text. Then, the Gradio interface can poll this shared state, auto-fill the 'Send a message' input box with the transcribed text, and simulate the submission (a button click or an Enter keypress event).

Here is a pseudocode outline based on the structure of your ui_chat.py and Whisper script:

```python
# ... existing imports and code ...

# You might need to import additional threading or queueing modules
import threading
from queue import Queue

# Initialize a queue to hold transcribed messages
transcription_queue = Queue()

# Define a global listening function
def global_whisper_listener():
    # Set up Whisper model and start listening
    # Process audio and transcribe
    # Put the transcribed text into the queue
    pass

# Start the global Whisper listener in a separate thread
listener_thread = threading.Thread(target=global_whisper_listener)
listener_thread.start()

# ... existing ui_chat.py code ...

# Inside the Gradio interface event loop or a periodic update function
def update_chat_input():
    if not transcription_queue.empty():
        # Get the latest transcribed text
        transcribed_text = transcription_queue.get()
        # Update the Gradio chat input state with the transcribed text
        shared.gradio['textbox'].update(value=transcribed_text)
        # Here you would also need JavaScript to submit the text
        # You can trigger JS from Python in Gradio using the _js parameter

# ... existing ui_chat.py code ...

# You would need to call update_chat_input() periodically,
# or integrate it into Gradio's event handling if it supports periodic callbacks.
```

In this pseudocode, the global_whisper_listener function would handle the continuous audio input and transcription using Whisper. It runs in a separate thread so as not to block the main thread running the Gradio UI. Once it transcribes speech to text, it places the text into transcription_queue, which the Gradio interface polls.

You'd integrate update_chat_input() into the Gradio interface event loop or as a periodic callback if Gradio supports such a feature. This function checks whether there's new transcribed text in the queue and, if so, updates the chat input and simulates the submission.

This approach would avoid direct interactions with the 'Record from microphone' button and instead directly fill the text into the chat input box. The specific details of the threading and queueing, as well as the exact way to trigger JavaScript from Python within Gradio, would depend on Gradio's capabilities and how it's set up in your application. It might require reading through Gradio's documentation or source code, or testing in your development environment.
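To give a feel for what the pass body of global_whisper_listener() could become, here is a sketch of one way to fill it in, using the sounddevice and openai-whisper packages and fixed 5-second recording chunks. Everything here (the packages, the chunk length, the "base" model, the silence cutoff) is an assumption for illustration, not something the extension or the outline above prescribes:

```python
# Hypothetical fill-in for global_whisper_listener() (sounddevice + openai-whisper are assumptions)
from queue import Queue

import numpy as np
import sounddevice as sd
import whisper

transcription_queue = Queue()   # mirrors the queue from the outline above

SAMPLE_RATE = 16000             # Whisper models expect 16 kHz audio
CHUNK_SECONDS = 5               # arbitrary: record in fixed 5-second chunks for simplicity

def global_whisper_listener():
    model = whisper.load_model("base")   # assumption: small model for a responsive demo
    while True:
        # record a chunk of mono float32 audio from the default microphone
        audio = sd.rec(int(CHUNK_SECONDS * SAMPLE_RATE),
                       samplerate=SAMPLE_RATE, channels=1, dtype="float32")
        sd.wait()                         # block until the chunk is fully recorded
        audio = audio.flatten()

        # skip chunks that are essentially silence
        if np.abs(audio).max() < 0.01:
            continue

        # transcribe the chunk and hand the text to the UI thread via the queue
        result = model.transcribe(audio, fp16=False)
        text = result["text"].strip()
        if text:
            transcription_queue.put(text)
```

A real implementation would replace the fixed chunks with proper VAD-based segmentation (as in the py-webrtcvad sketch earlier in the thread), but this is enough to see how transcribed text ends up in transcription_queue for the Gradio side to pick up.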