
Results: 63 issues of Prince Canuma

The roadmap covers both approaches you mentioned:
- End-to-End Speech-to-Speech Models: a direct approach using dedicated STS architectures like Moshi.
- Modular Voice Pipeline: a composable approach combining Speech-to-Text, LLM...
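The modular pipeline can be sketched as three pluggable stages wired in sequence; the class and stage names below are illustrative, not an actual API from the roadmap:

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stage signatures for the modular voice pipeline:
# audio bytes -> transcript -> LLM reply -> synthesized audio bytes.
@dataclass
class VoicePipeline:
    stt: Callable[[bytes], str]   # Speech-to-Text
    llm: Callable[[str], str]     # text-in, text-out language model
    tts: Callable[[str], bytes]   # Text-to-Speech

    def run(self, audio: bytes) -> bytes:
        transcript = self.stt(audio)
        reply = self.llm(transcript)
        return self.tts(reply)

# Toy stand-ins to show the data flow through the three stages.
pipeline = VoicePipeline(
    stt=lambda audio: audio.decode(),
    llm=lambda text: text.upper(),
    tts=lambda text: text.encode(),
)
print(pipeline.run(b"hello"))  # b'HELLO'
```

Each stage is swappable independently, which is the main advantage of the modular approach over an end-to-end STS model.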

Should we invert the real-time factor? If so, why? The source states that values lower than 1 are better and values above 1 are worse. Inverting it will cause rtf to...
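The distinction can be shown with a small sketch: RTF is processing time divided by audio duration (lower is better), while the inverted value reads as an "Nx real time" speedup (higher is better). Function names here are illustrative:

```python
def rtf(processing_time_s: float, audio_duration_s: float) -> float:
    """Real-time factor: processing time divided by audio duration.
    Values below 1.0 mean faster than real time (better)."""
    return processing_time_s / audio_duration_s

def speed_factor(processing_time_s: float, audio_duration_s: float) -> float:
    """Inverted RTF, read as "Nx real time": higher is better."""
    return audio_duration_s / processing_time_s

# 10 s of audio transcribed in 2 s:
print(rtf(2.0, 10.0))           # 0.2 -> lower is better
print(speed_factor(2.0, 10.0))  # 5.0 -> "5x real time", higher is better
```

Inverting flips the direction of the metric, so whichever convention is chosen, the reported numbers need a consistent "lower/higher is better" label.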

https://huggingface.co/stabilityai/stable-audio-open-1.0

This PR adds NVIDIA's Parakeet.
- [ ] Add streaming support
- [ ] Refactor STT API (common functions)
- [ ] Add model harness with WER score and tests...

We need to refactor our Speech-to-Text (STT) API to extract common functions, eliminate code duplication, and create a more consistent interface across different components.

## Requirements
- Identify and extract...
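One common way to get a consistent interface across STT backends is an abstract base class that each model implements; the names below are a hypothetical sketch, not the repository's actual API:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional

# Hypothetical shared result type so all backends return the same shape.
@dataclass
class TranscriptionResult:
    text: str
    language: Optional[str] = None

class STTModel(ABC):
    """Common base class: every STT backend (Whisper-style, Parakeet-style,
    etc.) implements the same transcribe() contract."""

    @abstractmethod
    def transcribe(self, audio) -> TranscriptionResult: ...

class EchoModel(STTModel):
    # Toy implementation, just to demonstrate the contract.
    def transcribe(self, audio) -> TranscriptionResult:
        return TranscriptionResult(text=str(audio))

print(EchoModel().transcribe("hi").text)  # hi
```

With a shared base class, duplicated per-model helpers (audio loading, batching, decoding) can move into the base or into free functions used by all implementations.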

## Description
We need to implement a model harness for evaluating Speech-to-Text (STT) models that calculates Word Error Rate (WER) as the primary performance metric, along with comprehensive tests.

##...
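WER is the word-level Levenshtein distance between reference and hypothesis, normalized by the reference length. A minimal self-contained sketch (production harnesses would typically use a library such as jiwer):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions)
    divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # DP table: d[i][j] = edit distance between ref[:i] and hyp[:j].
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat", "the cat sat"))  # 0.0
print(wer("the cat sat", "the bat sat"))  # one substitution out of 3 words
```

The harness would run this over a labeled dataset and aggregate, which also gives the tests an exact, deterministic value to assert against.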

1. Sliding-window (token-wise) trimming

Keep only the most-recent N tokens in each layer's key/value tensors.

```python
MAX_CACHE_TOKENS = 256  # ≃ 12 s of speech for tiny/small

def trim_past_kv(past_kv, keep=MAX_CACHE_TOKENS):
    """past_kv...
```
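A runnable sketch of the truncated helper above, assuming `past_kv` is a list of per-layer `(key, value)` arrays laid out as `[batch, heads, seq_len, head_dim]` (the layout is an assumption, and NumPy stands in for the actual tensor library):

```python
import numpy as np

MAX_CACHE_TOKENS = 256  # assumption: roughly 12 s of speech for tiny/small

def trim_past_kv(past_kv, keep=MAX_CACHE_TOKENS):
    """Keep only the most recent `keep` tokens in each layer's KV pair.

    Assumes past_kv is a list of (key, value) arrays with shape
    [batch, heads, seq_len, head_dim]; slicing the seq_len axis drops
    the oldest cached tokens.
    """
    return [(k[:, :, -keep:, :], v[:, :, -keep:, :]) for k, v in past_kv]

# 1 layer: batch 1, 4 heads, 300 cached tokens, head_dim 64.
kv = [(np.zeros((1, 4, 300, 64)), np.zeros((1, 4, 300, 64)))]
trimmed = trim_past_kv(kv)
print(trimmed[0][0].shape)  # (1, 4, 256, 64)
```

Negative slicing is a no-op when the cache is already shorter than `keep`, so the helper is safe to call on every decoding step.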

Add first-class support for real-time transcription, consisting of:
1. Audio I/O utilities (load_audio, load_audio_chunk)
2. Streaming / buffer management (OnlineASRProcessor or equivalent)
3. Voice-Activity Detection integration (VACOnlineASRProcessor)

The goal is...
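The buffer-management piece (item 2) can be sketched as a minimal stand-in for an OnlineASRProcessor: accumulate raw audio chunks and release fixed-size windows for transcription. The window size and sampling rate here are assumptions for illustration:

```python
import numpy as np

SAMPLE_RATE = 16_000   # assumption: model sampling rate
CHUNK_SECONDS = 1.0    # assumption: transcription window size

class OnlineBuffer:
    """Minimal sketch of streaming buffer management: callers insert()
    chunks as they arrive and pop_window() full windows when ready."""

    def __init__(self, sample_rate=SAMPLE_RATE, chunk_seconds=CHUNK_SECONDS):
        self.window = int(sample_rate * chunk_seconds)
        self.buffer = np.empty(0, dtype=np.float32)

    def insert(self, chunk: np.ndarray) -> None:
        self.buffer = np.concatenate([self.buffer, chunk.astype(np.float32)])

    def pop_window(self):
        """Return one full window of samples if available, else None."""
        if len(self.buffer) < self.window:
            return None
        out, self.buffer = self.buffer[:self.window], self.buffer[self.window:]
        return out

buf = OnlineBuffer()
buf.insert(np.zeros(12_000))   # 0.75 s buffered: not enough yet
assert buf.pop_window() is None
buf.insert(np.zeros(8_000))    # now 20 000 samples buffered
print(buf.pop_window().shape)  # (16000,)
```

A VAD-aware variant (item 3) would gate `insert` on detected speech so silence never fills the buffer.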

Most models now have a custom sampling rate, so we are deprecating the sampling-rate args in favor of hard-coded values provided by the creators. You can access the sampling rate of any model...
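A hedged sketch of what the change means for callers; the `config.sample_rate` attribute path is hypothetical, not the project's confirmed API:

```python
# Hypothetical shapes, illustrating the deprecation only.
class ModelConfig:
    def __init__(self, sample_rate: int):
        self.sample_rate = sample_rate  # hard-coded by the model creators

class STTModel:
    def __init__(self, config: ModelConfig):
        self.config = config

model = STTModel(ModelConfig(sample_rate=24_000))

# Before: callers passed a sampling-rate arg themselves (now deprecated).
# After: read the creator-provided value from the loaded model.
print(model.config.sample_rate)  # 24000
```

This removes a foot-gun: resampling to a user-supplied rate that differs from the one the model was trained on silently degrades transcription quality.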