Add Text-to-Speech as Accessibility Feature 🎯

Open: tmm22 opened this issue 1 month ago • 1 comment

Overview

This PR adds a comprehensive Text-to-Speech (TTS) workspace to VoiceInk, positioned as a disability accessibility tool for users who need high-quality speech synthesis. This directly addresses the significant limitations of Apple's built-in TTS system.

⚡ Important: This is a purely additive feature. Zero changes to existing functionality. We're not modifying, replacing, or removing anything, just adding a powerful new dimension that makes VoiceInk a complete accessibility suite.

Why This Makes VoiceInk a "Slam Dunk" App 🏀

VoiceInk already excels at Speech-to-Text (transcription). Now, with Text-to-Speech, it becomes a complete bidirectional communication tool:

The Perfect Combination

Speech → Text (What VoiceInk already does brilliantly)
    ↓
  ✨ NEW ✨
    ↓
Text → Speech (What this PR adds)

This creates an unmatched accessibility suite:

  1. Voice to Text → Transcribe meetings, dictate notes, capture thoughts
  2. Text to Voice → Listen to documents, hear your writing, consume content hands-free
  3. Round-Trip Workflow → Record audio, transcribe it, edit the text, then listen back with premium voices

No other app combines these capabilities at this quality level. This positions VoiceInk as THE go-to accessibility tool for users who need both transcription AND speech synthesis.

The Problem: Apple's TTS Limitations

Many users with visual impairments, reading disabilities (dyslexia), learning differences, or other accessibility needs rely on text-to-speech as an essential daily tool. Unfortunately, Apple's built-in TTS has critical issues:

  1. Inconsistent Voice Quality 🎭

    • Some voices sound robotic and unnatural
    • Quality varies dramatically between languages
    • Many voices are uncomfortable for extended listening
  2. Limited Expressiveness 📢

    • Flat, monotone intonation
    • No emotional context or emphasis
    • Makes long-form content exhausting to consume
  3. No User Control 🎚️

    • Can't fine-tune voice characteristics
    • No adjustments for speed, tone, or style
    • One-size-fits-all approach doesn't work for accessibility
  4. Language Support Issues 🌍

    • Poor quality for many non-English voices
    • Limited dialect support
    • Inconsistent pronunciation

The Solution: Premium Voice Integration

By integrating ElevenLabs and OpenAI's TTS, VoiceInk now provides:

  • ✅ Natural-sounding voices comfortable for hours of listening
  • ✅ Expressive speech with proper intonation and emotional context
  • ✅ Customization options for voice style, speed, and delivery
  • ✅ Consistent quality across all content types and languages
  • ✅ Fallback to macOS voices when no API key is provided

This makes VoiceInk a perfect combination: powerful transcription + high-quality speech synthesis, all in one accessibility-focused application.


Features Added

🎯 Core Functionality

  • Multiple Provider Support: ElevenLabs, OpenAI, Google Cloud TTS, + built-in macOS
  • Simple 3-Step Workflow:
    1. Select a voice
    2. Enter or paste text
    3. Click Generate
  • Voice Preview System: Listen to samples before generating
  • Batch Generation: Process multiple segments separated by ---
  • Audio Export: Save as .m4a, .mp3, or .wav
  • Playback Controls: Speed (0.5×-2×), looping, timeline scrubbing
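
The batch-generation item above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the function name `splitBatchSegments` is hypothetical, and it assumes a separator only counts when `---` sits on a line by itself.

```swift
import Foundation

/// Splits batch input into segments on lines that contain only `---`.
/// Assumption: an inline "---" inside a sentence is NOT treated as a separator.
/// Empty segments (for example, two separators in a row) are dropped.
func splitBatchSegments(_ text: String) -> [String] {
    var segments: [String] = []
    var currentLines: [String] = []

    // Joins the buffered lines into one segment and resets the buffer.
    func flush() {
        let segment = currentLines
            .joined(separator: "\n")
            .trimmingCharacters(in: .whitespacesAndNewlines)
        if !segment.isEmpty {
            segments.append(segment)
        }
        currentLines.removeAll()
    }

    for line in text.components(separatedBy: .newlines) {
        if line.trimmingCharacters(in: .whitespaces) == "---" {
            flush()
        } else {
            currentLines.append(line)
        }
    }
    flush()
    return segments
}
```

Each returned segment could then be sent to the selected provider as its own generation request.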

♿ Accessibility Features

  • No Feature Gates: Available to ALL users immediately
  • Fallback Mode: Works with macOS voices if no API keys configured
  • Clear Cost Display: Transparent pricing for cloud providers
  • Generation History: Replay previous outputs
  • Text Snippets: Save and reuse commonly spoken phrases
  • Pronunciation Glossary: Custom rules for names/technical terms
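
As a rough illustration of how a pronunciation glossary might work, the sketch below rewrites matching terms before the text is sent to a voice. The `PronunciationRule` type and `applyGlossary` function are hypothetical names, and whole-word, case-insensitive matching is an assumption rather than the PR's documented behavior.

```swift
import Foundation

/// Hypothetical rule type: speak `term` using the phonetic spelling `spokenAs`.
struct PronunciationRule {
    let term: String
    let spokenAs: String
}

/// Applies each rule in order as a case-insensitive, whole-word replacement.
/// Note: `\b` only matches next to letters, digits, or underscores, so terms
/// that start or end with punctuation (such as "C++") would need other handling.
func applyGlossary(_ rules: [PronunciationRule], to text: String) -> String {
    var result = text
    for rule in rules {
        let pattern = "\\b" + NSRegularExpression.escapedPattern(for: rule.term) + "\\b"
        guard let regex = try? NSRegularExpression(pattern: pattern,
                                                   options: [.caseInsensitive]) else {
            continue // Skip a rule whose pattern cannot be compiled.
        }
        let fullRange = NSRange(result.startIndex..., in: result)
        result = regex.stringByReplacingMatches(
            in: result,
            options: [],
            range: fullRange,
            withTemplate: NSRegularExpression.escapedTemplate(for: rule.spokenAs)
        )
    }
    return result
}
```

Whole-word matching keeps a rule for "SQL" from also rewriting the tail of "MySQL".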

🛠️ Advanced Capabilities

  • Translation Support: Convert text to 50+ languages before speaking
  • URL Import: Extract and speak content from web articles
  • Smart Article Processing: AI-powered cleanup for readable content
  • Transcript Generation: Export subtitles (SRT/VTT) with timestamps
  • Voice Style Controls: Adjust emotion, stability, and clarity (ElevenLabs)
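
For the subtitle export item above, a minimal sketch of SRT output is shown here. The `Cue` type and both function names are hypothetical, and how the PR derives cue timings is not described in this issue. SRT timestamps use the form `HH:MM:SS,mmm`, and each cue is a sequence number, a timing line, and the text.

```swift
import Foundation

/// Formats a time offset in seconds as an SRT timestamp (HH:MM:SS,mmm).
/// Negative inputs are clamped to zero.
func srtTimestamp(_ seconds: Double) -> String {
    let totalMilliseconds = Int((max(0, seconds) * 1000).rounded())
    let milliseconds = totalMilliseconds % 1000
    let secs = (totalMilliseconds / 1000) % 60
    let minutes = (totalMilliseconds / 60_000) % 60
    let hours = totalMilliseconds / 3_600_000
    return String(format: "%02d:%02d:%02d,%03d", hours, minutes, secs, milliseconds)
}

/// Hypothetical cue type: one subtitle entry with start and end offsets in seconds.
struct Cue {
    let start: Double
    let end: Double
    let text: String
}

/// Renders cues as SRT blocks, numbered from 1 and separated by a blank line.
func makeSRT(_ cues: [Cue]) -> String {
    let blocks = cues.enumerated().map { index, cue in
        "\(index + 1)\n\(srtTimestamp(cue.start)) --> \(srtTimestamp(cue.end))\n\(cue.text)\n"
    }
    return blocks.joined(separator: "\n")
}
```

WebVTT output would be similar, but it begins with a `WEBVTT` header line and uses a period instead of a comma before the milliseconds.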

Technical Implementation

File Structure

VoiceInk/TTS/
├── Models/          (13 files) - Data structures
├── Services/        (17 files) - Provider integrations
├── Utilities/       (11 files) - Helper functions
├── ViewModels/      (1 file)  - Main TTS logic
└── Views/           (13 files) - UI components

Total: 55 new Swift files, 8,592 lines of code

Key Technical Features

  • ✅ Secure Storage: API keys stored in macOS Keychain
  • ✅ Modular Architecture: Each provider is a separate, testable service
  • ✅ Cost Estimation: Real-time calculation with character limits
  • ✅ AVFoundation Integration: Native audio playback and export
  • ✅ UI Consistency: Matches VoiceInk's design system (6px corners, CardBackground)
  • ✅ Memory Efficient: Streams large audio files
  • ✅ Error Handling: Comprehensive error messages and recovery
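
To make the cost-estimation item concrete, here is a small sketch of a per-character estimate with a request limit. All names are hypothetical, the rate and limit in the usage note are placeholders rather than any provider's real pricing, and counting with `String.count` is an assumption, since a provider may count characters differently.

```swift
import Foundation

/// Hypothetical pricing entry. Real rates and limits vary by provider and plan.
struct ProviderPricing {
    let costPerThousandCharacters: Decimal
    let maxCharactersPerRequest: Int
}

/// Result of an estimate: either a cost, or a notice that the text is too long.
enum CostEstimate: Equatable {
    case withinLimit(characters: Int, cost: Decimal)
    case overLimit(characters: Int, limit: Int)
}

/// Estimates the cost of synthesizing `text`, checking the character limit first.
/// Assumption: characters are counted as Swift `Character`s (grapheme clusters).
func estimateCost(for text: String, pricing: ProviderPricing) -> CostEstimate {
    let characterCount = text.count
    guard characterCount <= pricing.maxCharactersPerRequest else {
        return .overLimit(characters: characterCount,
                          limit: pricing.maxCharactersPerRequest)
    }
    let cost = Decimal(characterCount) / 1000 * pricing.costPerThousandCharacters
    return .withinLimit(characters: characterCount, cost: cost)
}
```

With a placeholder rate of 0.30 per thousand characters and a 5,000-character limit, a 1,000-character input would be estimated at 0.30, and a 5,001-character input would be reported as over the limit. `Decimal` is used instead of `Double` to avoid binary floating-point rounding in displayed prices.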

Integration Points

  • ContentView: Added "Text to Speech" tab (no feature gate)
  • SettingsView: New "Text-to-Speech" section for API configuration
  • Navigation: Fully integrated with app's notification system
  • Styling: Consistent with existing CardBackground and StyleConstants

Why This Matters

For Users with Disabilities

This feature transforms VoiceInk into a comprehensive accessibility suite:

  1. Speech-to-Text (Whisper) → Type with your voice
  2. Text-to-Speech (ElevenLabs/OpenAI) → Listen to any text
  3. AI Enhancement → Process and refine content
  4. Power Modes → Context-aware automation

Users can now:

  • Transcribe spoken notes, then listen back with natural voices
  • Import web articles, then have them read aloud expressively
  • Write emails, then hear them before sending
  • Learn from documents by listening instead of reading

For the VoiceInk Community

  • 🎯 Differentiator: Unique feature combination not found elsewhere
    • Whisper + ElevenLabs in ONE app? That's a slam dunk.
  • ♿ Accessibility Focus: Positions VoiceInk as THE inclusive, disability-friendly tool
    • Not just "supports accessibility", but BUILT for accessibility
  • 💪 Complete Solution: Solves the full communication loop
    • No more switching between apps
  • 🚀 User Retention: More reasons to keep VoiceInk as daily driver
    • Users stay because it does EVERYTHING they need
  • 💰 Growth Potential: Optional cloud providers (free start with macOS voices)
    • Free tier gets users hooked, premium features drive revenue
  • 🏆 Market Position: From "great transcription tool" to "complete accessibility suite"
    • This is the kind of feature that gets featured on accessibility blogs

This doesn't just improve VoiceInk; it transforms it into a category leader.


Usage Example

Basic Workflow

1. Open VoiceInk → "Text to Speech" tab
2. Select provider: "Mac OS" (free) or "ElevenLabs" (API key)
3. Choose voice from dropdown
4. Paste text or click "Add Content" → "URL Import"
5. Click "Generate" (⌘↵)
6. Audio plays automatically → Export if needed
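
The provider choice in step 2, combined with the macOS fallback described earlier, could look something like the sketch below. The protocol and type names are illustrative and are not the PR's actual types. A plain dictionary stands in for the Keychain lookup so the example stays self-contained.

```swift
import Foundation

/// Hypothetical provider abstraction; each provider is a separate value.
protocol SpeechProvider {
    var name: String { get }
    var requiresAPIKey: Bool { get }
}

/// Built-in voices: free and usable without any configuration.
struct MacOSProvider: SpeechProvider {
    let name = "macOS"
    let requiresAPIKey = false
}

/// Any cloud provider (for example ElevenLabs or OpenAI) that needs a key.
struct CloudProvider: SpeechProvider {
    let name: String
    let requiresAPIKey = true
}

/// Returns the preferred provider when it is usable, otherwise the macOS fallback.
/// `storedKeys` maps provider names to API keys; in a real app this lookup
/// would read from the Keychain instead of a dictionary.
func resolveProvider(preferred: SpeechProvider,
                     storedKeys: [String: String]) -> SpeechProvider {
    if !preferred.requiresAPIKey {
        return preferred
    }
    if let key = storedKeys[preferred.name], !key.isEmpty {
        return preferred
    }
    return MacOSProvider()
}
```

Under these assumptions, selecting a cloud provider without a stored key resolves to the built-in voices, so generation still works with no setup.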

Advanced Workflow

1. Configure API keys: Settings β†’ Text-to-Speech
2. Create text snippets: Save common phrases
3. Add pronunciation rules: Custom names/terms
4. Use batch mode: Separate segments with ---
5. Adjust voice style: Emotion, stability (ElevenLabs)
6. Export with timestamps: SRT/VTT subtitles

Testing Checklist

  • [x] TTS workspace renders correctly
  • [x] Settings section appears in main settings
  • [x] Navigation from overflow menu works
  • [x] Mac OS (built-in) provider functions without API keys
  • [x] API key storage/retrieval from Keychain
  • [x] Voice selection and preview
  • [x] Text generation with cost estimation
  • [x] Audio playback controls (play/pause/speed/loop)
  • [x] Export functionality (.m4a, .mp3, .wav)
  • [x] UI consistency with rest of app (6px corners)
  • [x] No feature gates (available to all users)

Impact on Existing Features

✅ Zero Breaking Changes

This PR does not touch any existing VoiceInk functionality:

  • ❌ No modifications to transcription engine or Whisper integration
  • ❌ No changes to AI Enhancement or existing workflows
  • ❌ No alterations to Power Modes, shortcuts, or settings (except adding TTS section)
  • ❌ No removal of any existing features or capabilities
  • ❌ No performance impact on existing transcription workflows

✅ Purely Additive

What we're adding:

  • ➕ New sidebar tab: "Text to Speech" (sits alongside existing tabs)
  • ➕ New settings section: "Text-to-Speech" (separate from other settings)
  • ➕ New TTS directory: 55 self-contained files under VoiceInk/TTS/
  • ➕ New navigation route: Doesn't interfere with existing routing

Think of it as a plugin: Completely modular, self-contained functionality that enhances VoiceInk without touching its core. If you never click the "Text to Speech" tab, your VoiceInk experience is unchanged.


Why This Makes VoiceInk Unstoppable 🚀

Before This PR

VoiceInk was already an excellent transcription tool:

  • Best-in-class local Whisper integration
  • Power Modes for context-aware workflows
  • AI enhancement for transcripts
  • Strong accessibility focus

After This PR

VoiceInk becomes a complete accessibility ecosystem:

User Need                              | VoiceInk Solution
"I need to write without typing"       | ✅ Speech-to-Text transcription
"I need to listen instead of reading"  | ✅ Text-to-Speech synthesis
"I need both in one place"             | ✅ Perfect combination
"I need it to work offline"            | ✅ Local Whisper + macOS voices
"I need premium quality voices"        | ✅ ElevenLabs/OpenAI integration
"I need it to be affordable"           | ✅ Free tier with macOS, optional cloud

This is what users mean when they call something a "slam dunk app": It solves the complete problem, not just half of it.


Real-World Use Cases Enabled

For Users with Visual Impairments

  • Before: Use system TTS with poor voice quality
  • After: Use VoiceInk's premium voices for natural listening
  • Round-trip: Dictate notes → Edit transcript → Listen back

For Users with Dyslexia

  • Before: Struggle to read long documents
  • After: Import web article → Generate high-quality audio → Listen comfortably
  • Bonus: Adjust speed, replay sections, export for later

For Students with Learning Disabilities

  • Before: Can write but struggle to proofread by reading
  • After: Transcribe assignment → Hear it read back → Catch errors naturally
  • Workflow: Write by voice → Edit visually → Review by ear

For Everyone

  • Commute: Transcribe articles on computer → Export audio → Listen on phone
  • Multitasking: Listen to documents while doing other tasks
  • Accessibility: Options for different learning/processing styles

The magic is in the combination. Other apps do one or the other. VoiceInk now does both, seamlessly.


Future Enhancements

Potential follow-ups (not in this PR):

  • [ ] Offline voice downloading for better macOS TTS
  • [ ] SSML support for advanced pronunciation control
  • [ ] Real-time streaming for faster feedback
  • [ ] Voice cloning (ElevenLabs Professional tier)
  • [ ] Multi-speaker dialogue mode
  • [ ] Audiobook export with chapters

Credits

This feature was developed as part of the VoiceInk Community fork by @tmm22, with the goal of making VoiceInk a comprehensive accessibility tool. Special thanks to:

  • ElevenLabs for natural voice synthesis API
  • OpenAI for TTS and translation capabilities
  • VoiceInk community for feature requests and feedback

Checklist

  • [x] Code follows project style guidelines
  • [x] All new code is properly commented
  • [x] UI matches existing design system
  • [x] No breaking changes to existing features
  • [x] Feature works without API keys (macOS fallback)
  • [x] API keys stored securely in Keychain
  • [x] Comprehensive commit message included
  • [x] Ready for review

Final Thoughts

This PR represents months of development distilled into a clean, additive feature that:

  1. Respects VoiceInk's existing excellence by not touching core functionality
  2. Completes the accessibility story by adding the missing piece
  3. Creates a moat that competitors can't easily replicate
  4. Empowers users with disabilities to do more with less friction
  5. Positions VoiceInk as a complete solution, not just a transcription tool

This is how you build a "slam dunk app": Take what's already great, and add the one thing that makes it complete. No compromises, no trade-offs, just pure addition of value.

The Vision

VoiceInk isn't just about transcription anymore.
It's about accessibility in its fullest sense:
  - Speak and be understood (Speech-to-Text)
  - Read and be heard (Text-to-Speech)
  - Choose your interface (Voice, Text, or Audio)
  
That's the slam dunk. That's the perfect combination.

Ready to merge 🚀

This brings essential accessibility features to all VoiceInk users without changing a single line of existing transcription code. It's the best kind of feature addition: pure upside, zero risk.


How to Test

  1. Pull the branch: add-tts-accessibility-feature
  2. Launch VoiceInk: Everything works exactly as before
  3. Click "Text to Speech" tab: New workspace appears
  4. Generate some audio: Works immediately with macOS voices (no setup)
  5. Go back to transcription: Unchanged, perfect, exactly as you remember

See? Additive. Complete. Perfect.


Summary by cubic

Adds a new Text-to-Speech workspace with multiple providers (including macOS) for natural speech, plus previews, batch mode, export, and playback; purely additive via a new “Text to Speech” tab and settings. Also includes security hardening (production-safe logging, ephemeral sessions, API key validation) and a comprehensive security audit.

  • New Features

    • Multiple providers with macOS fallback (no keys needed to start).
    • Voice preview and provider-specific style controls.
    • Batch generation using --- separators.
    • Export to common audio formats; speed, loop, and scrub playback.
    • URL import with article cleanup; translation to 50+ languages.
    • Cost estimation and generation history.
    • Text snippets and a pronunciation glossary.
    • Optional SRT/VTT transcript generation.
  • Migration

    • No breaking changes.
    • New “Text to Speech” tab in the sidebar.
    • New Settings > Text-to-Speech for API keys (stored in Keychain).
    • Works out of the box with macOS voices; cloud providers are optional.

Written for commit f8ff67935721d1d25a2234abf66d018edf8d7034. Summary will update automatically on new commits.

tmm22, Nov 01 '25 05:11