Add High-Quality Text-to-Speech for Accessibility

Open tmm22 opened this issue 1 month ago • 0 comments

Problem Statement

Users with disabilities who rely on text-to-speech (TTS) face significant barriers when using macOS.

Apple's Built-in TTS Falls Short for Accessibility Needs

Critical Issues:

1. Unusable for Long Documents 📚
   - Robotic, monotone voices become exhausting after 5-10 minutes
   - Poor intonation makes comprehension difficult
   - No emotional context or natural pacing
   - Users with visual impairments or dyslexia need to listen for hours daily
2. Inconsistent Voice Quality 🎭
   - Some voices are barely intelligible
   - Quality varies dramatically between languages
   - Many accents/dialects have poor representation
   - No control over voice characteristics
3. Limited Customization 🎚️
   - Can't adjust emotional tone, stability, or clarity
   - Speed control is basic (no fine-tuning)
   - No voice style options for different content types
   - One-size-fits-all doesn't work for accessibility
4. Poor Language Support 🌍
   - Non-English voices often sound worse
   - Limited dialect variations
   - Pronunciation issues with technical terms
   - No customization for names or specialized vocabulary

The Impact on Disabled Users

Real-world consequences:

Students with dyslexia struggle to consume course materials
Users with visual impairments face fatigue from poor voice quality
People with ADHD can't focus due to monotone delivery
Non-native speakers can't rely on TTS for learning
Professionals can't use TTS for work documents (too exhausting)

Current "solutions" don't work:

Web-based TTS services: Require copying/pasting, no offline access, poor UX
Separate Mac apps: Force users to juggle multiple tools, break workflows
Mobile TTS apps: Not practical for desktop work, small screens
Browser extensions: Limited to web content, inconsistent quality

Why macOS Needs Better Native TTS Support

The gap:

macOS has excellent speech-to-text (dictation)
macOS has terrible text-to-speech (speech synthesis) for accessibility
No native solution bridges this gap

What users need:

Natural-sounding voices comfortable for extended listening
Integration with existing workflows
Offline capability with online premium options
Control over voice characteristics and pacing
Support for custom pronunciation and terminology

Why VoiceInk is the Perfect Solution

VoiceInk already excels at Speech-to-Text (transcription via Whisper). Adding Text-to-Speech creates a complete accessibility suite that no other Mac app provides:

The Perfect Combination 🎯

Speech → Text (VoiceInk's existing strength)
    ↓
  ✨ NEW ✨
    ↓
Text → Speech (What this feature adds)

Bidirectional communication:

Voice to Text → Transcribe meetings, dictate notes, capture thoughts
Text to Voice → Listen to documents, hear your writing, consume content
Round-Trip Workflows → Record audio, transcribe, edit text, listen back with premium voices

No other Mac app does this. VoiceInk would become THE go-to accessibility tool.

Why This Belongs in VoiceInk

Natural fit:

✅ Already accessibility-focused - VoiceInk's mission aligns with disability support
✅ Complements existing features - Completes the "voice interface" story
✅ Same user base - People who need STT often need TTS
✅ Unified workflow - One app for all voice/text needs
✅ Local-first - Matches VoiceInk's privacy focus (offline option)

Strategic positioning:

🚀 Market differentiation - Unique feature combination
💪 Category leader - From "great transcription" to "complete accessibility suite"
🏆 Competitive moat - Hard for competitors to replicate this combo
💰 Value proposition - Justifies premium pricing with premium voices

Proposed Solution

Integrate high-quality Text-to-Speech using premium providers (ElevenLabs, OpenAI, Google Cloud TTS) with a free fallback to the built-in macOS voices.
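
To make the fallback idea concrete, here is a minimal sketch of how a provider could be chosen. It is an illustration only, not code from PR #354: the type `TTSProvider`, the function `resolveProvider`, and the rule "a cloud provider is usable when a non-empty API key is stored" are all assumptions.

```swift
// Hypothetical names throughout; this is not the implementation in PR #354.
enum TTSProvider: String, CaseIterable {
    case elevenLabs = "ElevenLabs"
    case openAI = "OpenAI"
    case googleCloud = "Google Cloud TTS"
    case macOS = "macOS"

    /// Assumption: cloud providers need an API key, the built-in voices do not.
    var requiresAPIKey: Bool { self != .macOS }
}

/// Returns the first provider in `preferred` that can be used with the
/// stored keys, and falls back to the built-in macOS voices otherwise.
func resolveProvider(preferred: [TTSProvider],
                     apiKeys: [TTSProvider: String]) -> TTSProvider {
    for provider in preferred {
        if !provider.requiresAPIKey {
            return provider
        }
        if let key = apiKeys[provider], !key.isEmpty {
            return provider
        }
    }
    return .macOS
}
```

With no keys configured, `resolveProvider(preferred: [.elevenLabs, .openAI], apiKeys: [:])` yields `.macOS`, which is the "works immediately" behaviour the proposal asks for.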

Key Features

Core Functionality:

Multiple provider support: ElevenLabs, OpenAI, Google Cloud TTS, + built-in macOS
Simple workflow: Select voice → Enter text → Generate
Voice preview system before generating
Batch generation for long documents
Audio export (.m4a, .mp3, .wav)
Playback controls: Speed (0.5×-2×), looping, timeline scrubbing
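
Batch generation for long documents implies splitting the text into pieces before sending it to a provider. The sketch below shows one possible approach, assuming a per-request character limit (`maxLength`, a hypothetical parameter; the real limits depend on the provider and are not stated in this issue). The function names are illustrative and not taken from PR #354.

```swift
/// Splits text into sentences. A sentence ends at ".", "!" or "?" only when
/// the next character is whitespace, so "3.14" stays intact.
func splitIntoSentences(_ text: String) -> [String] {
    var sentences: [String] = []
    var current = ""
    for character in text {
        if character.isWhitespace {
            if current.isEmpty { continue }  // skip whitespace between sentences
            if let last = current.last, ".!?".contains(last) {
                sentences.append(current)
                current = ""
                continue
            }
        }
        current.append(character)
    }
    if !current.isEmpty { sentences.append(current) }
    return sentences
}

/// Packs whole sentences into chunks of at most `maxLength` characters.
/// A single sentence longer than `maxLength` is cut at the limit.
func chunkText(_ text: String, maxLength: Int) -> [String] {
    precondition(maxLength > 0, "maxLength must be positive")
    var chunks: [String] = []
    var current = ""
    for sentence in splitIntoSentences(text) {
        if sentence.count > maxLength {
            if !current.isEmpty { chunks.append(current); current = "" }
            var rest = Substring(sentence)
            while !rest.isEmpty {
                chunks.append(String(rest.prefix(maxLength)))
                rest = rest.dropFirst(maxLength)
            }
            continue
        }
        let candidate = current.isEmpty ? sentence : current + " " + sentence
        if candidate.count <= maxLength {
            current = candidate
        } else {
            chunks.append(current)
            current = sentence
        }
    }
    if !current.isEmpty { chunks.append(current) }
    return chunks
}
```

Breaking on sentence boundaries keeps each generated clip ending at a natural pause. Abbreviations such as "e.g. " would still be treated as sentence ends here, so a production version would likely need a smarter tokenizer.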

Accessibility Features:

Works immediately with macOS voices (no API keys required)
Natural voices from ElevenLabs/OpenAI for comfort during long listening
Voice style controls (emotion, stability, clarity)
Pronunciation glossary for custom terms
Translation support (50+ languages)
Speed adjustment for different comprehension needs
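
A pronunciation glossary can be as simple as a substitution pass applied to the text before it is sent for synthesis. The following is a rough sketch under that assumption; `PronunciationGlossary` and its methods are hypothetical names, and the example respellings are illustrative only.

```swift
import Foundation

/// Hypothetical glossary: maps a written term to the spelling that should be spoken.
struct PronunciationGlossary {
    var entries: [String: String] = [:]

    mutating func add(term: String, spokenAs: String) {
        entries[term] = spokenAs
    }

    /// Replaces whole-word, case-insensitive matches of each term.
    /// Longer terms are applied first so they win over shorter ones.
    func apply(to text: String) -> String {
        var result = text
        let terms = entries.keys.sorted { ($0.count, $0) > ($1.count, $1) }
        for term in terms {
            guard let spoken = entries[term] else { continue }
            let pattern = "\\b" + NSRegularExpression.escapedPattern(for: term) + "\\b"
            guard let regex = try? NSRegularExpression(pattern: pattern,
                                                       options: [.caseInsensitive]) else {
                continue
            }
            let range = NSRange(result.startIndex..., in: result)
            result = regex.stringByReplacingMatches(
                in: result,
                options: [],
                range: range,
                withTemplate: NSRegularExpression.escapedTemplate(for: spoken))
        }
        return result
    }
}
```

Because matching relies on `\b` word boundaries, terms that start or end with punctuation (for example "C++") would not match in this sketch, and a replacement could in principle be rewritten by a later, shorter term.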

Advanced Capabilities:

URL import: Extract and speak web articles
Text snippets library: Save commonly used phrases
Transcript generation: Export subtitles with timestamps
Cost estimation: Transparent pricing for cloud providers
Generation history: Replay previous outputs
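
For "export subtitles with timestamps", one common target is the SubRip (.srt) format. The sketch below assumes the start and end time of each spoken segment is already known (for instance from the duration of each generated clip); `SubtitleSegment`, `srtTimestamp`, and `makeSRT` are hypothetical names, and the issue does not say which subtitle format the PR actually uses.

```swift
/// Hypothetical segment: times are in seconds from the start of the audio.
struct SubtitleSegment {
    let start: Double
    let end: Double
    let text: String
}

/// Left-pads a non-negative integer with zeros to the given width.
func padded(_ value: Int, width: Int) -> String {
    let digits = String(value)
    return String(repeating: "0", count: max(0, width - digits.count)) + digits
}

/// Formats seconds as an SRT timestamp: HH:MM:SS,mmm
func srtTimestamp(_ seconds: Double) -> String {
    let totalMilliseconds = Int((max(0, seconds) * 1000).rounded())
    let milliseconds = totalMilliseconds % 1000
    let totalSeconds = totalMilliseconds / 1000
    let hours = totalSeconds / 3600
    let minutes = (totalSeconds / 60) % 60
    let secs = totalSeconds % 60
    return "\(padded(hours, width: 2)):\(padded(minutes, width: 2)):"
        + "\(padded(secs, width: 2)),\(padded(milliseconds, width: 3))"
}

/// Builds the SRT document: a counter, a time range, the text, and a blank line per cue.
func makeSRT(from segments: [SubtitleSegment]) -> String {
    segments.enumerated().map { index, segment in
        "\(index + 1)\n\(srtTimestamp(segment.start)) --> \(srtTimestamp(segment.end))\n\(segment.text)\n"
    }.joined(separator: "\n")
}
```

Pairing this with sentence-level chunks would give one subtitle cue per generated clip, which keeps the text and audio aligned without needing word-level timing from the provider.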

Why Premium Voices Matter for Accessibility

ElevenLabs/OpenAI vs Apple TTS:

Feature	Apple TTS	Premium TTS (ElevenLabs/OpenAI)
Naturalness	❌ Robotic	✅ Human-like
Long listening	❌ Exhausting (after 5-10 min)	✅ Comfortable (hours)
Emotional range	❌ Monotone	✅ Expressive
Customization	❌ Minimal	✅ Extensive
Pronunciation	❌ Poor	✅ Excellent
Multiple voices	❌ Limited	✅ Hundreds

For accessibility, quality isn't a luxury—it's essential.

Real-World Use Cases

For Users with Visual Impairments

Current: Struggle with robotic macOS voices, fatigue quickly
With VoiceInk: Use premium voices for natural listening experience
Workflow: Import documents → Generate audio → Listen comfortably for hours

For Users with Dyslexia

Current: Reading long documents is exhausting and error-prone
With VoiceInk: Import web articles or PDFs → Listen with adjustable speed
Workflow: Copy text → Paste in VoiceInk → Generate → Listen while doing other tasks

For Students with Learning Disabilities

Current: Struggle to proofread written work by reading
With VoiceInk: Transcribe assignments by voice → Listen back to catch errors
Workflow: Dictate essay → Edit transcript → Hear it read back → Submit confidently

For Professionals Who Need Both

Current: Use separate apps for dictation and TTS (clunky)
With VoiceInk: One app for all voice/text needs
Workflow:
- Transcribe meeting notes
- Edit and clean up transcript
- Generate audio summary
- Share both text and audio with team

Why This is Urgent

The accessibility community needs this now:

1. Existing solutions are inadequate
   - Apple TTS is painful for long-form content
   - Web services are fragmented and require internet
   - No unified Mac solution exists
2. Remote work/education increased TTS demand
   - More digital content to consume
   - More long-form documents
   - More need for multimodal accessibility
3. AI voice quality is finally good enough
   - ElevenLabs and OpenAI TTS are production-ready
   - Natural enough for daily use
   - Affordable for individual users
4. VoiceInk has the infrastructure
   - Already handles audio processing
   - Already has API integration patterns
   - Already has the right user base

Implementation Available

A full implementation of this feature is ready for review:

Pull Request #354: Add Text-to-Speech as Accessibility Feature

What's included:

✅ 55 new Swift files (8,592 lines of code)
✅ Complete TTS workspace with modern UI
✅ Support for ElevenLabs, OpenAI, Google Cloud TTS, macOS voices
✅ Settings integration with secure API key storage
✅ Batch processing, audio export, playback controls
✅ Zero breaking changes (purely additive)
✅ Consistent with VoiceInk's design system
✅ Well-documented and tested

The work is done. This issue is to discuss whether to merge it.

Discussion Points

For the VoiceInk community to consider:

1. Does this align with VoiceInk's mission?
   - Is completing the voice/text accessibility loop worthwhile?
   - Should VoiceInk be "just transcription" or a "complete accessibility suite"?
2. Is the implementation acceptable?
   - Review PR #354 for code quality, architecture, UX
   - Are there concerns about maintenance burden?
   - Does it integrate well with existing features?
3. What about scope creep?
   - Is this "feature bloat" or natural evolution?
   - Does it enhance or distract from core transcription?
   - How do users feel about TTS in a transcription app?
4. Accessibility priority?
   - How important is it to serve users with disabilities?
   - Is solving the "Apple TTS problem" valuable to the community?
   - Would this make VoiceInk more inclusive?

Alternatives Considered

Why not just tell users to use other TTS apps?

Workflow fragmentation - Forces context switching between apps
No integration - Can't leverage VoiceInk's existing transcripts
Poor UX - Separate apps don't understand each other
Cost - Premium TTS apps cost $20-50/month separately

Why not just improve Apple's TTS?

Not in our control - We can't fix Apple's voices
Slow progress - Apple TTS hasn't improved significantly in years
Immediate solution - Premium APIs available now

Why not just wait for Apple to fix it?

No indication they will - Apple TTS has been mediocre for a decade
Users need help now - People with disabilities can't wait
Competitive advantage - VoiceInk can lead where Apple lags

Call to Action

For maintainers: Please review PR #354 and consider merging this accessibility-focused feature.

For users: If you need better TTS on Mac, please comment with your use case. Your voice matters.

For accessibility advocates: Share this issue with communities who would benefit from better Mac TTS.
