[Bug] Report - Sequential Short Text Generation Crash
Bug Summary
ChatterBox TTS model crashes with CUDA tensor indexing errors when generating multiple short text segments in sequence. The bug is apparently triggered by insufficient text length combined with reference audio quality, leading to tensor dimension mismatches across sequential generations.
Environment
- Model: ChatterBox TTS (ResembleAI/chatterbox)
- Framework: PyTorch with CUDA
- Platform: Windows (confirmed), ComfyUI, ComfyUI_ChatterBox_SRT_Voice
- Hardware: CUDA-capable GPU
Bug Description
Error Message
Repeated messages like this:
C:\actions-runner\_work\pytorch\pytorch\pytorch\aten\src\ATen\native\cuda\Indexing.cu:1553:
block: [X,0,0], thread: [Y,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
Root Cause
? (LLM interpretation): The ChatterBox model fails to generate sufficient internal tokens for short text, leading to mel-spectrogram and token alignment issues. Combined with sequential generation, this causes internal state contamination and tensor dimension inconsistencies.
Reproduction Factors
1. Text Length Dependency
- Safe length: 30+ characters with padding
- Unsafe length: <25 characters without padding
- Critical factor: Token count, not just character count - ? (LLM interpretation)
2. Reference Audio Impact (??? not really sure, but changing the audio source clearly makes a difference)
- Low-quality/short audio: More tolerant of short text ?
- High-quality/long audio: More sensitive to short text ?
- Speech vs music: Both affected, but music/complex audio fails faster ??
3. Sequential Generation Pattern
- Single generation: Usually works even with very short text
- 2-3 sequential: Crashes with short text (always happens if the segments are short enough)
- 4+ sequential: Crashes even more consistently with short text. This is not random: after padding short texts, generation works for up to 12 queued runs of 5 small segments each.
Reproduction Steps
Minimal Test Case
- Queue 4 or more generations with my Chatterbox SRT node on subtitles with small segments, or with the normal Chatterbox node with many character changes and small segments (a standalone sketch follows the example line):
Say what? NO? [female_01]THAT! [female_02]NO! Yes!
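For reference outside ComfyUI, a minimal standalone sketch of the same sequential pattern, assuming the standard chatterbox Python API from the ResembleAI/chatterbox README (reference.wav is a placeholder path):

```python
import torchaudio as ta
from chatterbox.tts import ChatterboxTTS

model = ChatterboxTTS.from_pretrained(device="cuda")

# Short segments similar to the test case above
short_segments = ["Say what?", "NO?", "THAT!", "NO!", "Yes!"]

for i, seg in enumerate(short_segments):
    # The first call usually succeeds; the srcIndex < srcSelectDimSize
    # assertion tends to fire on the second or third call.
    wav = model.generate(seg, audio_prompt_path="reference.wav")
    ta.save(f"seg_{i}.wav", wav, model.sr)
```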
Technical Details
Internal State Issue - ? (LLM interpretation):
The bug suggests the ChatterBox model:
- Doesn't properly reset internal tensors between generations
- Leaves corrupted mel-spectrogram/token alignment state after short text processing
- Fails tensor dimension validation in subsequent generations
Expected Behavior
Each call to model.generate() should be independent and not affect subsequent generations, regardless of text length.
Workaround
Pad short text segments with additional text to ensure sufficient token generation (a minimal sketch follows this list):
- Provides sufficient tokens for internal ChatterBox processing
- Works reliably for sequential generation scenarios
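A minimal sketch of the padding idea, using the hesitation prefix from the testing results below; pad_short_segment and the 30-character threshold are illustrative, not part of ChatterBox:

```python
MIN_SAFE_LENGTH = 30  # characters; the safe threshold observed above

def pad_short_segment(seg: str) -> str:
    # Spaces/periods alone did not help in testing (apparently stripped
    # before generation), so prepend real filler words instead.
    if len(seg.strip()) >= MIN_SAFE_LENGTH:
        return seg
    return f"...ummmmm {seg}"
```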
My Temporary Workaround (ComfyUI Integration)
Note: This is a workaround for ComfyUI usage, not a core fix for ChatterBox itself.
The ComfyUI ChatterBox Voice extension prevents crashes by predicting problematic text patterns and skipping them before they reach ChatterBox. This keeps workflow environments running but does not solve the underlying ChatterBox bug.
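A rough sketch of that idea; is_risky_segment and the 25-character threshold are illustrative guesses based on the observations in this report, not the extension's actual code:

```python
MIN_SAFE_CHARS = 25  # below roughly this length, sequential generation crashed

def is_risky_segment(seg: str) -> bool:
    # Character count is only a proxy; the real trigger is more likely
    # the token count the tokenizer produces for the segment.
    return len(seg.strip()) < MIN_SAFE_CHARS

def filter_segments(segments: list[str]) -> list[str]:
    # Skip risky segments before they reach ChatterBox so a single short
    # segment cannot crash the whole sequential queue.
    return [s for s in segments if not is_risky_segment(s)]
```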
Testing results:
- Short text "word" → crashes in sequential generation
- Padding with spaces/periods only (probably stripped during generation) → still crashes even at 150+ characters
- Meaningful text like "word is a word is a world" → works reliably; long, normal sequences of speech always work
- Natural hesitation prefix "...ummmmm {seg}" → works and preserves the original text when long enough
- Anecdotal Success Pattern: ",,{seg} hmm {seg},," → no crashes observed in extensive testing
- Uses commas instead of spaces for punctuation
- Includes a natural hesitation sound ("hmm")
- Provides a repetition pattern that may help with tensor alignment
- NOT IDEAL, but it is the workaround I managed to make that does not interfere too much with generation (a sketch follows below)
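As a sketch, that pattern as a hypothetical wrapper (illustrative only, not part of ChatterBox or the extension):

```python
def wrap_segment(seg: str) -> str:
    # Applies the anecdotal success pattern: comma padding, a natural
    # hesitation sound ("hmm"), and repetition of the segment.
    return f",,{seg} hmm {seg},,"
```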
Investigation of Proposed Fixes
Community Fix Analysis (GitHub Issue #72)
Proposed Solution: Remove MinPLogitsWarper and adjust sampling parameters (top_p, repetition_penalty)
Status: ❌ Does NOT fix this specific CUDA crash
Evidence: Testing with ChatterBox implementations that include these changes still produces identical tensor indexing crashes
Conclusion: The logits processing fix addresses different generation issues, not the sequential short-text CUDA crashes
Current Status
As of July 2025:
- ❌ Official ChatterBox repository: Bug remains unfixed
- ❌ Community forks: Proposed fixes don't resolve this crash pattern
- ❌ Root cause: Still unknown - requires investigation of tensor state management
Impact
This bug severely limits ChatterBox usage in:
- Batch processing multiple short texts
- Interactive applications with variable text lengths
- Any system that processes multiple generations sequentially
Suggested Investigation
The issue likely exists in:
- Model state reset logic between generations
- Tensor dimension validation for short text inputs
- Mel-spectrogram and token alignment cleanup
- Internal cache/conditional management (see the defensive sketch below)
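Until the root cause is found, one defensive mitigation to test, assuming chatterbox's prepare_conditionals method is the conditioning entry point; whether re-preparing conditionals actually clears the corrupted state is unverified:

```python
import torch

def generate_defensively(model, text: str, ref_audio_path: str):
    # Rebuild the reference conditioning and release cached CUDA
    # allocations before every call, so each generation starts from
    # fresh state instead of inheriting tensors from the previous one.
    model.prepare_conditionals(ref_audio_path)
    torch.cuda.empty_cache()
    return model.generate(text)
```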
Yes, I've also been troubled by this short text processing issue.
I'm running into this issue too. It looks like https://github.com/OSCxSST/chatterbox-CTB/pull/4 addresses the issue; is there any status on getting that integrated here?