
[Bug] Report - Sequential Short Text Generation Crash

Open diodiogod opened this issue 5 months ago • 2 comments

Bug Summary

ChatterBox TTS model crashes with CUDA tensor indexing errors when generating multiple short text segments in sequence. The bug is apparently triggered by insufficient text length combined with reference audio characteristics, leading to tensor dimension mismatches across sequential generations.

Environment

  • Model: ChatterBox TTS (ResembleAI/chatterbox)
  • Framework: PyTorch with CUDA
  • Platform: Windows (confirmed), ComfyUI, ComfyUI_ChatterBox_SRT_Voice
  • Hardware: CUDA-capable GPU

Bug Description

Error Message

Repeated messages like the following:

C:\actions-runner\_work\pytorch\pytorch\pytorch\aten\src\ATen\native\cuda\Indexing.cu:1553: 
block: [X,0,0], thread: [Y,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
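Because CUDA kernel launches are asynchronous, the Python stack trace for a device-side assert like the one above usually points at a later, unrelated call. A common first debugging step (not specific to ChatterBox) is to force synchronous launches so the trace points at the actual failing op:

```python
import os

# Must be set before torch initializes CUDA (i.e. before the first CUDA call).
# Forces synchronous kernel launches so device-side asserts surface at the
# real failing operation, at a significant performance cost.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# On newer PyTorch builds, device-side assertions can report extra detail:
os.environ["TORCH_USE_CUDA_DSA"] = "1"
```

This does not fix the crash, but it should turn the `Indexing.cu` assert into a traceback that identifies which ChatterBox op performs the out-of-range index.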

Root Cause

Hypothesis (LLM interpretation, unconfirmed): the ChatterBox model fails to generate sufficient internal tokens for short text, leading to mel-spectrogram and token alignment issues. Combined with sequential generation, this causes internal state contamination and tensor dimension inconsistencies.

Reproduction Factors

1. Text Length Dependency

  • Safe length: 30+ characters with padding
  • Unsafe length: <25 characters without padding
  • Critical factor: token count, not just character count (LLM interpretation, unconfirmed)

2. Reference Audio Impact (uncertain, but changing the audio source clearly makes a difference)

  • Low-quality/short audio: appears more tolerant of short text
  • High-quality/long audio: appears more sensitive to short text
  • Speech vs. music: both affected, but music/complex audio seems to fail faster

3. Sequential Generation Pattern

  • Single generation: usually works even with very short text
  • 2-3 sequential: crashes with short text (reliably, if the segment is short enough)
  • 4+ sequential: fails even more consistently on short text. This is not random: after padding short texts, generation works for up to 12 queued runs of 5 short segments each.

Reproduction Steps

Minimal Test Case

  • Queue 4 or more generations with my ChatterBox SRT node using subtitles with short segments, or with the normal ChatterBox node using many character changes with short segments:

Say what? NO? [female_01]THAT! [female_02]NO! Yes!
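The test line above mixes very short segments with `[voice_tag]` switches. A minimal sketch of that pattern: the segment splitter below is a hypothetical helper mirroring the node's tag convention (it is not part of the ChatterBox API), and the commented loop shows the sequential calls that trigger the crash.

```python
import re

def split_segments(text):
    """Split a multi-voice line into (voice, segment) pairs.

    Hypothetical helper mirroring the [voice_tag] convention in the
    test case above; not part of the ChatterBox package itself.
    """
    parts = re.split(r"(\[[^\]]+\])", text)
    voice, segments = "default", []
    for part in parts:
        part = part.strip()
        if not part:
            continue
        if part.startswith("[") and part.endswith("]"):
            voice = part[1:-1]  # switch active voice
        else:
            segments.append((voice, part))
    return segments

segments = split_segments("Say what? NO? [female_01]THAT! [female_02]NO! Yes!")
# Every resulting segment is well under the ~25-character danger threshold
# reported above.
assert all(len(seg) < 25 for _, seg in segments)

# Sketch of the failing pattern (requires a CUDA GPU, the chatterbox package,
# and a reference clip; shown commented out, not executed here):
# from chatterbox.tts import ChatterboxTTS
# model = ChatterboxTTS.from_pretrained(device="cuda")
# for _, seg in segments:
#     wav = model.generate(seg, audio_prompt_path="ref.wav")
#     # crashes with the Indexing.cu assert by the 2nd-3rd short segment
```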

Technical Details

Internal State Issue (LLM interpretation, unconfirmed):

The crash pattern suggests the ChatterBox model:

  1. Doesn't properly reset internal tensors between generations
  2. Leaves corrupted mel-spectrogram/token alignment state after short text processing
  3. Fails tensor dimension validation in subsequent generations

Expected Behavior

Each call to model.generate() should be independent and not affect subsequent generations, regardless of text length.

Workaround

Pad short text segments with additional text so the model generates enough internal tokens:

  • Provides sufficient tokens for internal ChatterBox processing
  • Works reliably for sequential generation scenarios

My Temporary Workaround (ComfyUI Integration)

Note: This is a workaround for ComfyUI usage, not a core fix for ChatterBox itself.

The ComfyUI ChatterBox Voice extension implements crash prevention by predicting problematic text patterns and skipping them before they reach ChatterBox, preventing crashes in workflow environments. This does not solve the underlying ChatterBox bug.
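The crash-prevention idea described above can be sketched as a simple pre-flight guard. This is an assumed, simplified heuristic, not the extension's actual code; the 25-character threshold and the stripping of space/period padding both come from the observations in this report.

```python
def is_crash_prone(text, min_chars=25):
    """Heuristic guard (hypothetical): flag segments likely to trigger the
    sequential-generation CUDA indexing crash so the caller can pad or
    skip them before they reach ChatterBox.

    Spaces and periods are stripped first, since padding with only those
    characters was observed not to help (likely removed during generation).
    """
    meaningful = text.strip(" .")
    return len(meaningful) < min_chars

# Matches the testing results reported below:
assert is_crash_prone("word")                          # short text: risky
assert is_crash_prone("." * 150)                       # period-only padding: still risky
assert not is_crash_prone("word is a word is a world") # meaningful text: OK
```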

Testing results:

  • Short text "word" → crashes in sequential generation
  • Spaces/periods padding only (probably stripped during generation) → still crashes even with 150+ characters
  • Meaningful text like "word is a word is a world" → works reliably; long, normal sequences of speech always work
  • Natural hesitation prefix "...ummmmm {seg}" → works and preserves original text when long enough
  • Anecdotal Success Pattern: ",,{seg} hmm {seg},," → No crashes observed in extensive testing
    • Uses commas instead of spaces for punctuation
    • Includes natural hesitation sound ("hmm")
    • Provides repetition pattern that may help with tensor alignment
    • NOT IDEAL, but it is the workaround I managed to find that does not interfere too much with generation.
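The anecdotal `,,{seg} hmm {seg},,` pattern above can be sketched as a small padding helper. This is a minimal illustration, assuming the ~25-character threshold from the earlier observations; the function name and threshold are mine, not from the extension.

```python
def pad_segment(seg, min_chars=25):
    """Apply the anecdotal ',,{seg} hmm {seg},,' padding when a segment is
    too short for safe sequential generation.

    Hypothetical sketch: uses commas (not spaces) for punctuation padding,
    a natural hesitation sound ("hmm"), and repetition of the segment,
    matching the pattern reported to avoid crashes above.
    """
    if len(seg.strip()) >= min_chars:
        return seg  # long enough; leave untouched
    return f",,{seg} hmm {seg},,"

print(pad_segment("NO!"))  # ",,NO! hmm NO!,,"
```

The repeated segment does show up in the audio, which is why this is a workaround rather than a fix; a core fix would need to handle short token sequences inside the model itself.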

Investigation of Proposed Fixes

Community Fix Analysis (GitHub Issue #72)

  • Proposed solution: remove MinPLogitsWarper and adjust sampling parameters (top_p, repetition_penalty)
  • Status: ❌ does NOT fix this specific CUDA crash
  • Evidence: testing with ChatterBox implementations that include these changes still produces identical tensor indexing crashes
  • Conclusion: the logits-processing fix addresses different generation issues, not the sequential short-text CUDA crashes

Current Status

As of July 2025:

  • Official ChatterBox repository: Bug remains unfixed
  • Community forks: Proposed fixes don't resolve this crash pattern
  • Root cause: Still unknown - requires investigation of tensor state management

Impact

This bug severely limits ChatterBox usage in:

  • Batch processing multiple short texts
  • Interactive applications with variable text lengths
  • Any system that processes multiple generations sequentially

Suggested Investigation

The issue likely exists in:

  • Model state reset logic between generations
  • Tensor dimension validation for short text inputs
  • Mel-spectrogram and token alignment cleanup
  • Internal cache/conditional management

diodiogod avatar Jul 19 '25 01:07 diodiogod

Yes, I've also been troubled by this short text processing issue.

lcl3809 avatar Jul 22 '25 01:07 lcl3809

I'm running into this issue too. It looks like https://github.com/OSCxSST/chatterbox-CTB/pull/4 addresses the issue; is there any status on getting that integrated here?

RetroZelda avatar Oct 11 '25 19:10 RetroZelda