[Feature] TTS: Support [pause:xx] Tags, Auto-Editor Cleanup, and Dependency Upgrade
Overview
This pull request adds pause tag support and audio artifact cleaning features to Chatterbox TTS, while maintaining full compatibility with the upstream multilingual implementation.
Status: ✅ Successfully rebased onto up/master (includes Multilingual v2 #295)
Key Features
1. Pause Tag Support ([pause:Xs])
Users can now insert pauses in generated audio using the [pause:Xs] syntax:
from chatterbox import ChatterboxTTS
tts = ChatterboxTTS()
audio = tts.generate(
text="Hello[pause:1.0s]world!",
ref_audio_path="reference.wav"
)
Implementation:
parse_pause_tags()function parses pause markers from text (tts.py:643)create_silence()generates silent audio segments (tts.py:690)- Automatic pause duration rounding to 0.1s increments
- Seamless integration with existing TTS generation pipeline
2. Auto-Editor Artifact Cleaning
Removes unwanted audio artifacts while preserving pause boundaries:
audio = tts.generate(
text="Your text here",
ref_audio_path="reference.wav",
use_auto_editor=True,
ae_threshold=0.06,
ae_margin=0.2
)
Implementation:
_clean_artifacts()method integrates auto-editor tool (tts.py:579)- Configurable threshold and margin parameters
- Protects pause boundaries during artifact removal
- Optional watermark removal support
3. Long Text Async Processing
Handles long text generation efficiently:
- Automatic text segmentation for texts > 300 characters
- Asynchronous batch processing with configurable workers
- Language-aware sentence splitting (EN, ZH, JA, KO)
- Smart sentence merging to avoid fragments
New utility functions in text_utils.py:
split_text_into_segments()- Intelligent text segmentationsplit_by_word_boundary()- Language-aware word boundary detectionmerge_short_sentences()- Combines short segmentsdetect_language()- Auto-detects text language
Compatibility with Upstream
This PR has been successfully rebased onto the latest upstream master, which includes:
✅ Multilingual v2 Update (#295) - 23 language support
✅ ChatterboxMultilingualTTS - New multilingual TTS class
✅ MTLTokenizer - Multilingual tokenization
✅ All upstream bug fixes and improvements
Both feature sets work together seamlessly:
- Pause tags work with all 23 supported languages
- Artifact cleaning compatible with multilingual audio
- Text utilities support multilingual text processing
Changes Summary
Modified Files
src/chatterbox/tts.py (+434 lines)
- Added
parse_pause_tags()function - Added
create_silence()function - Added
_clean_artifacts()method - Enhanced
generate()method with pause and artifact cleaning support - New parameters:
use_auto_editor,ae_threshold,ae_margin,disable_watermark,max_segment_length,max_workers
src/chatterbox/text_utils.py (NEW - 358 lines)
- Language detection for EN, ZH, JA, KO
- Text segmentation utilities
- Word boundary detection
- Sentence splitting and merging
src/chatterbox/__init__.py
- Exports both
ChatterboxTTSandChatterboxMultilingualTTS - Exports
SUPPORTED_LANGUAGES(23 languages) - Exports text utility functions
pyproject.toml
- Version:
0.1.4(matching upstream) - Python requirement:
>=3.10(matching upstream) - numpy:
>=1.24.0,<1.26.0(matching upstream) - Added dependencies:
auto-editor>=27.0.0(for artifact cleaning)resampy==0.4.3(for audio resampling)
- Preserved upstream dependencies:
- All multilingual dependencies (spacy-pkuseg, pykakasi, etc.)
- gradio, russian-text-stresser
README.md
- Documented pause tag usage
- Added artifact cleaning examples
- Preserved multilingual feature documentation
Testing
All features have been tested and verified:
✅ Python Syntax - All files compile successfully
✅ Pause Tag Parsing - Handles single/multiple/edge cases
✅ Multilingual Support - 23 languages correctly exported
✅ Text Utilities - All segmentation functions work
✅ Module Exports - All imports functional
✅ Dependencies - Correctly merged (32/32 tests passed)
Test Results: 100% pass rate (32/32 tests)
Usage Examples
Basic Pause Tags
from chatterbox import ChatterboxTTS
tts = ChatterboxTTS()
audio = tts.generate(
text="Welcome[pause:0.5s]to[pause:0.5s]Chatterbox",
ref_audio_path="speaker.wav"
)
With Artifact Cleaning
audio = tts.generate(
text="Your text with[pause:1.0s]natural pauses",
ref_audio_path="speaker.wav",
use_auto_editor=True,
ae_threshold=0.06
)
Long Text Processing
long_text = "..." # Text longer than 300 characters
audio = tts.generate(
text=long_text,
ref_audio_path="speaker.wav",
max_segment_length=300,
max_workers=4
)
Multilingual with Pause Tags
from chatterbox import ChatterboxMultilingualTTS
mtl_tts = ChatterboxMultilingualTTS()
audio = mtl_tts.generate(
text="Bonjour[pause:1.0s]le monde", # French with pause
language="fr",
ref_audio_path="french_speaker.wav"
)
Migration Notes
This PR maintains backward compatibility:
- Existing code using
ChatterboxTTScontinues to work unchanged - New parameters are optional with sensible defaults
- No breaking changes to the API
Acknowledgments
- Base implementation builds on Chatterbox by Resemble AI
- Successfully integrated with upstream Multilingual v2 features
- Preserves all upstream improvements and bug fixes
Checklist
- [x] Code rebased onto latest upstream master
- [x] All tests passing (32/32)
- [x] Pause functionality verified
- [x] Multilingual compatibility verified
- [x] Dependencies correctly merged
- [x] Documentation updated
- [x] No breaking changes
- [x] Backward compatible
The pause tag is a huge improvement and has made my workflow usable with Chatterbox. Thank you!
Hello, Are there any specific instructions or guides I can follow to update my chatterbox with this code? I need the pause tag capability badly.
Hello, Are there any specific instructions or guides I can follow to update my chatterbox with this code? I need the pause tag capability badly.
You can click the branch above (in this case, https://github.com/EasyMetaAu/chatterbox/tree/master), pull and build it. That's what I did.
@mylukin @feliscat Hi there! I used this branch but the model reads "pause" word instead of adding pause between words! Here is the code: (python 3.11)
git clone https://github.com/EasyMetaAu/chatterbox.git cd chatterbox pip install -e .
import torchaudio as ta from chatterbox.tts import ChatterboxTTS
model = ChatterboxTTS.from_pretrained(device="cuda") text = "This is [pause:1.0] my test text." AUDIO_PROMPT_PATH = "audio_denoised.wav" wav = model.generate(text, audio_prompt_path=AUDIO_PROMPT_PATH, cfg_weight=0.4, use_auto_editor=True) ta.save("out/audio_pause.wav", wav, model.sr)
This is my test text
Change to : This is [pause:1s] my test text
@mylukin @feliscat Hi there! I used this branch but the model reads "pause" word instead of adding pause between words! Here is the code: (python 3.11)
git clone https://github.com/EasyMetaAu/chatterbox.git cd chatterbox pip install -e .
import torchaudio as ta from chatterbox.tts import ChatterboxTTS
model = ChatterboxTTS.from_pretrained(device="cuda") text = "This is my test text." AUDIO_PROMPT_PATH = "audio_denoised.wav" wav = model.generate(text, audio_prompt_path=AUDIO_PROMPT_PATH, cfg_weight=0.4, use_auto_editor=True) ta.save("out/audio_pause.wav", wav, model.sr)
The correct format is [pause:Xs]
@mylukin @feliscat Thanks a lot! Is this feature available for the multilingual model?
Is there a reason why this PR isn't being merged? Of course there are conflicts right now that need resolving, but has it been reviewed by official contributors?
Commenting because I really this feature also, if possible to merge it. Thanks.
I just also wanted to second that this PR would be extremely useful 😄 I would love to see it merged!