Add length-constrained segmentation with configurable priors and algorithms
Over time, this grew into a more comprehensive v2.2.0 update. I expanded the core features from @harikesavan (adding language priors and a lognormal prior, among others) and tried to make sure that all edge cases are covered and that the text can be fully recovered after splitting (hence the very comprehensive test suite...). I also added some docs and bumped Python to >= 3.9, since our dependencies require that anyway.
I thought a full changelog would be useful, so here it is (auto-generated, though). @bminixhofer
Maybe also interesting to @harikesavan @igorsterner
📋 Changelog: wtpsplit v2.1.7 → v2.2.0
26 files changed, +3,122 / -101 lines
🎯 Major Feature: Length-Constrained Segmentation
Control segment lengths with min_length and max_length parameters.
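One way to picture how a hard `max_length` and a best-effort `min_length` can be enforced together is a Viterbi-style dynamic program over per-character boundary probabilities. The sketch below is a toy illustration under stated assumptions, not the code in `wtpsplit/utils/constraints.py`: the function name `viterbi_segment`, the `short_penalty` knob, and the exact scoring (log-probability of cutting vs. not cutting, plus a log prior on segment length) are all hypothetical.

```python
import math
from typing import Callable, List, Optional


def viterbi_segment(
    probs: List[float],
    max_length: Optional[int] = None,
    min_length: int = 1,
    prior: Callable[[int], float] = lambda length: 1.0,
    short_penalty: float = 10.0,  # hypothetical soft penalty for short segments
) -> List[int]:
    """Return exclusive end offsets of the best-scoring segmentation.

    probs[i] is the (assumed) probability that a segment ends after character i.
    max_length is strict; min_length is only discouraged via short_penalty.
    """
    n = len(probs)
    if n == 0:
        return []
    limit = n if max_length is None else max_length
    eps = 1e-9
    p = [min(max(x, eps), 1.0 - eps) for x in probs]

    # keep[i] = sum of log(1 - p[j]) for j < i, i.e. the cost of NOT cutting.
    keep = [0.0] * (n + 1)
    for i, x in enumerate(p):
        keep[i + 1] = keep[i] + math.log(1.0 - x)

    best = [-math.inf] * (n + 1)  # best[e]: best score of a split of text[:e]
    back = [0] * (n + 1)          # back[e]: start of the last segment in that split
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - limit), end):
            length = end - start
            weight = prior(length)
            if best[start] == -math.inf or weight <= 0.0:
                continue
            # No cut inside the segment, then one cut at its end.
            score = best[start] + (keep[end - 1] - keep[start]) + math.log(weight)
            if end < n:  # the final boundary is forced, so it is not scored
                score += math.log(p[end - 1])
            if length < min_length:
                score -= short_penalty
            if score > best[end]:
                best[end], back[end] = score, start

    if best[n] == -math.inf:
        raise ValueError("no segmentation satisfies the constraints")

    # Walk the back-pointers from the end of the text to recover the cuts.
    ends = []
    end = n
    while end > 0:
        ends.append(end)
        end = back[end]
    return ends[::-1]
```

Because every candidate segment is at most `max_length` long, the limit can never be exceeded, while short segments remain possible when nothing better exists. A greedy variant would commit to one cut at a time instead of filling the `best`/`back` tables, which is where the speed/optimality trade-off comes from.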
New Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `min_length` | `int` | `1` | Minimum length (best effort) |
| `max_length` | `int \| None` | `None` | Maximum length (strict) |
| `prior_type` | `str` | `"uniform"` | Prior distribution |
| `prior_kwargs` | `dict \| None` | `None` | Prior configuration |
| `algorithm` | `str` | `"viterbi"` | `"viterbi"` (optimal) or `"greedy"` (faster) |
Prior Functions
| Prior | Best For | Key Parameters |
|---|---|---|
| `"uniform"` (default) | Just enforce `max_length` | - |
| `"gaussian"` | Prefer target length | `target_length`, `spread` |
| `"lognormal"` | Natural distribution | `target_length`, `spread` (0.3-0.7) |
| `"clipped_polynomial"` | Hard enforcement | `target_length`, `spread` |
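To give a feel for how these four priors differ, here is a minimal sketch of plausible shapes. The formulas are assumptions chosen for illustration (each is scaled so that the peak equals 1), and the function names are hypothetical; the actual definitions live in `wtpsplit/utils/priors.py` and may differ.

```python
import math


def uniform_prior(length: int) -> float:
    # Every length is equally acceptable; only max_length limits the result.
    return 1.0


def gaussian_prior(length: int, target_length: float, spread: float) -> float:
    # Symmetric bell curve centred on target_length.
    z = (length - target_length) / spread
    return math.exp(-0.5 * z * z)


def lognormal_prior(length: int, target_length: float, spread: float) -> float:
    # Right-skewed curve; here spread is the log-space sigma and the
    # mode (not the mean) is placed at target_length. This is an assumption.
    if length <= 0:
        return 0.0
    mu = math.log(target_length) + spread ** 2

    def pdf(n: float) -> float:
        return math.exp(-((math.log(n) - mu) ** 2) / (2 * spread ** 2)) / n

    return pdf(length) / pdf(target_length)


def clipped_polynomial_prior(length: int, target_length: float, spread: float) -> float:
    # Inverted parabola clipped at zero: lengths further than `spread`
    # from the target get weight 0, which makes the preference hard.
    z = (length - target_length) / spread
    return max(0.0, 1.0 - z * z)


if __name__ == "__main__":
    for n in (40, 80, 120):
        print(
            n,
            round(gaussian_prior(n, 80, 20), 3),
            round(lognormal_prior(n, 80, 0.5), 3),
            round(clipped_polynomial_prior(n, 80, 20), 3),
        )
```

Under these assumed shapes, the lognormal curve tolerates segments longer than the target more readily than shorter ones, while the clipped polynomial rules out anything outside the target ± spread window entirely.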
Language-Aware Defaults (70+ languages)
Automatic target_length/spread based on language:
- East Asian (zh, ja, ko): shorter (45-55 chars)
- Germanic (de, nl, en): medium-long (75-90 chars)
- Romance/Slavic (fr, es, ru): medium-long (78-85 chars)
```python
# Auto-applies when using LoRA with language
sat = SaT("sat-3l", style_or_domain="ud", language="de")
sat.split(text, max_length=150, prior_type="gaussian")  # German defaults

# Or explicit
sat.split(text, max_length=100, prior_type="gaussian", prior_kwargs={"lang_code": "zh"})
```
Text Reconstruction
```python
# With constraints: "".join(segments) == original
# Without constraints: "\n".join(segments) == original
```
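The constrained round-trip property follows naturally if segments are plain slices of the input at the chosen cut offsets, so that whitespace stays attached to a segment instead of being dropped. The helper below is a hypothetical illustration of that idea (the name `split_at` is not part of wtpsplit), not the library's implementation.

```python
from typing import List


def split_at(text: str, ends: List[int]) -> List[str]:
    """Slice text at sorted, exclusive end offsets without dropping characters."""
    segments = []
    start = 0
    for end in ends:
        segments.append(text[start:end])
        start = end
    if start < len(text):  # keep any trailing remainder
        segments.append(text[start:])
    return segments


text = "Hello world. How are you?  Fine."
segments = split_at(text, [13, 27])

# Every character, including the double space, survives the split.
assert "".join(segments) == text
assert all(len(s) <= 14 for s in segments)
```

A check like the first assertion is cheap to run after any constrained split and catches off-by-one errors in the cut offsets immediately.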
🆕 New Files
| File | Lines | Description |
|---|---|---|
| `wtpsplit/utils/constraints.py` | 494 | Viterbi DP & greedy algorithms, constraint enforcement |
| `wtpsplit/utils/priors.py` | 198 | Prior functions + 70+ language defaults |
| `test_length_constraints.py` | 1,164 | 98 test cases |
| `length_constrained_segmentation_demo.py` | 450 | Interactive demo |
| `docs/LENGTH_CONSTRAINTS.md` | 288 | Math & implementation docs |
📝 Key Modifications
wtpsplit/__init__.py (+276 lines)
- New parameters for `WtP.split()`/`SaT.split()`
- Input validation, warnings when `threshold` is ignored with `max_length`
- Type hints fixed with `from __future__ import annotations`
wtpsplit/utils/__init__.py
- Bug fix: `from cached_property import ...` → `from functools import cached_property`
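For readers unfamiliar with the standard-library replacement, `functools.cached_property` computes the attribute on first access and stores the result on the instance. The class below is a made-up example (`Model` and `weights` are not wtpsplit names) that only demonstrates this caching behaviour.

```python
from functools import cached_property


class Model:
    def __init__(self) -> None:
        self.loads = 0  # counts how often the expensive step actually runs

    @cached_property
    def weights(self) -> list:
        # Stand-in for an expensive computation such as loading a checkpoint.
        self.loads += 1
        return [0.1, 0.2, 0.3]


model = Model()
first = model.weights
second = model.weights  # served from the instance cache, no recomputation
assert first is second
assert model.loads == 1
```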
README.md (+84 lines)
- New section: "(NEW! v2.2+) Length-Constrained Segmentation"
🔧 Build & CI Changes
setup.py
```python
python_requires=">=3.9"       # NEW: drops 3.7, 3.8
"transformers>=4.22.2,<5.0"   # Pinned (v5.0 breaking)
"huggingface-hub<1.0"         # Pinned (HfFolder removed)
# Removed: "cached_property" (stdlib in 3.9+)
```
pyproject.toml
- `target-version`: py38... → py39, py310, py311, py312
- `[tool.ruff.per-file-ignores]` → `[tool.ruff.lint.per-file-ignores]`
.github/workflows/python.yml
- Python matrix: removed 3.8, added 3.12
- Updated to `actions/checkout@v4`, `actions/setup-python@v5`
- `ruff` → `ruff check --target-version=py39`
requirements.txt
- Pinned `huggingface-hub==0.25.2`
- Removed `cached_property`
⚠️ Breaking Changes
| Change | Impact |
|---|---|
| Python ≥3.9 | Drops 3.7, 3.8 |
| transformers <5.0 | v5.0 has breaking API |
| huggingface-hub <1.0 | v1.0 removes HfFolder |
🧪 Tests
| File | Tests |
|---|---|
| `test.py` | +6 new constraint tests |
| `test_length_constraints.py` | 98 tests (NEW) |
| Total | 130 tests ✅ |