
Add length-constrained segmentation with configurable priors and algorithms

Open harikesavan opened this issue 1 month ago • 1 comments

harikesavan avatar Nov 25 '25 10:11 harikesavan

Over time, this mutated a little into a more comprehensive v2.2.0 update! I expanded the core features from @harikesavan (adding language priors and a lognormal prior, among others), and tried to make sure that we cover all edge cases and that text can be fully recovered after splitting (hence the very comprehensive test suite...). I also added some docs, and bumped Python to >= 3.9 since our dependencies require that anyway.

I thought a full changelog would be useful, so here it is (auto-generated, though)! @bminixhofer

Maybe also interesting to @harikesavan @igorsterner


📋 Changelog: wtpsplit v2.1.7 → v2.2.0

26 files changed, +3,122 / -101 lines


🎯 Major Feature: Length-Constrained Segmentation

Control segment lengths with min_length and max_length parameters.

New Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| min_length | int | 1 | Minimum length (best effort) |
| max_length | int \| None | None | Maximum length (strict) |
| prior_type | str | "uniform" | Prior distribution |
| prior_kwargs | dict \| None | None | Prior configuration |
| algorithm | str | "viterbi" | "viterbi" (optimal) or "greedy" (faster) |
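
To illustrate what "optimal" means for the "viterbi" option, here is a minimal dynamic-programming sketch. It is not the code in wtpsplit/utils/constraints.py: the function name `viterbi_cuts`, the logit scoring, and the `short_penalty` constant are assumptions made for illustration. It takes per-character boundary probabilities, treats max_length as a hard limit and min_length as a soft (penalized) preference, and returns the end offsets of the best segmentation.

```python
import math


def viterbi_cuts(probs, max_length, min_length=1, short_penalty=10.0):
    """Return segment end offsets (exclusive) that maximize the total cut score.

    probs[i] is the (hypothetical) probability of a boundary after character i.
    max_length is enforced strictly; segments shorter than min_length are only
    penalized, mirroring the "best effort" wording in the table above.
    """
    if max_length < 1:
        raise ValueError("max_length must be >= 1")
    n = len(probs)

    def logit(p):
        # Clamp to avoid log(0); a cut is rewarded when p > 0.5.
        p = min(max(p, 1e-9), 1 - 1e-9)
        return math.log(p) - math.log(1 - p)

    best = [-math.inf] * (n + 1)  # best[e]: best score of a segmentation of text[:e]
    back = [0] * (n + 1)          # back[e]: start offset of the last segment
    best[0] = 0.0
    for end in range(1, n + 1):
        # The cut at the very end of the text is forced, so it scores 0.
        cut_score = 0.0 if end == n else logit(probs[end - 1])
        for start in range(max(0, end - max_length), end):
            score = best[start] + cut_score
            if end - start < min_length:
                score -= short_penalty
            if score > best[end]:
                best[end], back[end] = score, start

    # Backtrack from the end of the text to recover the chosen cuts.
    cuts, end = [], n
    while end > 0:
        cuts.append(end)
        end = back[end]
    return cuts[::-1]


probs = [0.01] * 10
probs[2], probs[6] = 0.9, 0.8
print(viterbi_cuts(probs, max_length=10))               # [3, 7, 10]
print(viterbi_cuts(probs, max_length=3, min_length=2))  # [3, 5, 7, 10]
```

In the second call no boundary between offsets 3 and 7 is likely, but max_length=3 forces one, and the min_length penalty places it in the middle. A greedy variant would instead commit to the best cut inside each window from left to right, which is cheaper but can miss such globally better placements.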

Prior Functions

| Prior | Best For | Key Parameters |
|---|---|---|
| "uniform" (default) | Just enforce max_length | |
| "gaussian" | Prefer target length | target_length, spread |
| "lognormal" | Natural distribution | target_length, spread (0.3-0.7) |
| "clipped_polynomial" | Hard enforcement | target_length, spread |

Language-Aware Defaults (70+ languages)

Automatic target_length/spread based on language:

  • East Asian (zh, ja, ko): shorter (45-55 chars)
  • Germanic (de, nl, en): medium-long (75-90 chars)
  • Romance/Slavic (fr, es, ru): medium-long (78-85 chars)

# Auto-applies when using LoRA with language
sat = SaT("sat-3l", style_or_domain="ud", language="de")
sat.split(text, max_length=150, prior_type="gaussian")  # German defaults

# Or explicit
sat.split(text, max_length=100, prior_type="gaussian", prior_kwargs={"lang_code": "zh"})
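
To make target_length and spread more concrete, here is a rough sketch of how priors of these three shapes could score a candidate segment length. The formulas are common textbook forms chosen for illustration, and the function names are hypothetical; the actual definitions in wtpsplit/utils/priors.py may differ (for example in normalization).

```python
import math


def gaussian_prior(length, target_length, spread):
    """Symmetric bell curve around target_length; spread acts like a std dev in characters."""
    z = (length - target_length) / spread
    return math.exp(-0.5 * z * z)


def lognormal_prior(length, target_length, spread):
    """Bell curve in log-length space; spread acts like a log-space std dev (e.g. 0.3-0.7)."""
    if length <= 0:
        return 0.0
    z = (math.log(length) - math.log(target_length)) / spread
    return math.exp(-0.5 * z * z)


def clipped_polynomial_prior(length, target_length, spread):
    """Inverted parabola that drops to exactly 0 once |length - target_length| >= spread."""
    z = (length - target_length) / spread
    return max(0.0, 1.0 - z * z)


# Illustrative values only: a target of 80 characters, roughly in the Germanic range above.
for length in (40, 80, 120):
    print(
        length,
        round(gaussian_prior(length, 80, 20), 3),
        round(lognormal_prior(length, 80, 0.5), 3),
        round(clipped_polynomial_prior(length, 80, 20), 3),
    )
```

Under these assumed forms, the lognormal prior is skewed: a segment 40 characters longer than the target scores higher than one 40 characters shorter, whereas the Gaussian treats both the same and the clipped polynomial rejects both outright.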

Text Reconstruction

# With constraints: "".join(segments) == original
# Without constraints: "\n".join(segments) == original
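
The constrained-case invariant above can be checked with a few lines. This is only a sketch: `segments_from_cuts` and `check_round_trip` are hypothetical helpers, not part of wtpsplit, and the cut offsets are hand-picked rather than produced by a model. The point is that slicing at offsets (instead of stripping whitespace) keeps every character, so the segments concatenate back to the original.

```python
def segments_from_cuts(text, cuts):
    """Slice text at the given end offsets without dropping any characters."""
    segments, start = [], 0
    for end in cuts:
        segments.append(text[start:end])
        start = end
    return segments


def check_round_trip(original, segments, max_length=None):
    """True if segments rebuild the original exactly and respect max_length (if given)."""
    if "".join(segments) != original:
        return False
    return max_length is None or all(len(s) <= max_length for s in segments)


text = "First sentence. Second one!  Third?"
segments = segments_from_cuts(text, [16, 29, len(text)])

assert check_round_trip(text, segments, max_length=20)
# Stripping whitespace from the segments would break exact reconstruction.
assert not check_round_trip(text, [s.strip() for s in segments])
```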

🆕 New Files

| File | Lines | Description |
|---|---|---|
| wtpsplit/utils/constraints.py | 494 | Viterbi DP & greedy algorithms, constraint enforcement |
| wtpsplit/utils/priors.py | 198 | Prior functions + 70+ language defaults |
| test_length_constraints.py | 1,164 | 98 test cases |
| length_constrained_segmentation_demo.py | 450 | Interactive demo |
| docs/LENGTH_CONSTRAINTS.md | 288 | Math & implementation docs |

📝 Key Modifications

wtpsplit/__init__.py (+276 lines)

  • New parameters for WtP.split() / SaT.split()
  • Input validation, warnings when threshold ignored with max_length
  • Type hints fixed with from __future__ import annotations
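
As a rough idea of what such validation can look like, here is a small sketch. The helper name `validate_length_args`, the exact checks, and the warning text are assumptions for illustration; the real checks in wtpsplit/__init__.py may be stricter or worded differently.

```python
import warnings
from typing import Optional


def validate_length_args(
    min_length: int = 1,
    max_length: Optional[int] = None,
    threshold: Optional[float] = None,
) -> None:
    """Reject inconsistent length arguments and warn about an ignored threshold."""
    if min_length < 1:
        raise ValueError("min_length must be >= 1")
    if max_length is not None:
        if max_length < min_length:
            raise ValueError("max_length must be >= min_length")
        if threshold is not None:
            # Assumed behavior, following the changelog note about the ignored threshold.
            warnings.warn(
                "threshold is ignored when max_length is set",
                UserWarning,
                stacklevel=2,
            )


validate_length_args(min_length=1, max_length=150)  # passes silently
```

The sketch uses `Optional[int]` so that it runs on any Python 3 version; writing the annotation as `int | None` on Python 3.9 needs `from __future__ import annotations` at the top of the module.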

wtpsplit/utils/__init__.py

  • Bug fix: from cached_property import ... → from functools import cached_property
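
For reference, the standard-library decorator is a drop-in replacement for the common case. The class below is a made-up example (not taken from wtpsplit) that shows the cached value being computed only once:

```python
from functools import cached_property  # in the standard library since Python 3.8


class Document:
    """Toy example class; the name and attributes are illustrative only."""

    def __init__(self, text):
        self.text = text
        self.computations = 0

    @cached_property
    def n_chars(self):
        # Runs on first access; the result is then stored on the instance.
        self.computations += 1
        return len(self.text)


doc = Document("Hello world.")
doc.n_chars
doc.n_chars
assert doc.computations == 1
```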

README.md (+84 lines)

  • New section: "(NEW! v2.2+) Length-Constrained Segmentation"

🔧 Build & CI Changes

setup.py

python_requires=">=3.9"                    # NEW: drops 3.7, 3.8
"transformers>=4.22.2,<5.0"               # Pinned (v5.0 breaking)
"huggingface-hub<1.0"                     # Pinned (HfFolder removed)
# Removed: "cached_property" (stdlib in 3.9+)

pyproject.toml

  • target-version: py38... → py39, py310, py311, py312
  • [tool.ruff.per-file-ignores] → [tool.ruff.lint.per-file-ignores]

.github/workflows/python.yml

  • Python matrix: removed 3.8, added 3.12
  • Updated to actions/checkout@v4, actions/setup-python@v5
  • ruff → ruff check --target-version=py39

requirements.txt

  • Pinned huggingface-hub==0.25.2
  • Removed cached_property

⚠️ Breaking Changes

| Change | Impact |
|---|---|
| Python ≥3.9 | Drops 3.7, 3.8 |
| transformers <5.0 | v5.0 has breaking API |
| huggingface-hub <1.0 | v1.0 removes HfFolder |

🧪 Tests

| File | Tests |
|---|---|
| test.py | +6 new constraint tests |
| test_length_constraints.py | 98 tests (NEW) |
| Total | 130 tests ✅ |

markus583 avatar Dec 05 '25 04:12 markus583