
Finetune Issues

LeonHammerla opened this issue 2 months ago • 2 comments

I've successfully finetuned the segment-any-text/sat-12l model for a custom task: segmenting text into logical steps. The training went well and produced excellent evaluation results, but I'm encountering an issue where the model appears to ignore the adapter weights during inference and defaults to word-level splitting, even when applying the best-performing threshold from the training log.

Am I missing a crucial step in loading the LoRA weights or applying the threshold correctly?

1. Finetuning Configuration

I based my configuration on the tweet segmentation example. The goal was to teach the model to identify step boundaries instead of sentence/tweet boundaries.

Configuration (config.json equivalent):

{
    "model_name_or_path": "segment-any-text/sat-12l",
    "output_dir": "/path-to/wtpsplit/data",
    "text_path": "/path-to/wtpsplit/data/steps.pth",
    "block_size": 256,
    "eval_stride": 128,
    "do_train": true,
    "do_eval": true,
    "per_device_train_batch_size": 8,
    "per_device_eval_batch_size": 8,
    "gradient_accumulation_steps": 1,
    "eval_accumulation_steps": 8,
    "dataloader_num_workers": 1,
    "preprocessing_num_workers": 1,
    "learning_rate": 3e-4,
    "fp16": false,
    "num_train_epochs": 5,
    "logging_steps": 50,
    "report_to": "wandb",
    "wandb_project": "sentence",
    "save_steps": 100000000,
    "remove_unused_columns": false,
    "one_sample_per_line": true,
    "do_sentence_training": true,
    "do_auxiliary_training": false,
    "warmup_ratio": 0.1,
    "non_punctuation_sample_ratio": null,
    "prediction_loss_only": true,
    "use_auxiliary": true,
    "ddp_timeout": 3600,
    "use_subwords": true,
    "custom_punctuation_file": "punctuation_xlmr_unk.txt",
    "log_level": "warning",
    "adapter_config": "lora[r=16,alpha=32,intermediate_lora=True]",
    "weight_decay": 0.01,
    "auxiliary_remove_prob": 0.0,
    "shuffle": false,
    "train_adapter": true,
    "subsample": 56000
}
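
For reference, the steps.pth file above was created following the dataset format from the repository's adapter-training example; roughly like this (the dataset name and step texts below are placeholders, not my real data):

import torch

# One entry per document; since one_sample_per_line is set, the logical steps
# within each document are separated by newlines. "train_data" under "meta" is
# used for training, the outer "data" list for evaluation.
torch.save(
    {
        "en": {
            "sentence": {
                "steps": {
                    "meta": {
                        "train_data": [
                            "Open the file.\nParse the header.\nWrite the output.",
                        ],
                    },
                    "data": [
                        "Preheat the oven.\nMix the ingredients.\nBake for 20 minutes.",
                    ],
                }
            }
        }
    },
    "/path-to/wtpsplit/data/steps.pth",
)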

2. Evaluation Results

After 5 epochs, the evaluation results were very promising for the task, with a high F1 score on my custom evaluation set:

Evaluation Log:

{
    'eval_neg_steps/en/loss': 0.00049071223475039, 
    'eval_loss': 0.00049071223475039, 
    'eval_neg_steps/en/pr_auc': 0.9616394527440739, 
    'eval_neg_steps/en/f1': 0.9076705425291773, 
    'eval_neg_steps/en/f1_best': 0.9076705424796035, 
    'eval_neg_steps/en/threshold_best': 0.5806168724807065, # <--- BEST THRESHOLD
    'eval_runtime': 767.7237, 
    'eval_samples_per_second': 18.152, 
    'eval_steps_per_second': 2.269, 
    'epoch': 5.0
}

The best threshold found was 0.5806.


3. Inference Issue

When attempting to load the model and perform inference, the output defaults to splitting the text at the word level, which suggests the model is not using the LoRA weights or is ignoring the segmentation logic entirely.

Inference Code:

from functools import lru_cache
from wtpsplit import SaT  # SaT is the Segment any Text wrapper class

@lru_cache(maxsize=1)
def load_seg_model(model_base: str, lora_path: str, language: str):
    return SaT(model_base,
               lora_path=lora_path,
               # language=language, # Commented out, but tried both ways
               )


if __name__ == "__main__":
    # BP is base path
    m = load_seg_model(model_base="sat-12l",
                       lora_path=f"{BP}/data/steps/en", # Path to the finetuned adapter weights
                       language="en"
                       )
    m.half().to("cuda")
    
    # Trying the best threshold from evaluation log
    test_text = "This is the first step. Next comes the second action. Finally, we finish the task."
    print(*m.split(test_text, threshold=0.5806), sep="\n")

Expected Output:

This is the first step.
Next comes the second action.
Finally, we finish the task.

Actual Output (illustrative example - the model defaults to word-level splitting):

This
is
the
first
step.
Next
comes
the
second
action.
Finally,
we
finish
the
task.

LeonHammerla · Oct 23 '25

Hi,

Sorry about the late response, I just came back from a conference. I looked into your issue and I'm not fully sure about the exact issue but I have a few ideas:

  1. I realize there may be a small bug in how these values are reported: the sigmoid is applied once inside precision_recall_curve and then again via sigmoid(p) in the logging code. I can't verify this right now due to lack of hardware, but you can try simply applying a threshold of 0.33 instead (i.e., the logged 0.58 with the second sigmoid reversed; see the short sketch after this list) - does this produce the desired results?
  2. It may also be related to how we monkey-patch adapter compatibility between the package versions I used during training (see requirements.txt) and the latest versions. You can try setting SaT(..., merge_lora=False), which I just pushed in the latest version, 2.1.7.
  3. In general, ensure you are training the model with the exact package versions from requirements.txt.
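
For 1., reversing the extra sigmoid just means taking the logit of the logged threshold. A quick sketch of the conversion (assuming the second sigmoid is indeed the culprit):

import math

# Undo the suspected extra sigmoid on the logged best threshold: if 0.5806 was
# passed through sigmoid once too often, the value to use at inference is its logit.
logged_threshold = 0.5806
raw_threshold = math.log(logged_threshold / (1 - logged_threshold))
print(raw_threshold)  # ≈ 0.325, i.e. roughly the 0.33 suggested above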

I suspect it is 1. - please let me know what the culprit is! Hope that helps.

markus583 · Nov 19 '25

Plus, you should also set the threshold explicitly when doing inference with custom LoRA modules:

sentences = model.split(texts, threshold=0.33)

Otherwise, only the default threshold will be applied (see L810 in __init__.py).
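
Putting both suggestions together, the inference call would look roughly like this (untested on my end; the model name and adapter path are taken from your snippet):

from wtpsplit import SaT

# Load the base model plus the custom LoRA adapter without merging it into the
# base weights (merge_lora=False, available since 2.1.7), then pass the
# corrected threshold explicitly when splitting.
m = SaT("sat-12l", lora_path="/path-to/wtpsplit/data/steps/en", merge_lora=False)
m.half().to("cuda")

test_text = "This is the first step. Next comes the second action. Finally, we finish the task."
print(*m.split(test_text, threshold=0.33), sep="\n")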

markus583 · Nov 19 '25