text-generation-inference
`parameters.stop` only works if the stop sequences align with token boundaries
System Info
TGI Version: huggingface/text-generation-inference:1.0.3
(official docker image)
Model: abacaj/starcoderbase-1b-sft
OS: Ubuntu 20.04
GPU: RTX 3090 (runpod.io)
Information
- [X] Docker
- [ ] The CLI directly
Tasks
- [X] An officially supported command
- [ ] My own modifications
Reproduction
docker run --gpus all --shm-size 1g -p 8080:80 -v $PWD/data:/data ghcr.io/huggingface/text-generation-inference:1.0.3 --model-id abacaj/starcoderbase-1b-sft
`Par` works:
curl http://localhost:8080 \
-X POST \
-d '{"inputs": "The capital city of France", "parameters":{"stop":["Par"]}}}' \
-H "Content-Type: application/json"
# [{"generated_text":"The capital city of France is Par"}]
`Pari` doesn't work:
curl http://localhost:8080 \
-X POST \
-d '{"inputs": "The capital city of France", "parameters":{"stop":["Pari"]}}}' \
-H "Content-Type: application/json"
# [{"generated_text":"The capital city of France is Paris.\n\nYou are given a string array named cities and an integer array named cap"}]
Expected behavior
Other APIs, such as OpenAI's, guarantee that the stop sequence text will not appear in the output regardless of the details of the underlying tokenizer, and that's what I expected here. So rather than just checking whether tokens (or groups of tokens) match any of the stop sequences, TGI should check against the full recently generated segment of the output, i.e. at character resolution rather than token resolution.
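A minimal sketch of that character-resolution check (illustrative only; the function name is mine, not TGI's):

```python
def find_stop_cut(text: str, stop_sequences: list[str]) -> int | None:
    """Return the index just past the earliest stop sequence in `text`,
    or None if no stop sequence occurs.

    Matching runs over the full decoded string, so a stop sequence is
    found even when it straddles token boundaries.
    """
    best_start = None
    best_end = None
    for seq in stop_sequences:
        idx = text.find(seq)
        if idx != -1 and (best_start is None or idx < best_start):
            best_start, best_end = idx, idx + len(seq)
    return best_end

# Re-check the accumulated text after decoding each new token:
text = "The capital city of France is Paris."
cut = find_stop_cut(text, ["Pari"])
if cut is not None:
    # Cutting just past the match mirrors TGI's current behaviour of
    # including the stop text; use text[:cut - len(seq)] to exclude it.
    text = text[:cut]  # "The capital city of France is Pari"
```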
I posted this in this closed issue, but I don't think it was noticed.
Note that this issue isn't specific to abacaj/starcoderbase-1b-sft. I've replicated it with Llama 2 as well.
Bumped into the same thing. I guess there is a variety of implementations and interpretations of stop sequences out there. I had to do the same thing in Python when manually implementing stop words, removing them from the response.
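For reference, the post-processing I mean looks roughly like this (just a sketch, not my exact code):

```python
def trim_at_stop_words(response: str, stop_words: list[str]) -> str:
    """Cut the response at the earliest occurrence of any stop word,
    excluding the stop word itself (OpenAI-style behaviour)."""
    earliest = len(response)
    for word in stop_words:
        idx = response.find(word)
        if idx != -1:
            earliest = min(earliest, idx)
    return response[:earliest]

print(trim_at_stop_words("The capital city of France is Paris.", ["Pari"]))
# -> "The capital city of France is "
```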
IDEA: I suppose this could either be altered to not include the stop words in the response, or a new option could be introduced. The latter reminds me of the `end_sequences` option in Cohere's endpoint reference:
> The generated text will be cut at the beginning of the earliest occurrence of an end sequence. The sequence will be excluded from the text.
@flexchar I like the solution of having two different options, as you've shown there. I don't like the idea of a breaking change to how `stop` works.
The main thing I'm actually concerned about here, though, is that `stop` doesn't currently work, i.e. I think the first priority is to fix the bug that causes `stop` to trigger only when a sequence aligns with token boundaries.
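One wrinkle for any character-level fix: with streaming, the server also needs to hold back text that could still grow into a stop sequence, or partial stop text leaks to the client. A sketch of that hold-back logic (illustrative; the function name is made up):

```python
def split_for_streaming(buffer: str, stop_sequences: list[str]) -> tuple[str, str]:
    """Split `buffer` into (text safe to emit now, text to hold back).

    Assumes a full-match check has already cut the buffer if a complete
    stop sequence occurred. Holds back the longest suffix of the buffer
    that is a proper prefix of some stop sequence, so partial stop text
    never reaches a streaming client.
    """
    hold = 0
    for seq in stop_sequences:
        # Longest suffix of `buffer` that is a proper prefix of `seq`.
        for k in range(min(len(seq) - 1, len(buffer)), 0, -1):
            if buffer.endswith(seq[:k]):
                hold = max(hold, k)
                break
    return buffer[: len(buffer) - hold], buffer[len(buffer) - hold :]

emit, held = split_for_streaming("The capital city of France is Par", ["Pari"])
# emit == "The capital city of France is ", held == "Par"
```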
I've also been bitten by this.
I'm not an expert, but would it make sense to implement "token healing", as described by the Guidance project? https://github.com/guidance-ai/guidance/blob/main/notebooks/token_healing.ipynb
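For concreteness, here's a toy sketch of the token-healing idea using a made-up vocabulary (not Guidance's actual implementation):

```python
# Toy vocabulary standing in for a real tokenizer's vocab.
VOCAB = ["http", ":", "://", " Par", "is", " Paris"]

def heal(prompt_tokens: list[str]) -> tuple[list[str], list[str]]:
    """Back up the last prompt token and return (trimmed prompt, tokens
    allowed as the first generated token).

    The first sampled token is constrained to start with the text of the
    removed token, so a boundary the user typed mid-token gets 'healed'.
    """
    *rest, last = prompt_tokens
    allowed = [t for t in VOCAB if t.startswith(last)]
    return rest, allowed

# A prompt ending in "http:" tokenizes as ["http", ":"]; healing removes
# the ":" and restricts the first generated token to ones beginning with
# ":" (here ":" itself or "://").
print(heal(["http", ":"]))  # -> (['http'], [':', '://'])
```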
I have a suspicion (but no data, of course) that this is a common issue that impacts the quality of model outputs, but just not badly enough that anyone notices enough to fix it.
edit - I just realized this post is describing a slightly different problem, where the stop characters (not tokens) don't align with token boundaries. So a different issue, I suppose.
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
(Not stale.)