
bug: Output text from CompletionChunk differs from tokenizer.decode output

jeffwang0516 opened this issue on Dec 23, 2023 · 7 comments

Describe the bug

I've recently been trying to use a fine-tuned version of Llama 2 that supports Traditional Chinese: https://huggingface.co/yentinglin/Taiwan-LLM-7B-v2.1-chat

The output text from CompletionChunk seems to have an encoding issue. If I decode the generated token_ids directly with tokenizer.decode, the output is fine.

To reproduce

Here's how to reproduce the issue:

import asyncio

import openllm

llm = openllm.LLM('yentinglin/Taiwan-LLM-7B-v2.1-chat')
prompt = '你是一個人工智慧助理</s>USER: 東北季風如何影響台灣氣候?</s>ASSISTANT:'

async def generate(prompt, **attrs):
    return await llm.generate(prompt, **attrs)

output = asyncio.run(generate(prompt))
out1 = output.outputs[0]

# Text as returned in the CompletionChunk.
print("Output:", out1.text)

# Text obtained by decoding the generated token_ids directly.
print(
    "Output from decoding token_ids directly:",
    llm.tokenizer.decode(
        out1.token_ids,
        skip_special_tokens=True,
        spaces_between_special_tokens=False,
        clean_up_tokenization_spaces=True,
    ),
)

Output:

$ BACKEND=pt python reproduce.py
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:04<00:00,  2.12s/it]
Output: 東北��風������的������使得台��的��候������。 ������������風��會������大風和������,有時會������山區��石流和��水。
Output from decoding token_ids directly: 東北季風帶來的降雨使得台灣的氣候溼潤。 這種季風還會帶來大風和暴雨,有時會導致山區泥石流和洪水。����風還會帶來大風和暴雨,有時會導致山區泥石流和洪水。

Logs

No response

Environment

Environment variable

BENTOML_DEBUG=''
BENTOML_QUIET=''
BENTOML_BUNDLE_LOCAL_BUILD=''
BENTOML_DO_NOT_TRACK=''
BENTOML_CONFIG=''
BENTOML_CONFIG_OPTIONS=''
BENTOML_PORT=''
BENTOML_HOST=''
BENTOML_API_WORKERS=''

System information

bentoml: 1.1.10
python: 3.8.10
platform: Linux-5.4.0-169-generic-x86_64-with-glibc2.29
uid_gid: 1000:1000

pip_packages
accelerate==0.25.0
aiohttp==3.9.1
aiosignal==1.3.1
anyio==4.2.0
appdirs==1.4.4
asgiref==3.7.2
async-timeout==4.0.3
attrs==23.1.0
bentoml==1.1.10
bitsandbytes==0.41.3.post2
build==0.10.0
cattrs==23.1.2
certifi==2023.11.17
charset-normalizer==3.3.2
circus==0.18.0
click==8.1.7
click-option-group==0.5.6
cloudpickle==3.0.0
coloredlogs==15.0.1
contextlib2==21.6.0
cuda-python==12.3.0
datasets==2.15.0
deepmerge==1.1.1
Deprecated==1.2.14
dill==0.3.7
distlib==0.3.8
distro==1.8.0
einops==0.7.0
exceptiongroup==1.2.0
fastcore==1.5.29
filelock==3.9.0
frozenlist==1.4.1
fs==2.4.16
fsspec==2023.12.2
ghapi==1.0.4
h11==0.14.0
httpcore==1.0.2
httpx==0.26.0
huggingface-hub==0.20.1
humanfriendly==10.0
idna==3.6
importlib-metadata==6.11.0
inflection==0.5.1
Jinja2==3.1.2
markdown-it-py==3.0.0
MarkupSafe==2.1.3
mdurl==0.1.2
mpmath==1.3.0
multidict==6.0.4
multiprocess==0.70.15
mypy-extensions==1.0.0
networkx==3.0
numpy==1.24.4
nvidia-ml-py==11.525.150
openllm==0.4.41
openllm-client==0.4.41
openllm-core==0.4.41
opentelemetry-api==1.20.0
opentelemetry-instrumentation==0.41b0
opentelemetry-instrumentation-aiohttp-client==0.41b0
opentelemetry-instrumentation-asgi==0.41b0
opentelemetry-sdk==1.20.0
opentelemetry-semantic-conventions==0.41b0
opentelemetry-util-http==0.41b0
optimum==1.16.1
orjson==3.9.10
packaging==23.2
pandas==2.0.3
pathspec==0.12.1
pip-requirements-parser==32.0.1
pip-tools==7.3.0
platformdirs==4.1.0
prometheus-client==0.19.0
psutil==5.9.7
pyarrow==14.0.2
pyarrow-hotfix==0.6
pygments==2.17.2
pyparsing==3.1.1
pyproject-hooks==1.0.0
python-dateutil==2.8.2
python-json-logger==2.0.7
python-multipart==0.0.6
pytz==2023.3.post1
PyYAML==6.0.1
pyzmq==25.1.2
regex==2023.10.3
requests==2.31.0
rich==13.7.0
safetensors==0.4.1
schema==0.7.5
scipy==1.10.1
sentencepiece==0.1.99
simple-di==0.1.5
six==1.16.0
sniffio==1.3.0
starlette==0.34.0
sympy==1.12
tokenizers==0.15.0
tomli==2.0.1
torch==2.1.0+cu121
tornado==6.4
tqdm==4.66.1
transformers==4.36.2
triton==2.1.0
typing-extensions==4.4.0
tzdata==2023.3
urllib3==2.1.0
uvicorn==0.25.0
virtualenv==20.25.0
watchfiles==0.21.0
wrapt==1.16.0
xxhash==3.4.1
yarl==1.9.4
zipp==3.17.0

System information (Optional)

  • transformers version: 4.36.2
  • Platform: Linux-5.4.0-169-generic-x86_64-with-glibc2.29
  • Python version: 3.8.10
  • Huggingface_hub version: 0.20.1
  • Safetensors version: 0.4.1
  • Accelerate version: 0.25.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.1.0+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

jeffwang0516 · Dec 23, 2023

Hi there, thanks for creating the issue.

Do you have vllm available locally?

aarnphm · Dec 23, 2023

Hi,

I'm still not able to run this model with the vLLM backend due to insufficient GPU memory (a T4 with 16 GB doesn't seem to be enough).

After some research, I think the root cause is that a single complete Chinese character may be split across multiple generated tokens, so decoding to text on every generation iteration is not reliable for Chinese.
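
For illustration, here is a minimal sketch (assuming the same Hugging Face tokenizer can be loaded via AutoTokenizer) of how one character can span several byte-fallback tokens, so decoding token by token yields the U+FFFD replacement character:

from transformers import AutoTokenizer

# Illustrative only: load the same tokenizer the model uses.
tokenizer = AutoTokenizer.from_pretrained('yentinglin/Taiwan-LLM-7B-v2.1-chat')

# A character outside the base vocabulary may be encoded with byte-fallback
# tokens, so one character can map to several token ids.
token_ids = tokenizer.encode('東北季風', add_special_tokens=False)

# Decoding each token on its own can produce incomplete UTF-8 sequences,
# which render as the replacement character "�".
print([tokenizer.decode([tid]) for tid in token_ids])

# Decoding the whole sequence at once reconstructs the text correctly.
print(tokenizer.decode(token_ids))

Decoding after every generated token therefore produces the broken output shown above, while decoding the accumulated token_ids at the end is fine.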

jeffwang0516 · Dec 24, 2023

Sounds like an issue orthogonal to OpenLLM?

aarnphm · Dec 24, 2023

For the PyTorch backend, it is related to OpenLLM's PyTorchRunnable implementation. It needs some way to detect incomplete characters on each generation step, probably something like what the text-generation-inference server does here OR what the transformers TextStreamer does here.

If the vLLM backend already handles this, then OpenLLM should be fine there. But I'm not able to verify that at the moment.
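
For reference, a rough sketch of that incremental-detokenization idea (loosely modeled on the text-generation-inference approach; the function name and offset handling here are illustrative, not OpenLLM's actual API). The idea is to only emit new text once it no longer ends in an incomplete character:

def decode_incremental(tokenizer, all_token_ids, prefix_offset, read_offset):
    # Decode the previously emitted window and the window extended with the
    # newly generated tokens, then compare the two.
    prefix_text = tokenizer.decode(
        all_token_ids[prefix_offset:read_offset], skip_special_tokens=True
    )
    new_text = tokenizer.decode(
        all_token_ids[prefix_offset:], skip_special_tokens=True
    )
    if len(new_text) > len(prefix_text) and not new_text.endswith('\ufffd'):
        # At least one complete character was produced: emit only the delta
        # and advance the offsets.
        return new_text[len(prefix_text):], read_offset, len(all_token_ids)
    # Otherwise we are mid-character (the decode ends with U+FFFD), so hold
    # the text back until more tokens arrive.
    return '', prefix_offset, read_offset

Each call returns the newly completed text (possibly empty) plus updated offsets, so the runner streams only whole characters.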

jeffwang0516 · Dec 25, 2023

I tried to fix the problem with the text-generation-inference server approach (related issue: https://github.com/huggingface/text-generation-inference/issues/333). Please have a look, thanks!

jeffwang0516 · Dec 25, 2023

FYI, I found that vLLM has also fixed this issue using the text-generation-inference approach, in this PR: https://github.com/vllm-project/vllm/pull/984

jeffwang0516 · Dec 26, 2023

I will take a look at incremental detokenization for the PyTorch backend.

aarnphm · Dec 27, 2023

Closing for OpenLLM 0.6.

bojiang · Jul 12, 2024