Llama 3/3.1 70B Outputting "!!!!!!"; Shorter Context

Open · mallorbc opened this issue 6 months ago · 5 comments

System Info

text-generation-launcher --env

2024-07-26T03:39:42.960734Z  INFO text_generation_launcher: Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.79.0
Commit sha: 3905f854ed49b0bc50e6c983d3e6b254fcf02288
Docker label: sha-3905f85
nvidia-smi:
Fri Jul 26 03:39:42 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC  |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M.  |
|                                         |                      |               MIG M.  |
|=========================================+======================+=======================|
|   0  NVIDIA GeForce RTX 3090        Off | 00000000:0E:00.0 Off |                  N/A  |
| 49%   62C    P2             113W / 350W |  22668MiB / 24576MiB |      0%      Default  |
|                                         |                      |                  N/A  |
+-----------------------------------------+----------------------+-----------------------+
|   1  NVIDIA GeForce RTX 3090        Off | 00000000:0F:00.0 Off |                  N/A  |
| 30%   53C    P2             102W / 350W |  21924MiB / 24576MiB |      0%      Default  |
|                                         |                      |                  N/A  |
+-----------------------------------------+----------------------+-----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                             GPU Memory |
|        ID   ID                                                              Usage      |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+

xpu-smi: N/A

2024-07-26T03:39:42.960780Z  INFO text_generation_launcher: Args {
    model_id: "bigscience/bloom-560m",
    revision: None,
    validation_workers: 2,
    sharded: None,
    num_shard: None,
    quantize: None,
    speculate: None,
    dtype: None,
    trust_remote_code: false,
    max_concurrent_requests: 128,
    max_best_of: 2,
    max_stop_sequences: 4,
    max_top_n_tokens: 5,
    max_input_tokens: None,
    max_input_length: None,
    max_total_tokens: None,
    waiting_served_ratio: 0.3,
    max_batch_prefill_tokens: None,
    max_batch_total_tokens: None,
    max_waiting_tokens: 20,
    max_batch_size: None,
    cuda_graphs: None,
    hostname: "4e90e37e133c",
    port: 80,
    shard_uds_path: "/tmp/text-generation-server",
    master_addr: "localhost",
    master_port: 29500,
    huggingface_hub_cache: Some("/root/.cache"),
    weights_cache_override: None,
    disable_custom_kernels: false,
    cuda_memory_fraction: 1.0,
    rope_scaling: None,
    rope_factor: None,
    json_output: false,
    otlp_endpoint: None,
    otlp_service_name: "text-generation-inference.router",
    cors_allow_origin: [],
    watermark_gamma: None,
    watermark_delta: None,
    ngrok: false,
    ngrok_authtoken: None,
    ngrok_edge: None,
    tokenizer_config_path: None,
    disable_grammar_support: false,
    env: true,
    max_client_batch_size: 4,
    lora_adapters: None,
    disable_usage_stats: false,
    disable_crash_reports: false,
}

I am using Ubuntu 22.04

I have two RTX 3090s.

I am using the latest docker images as of 7/25/24.

Output from pip list (Package  Version):

accelerate 0.29.3
aiohttp 3.9.5
aiosignal 1.3.1
annotated-types 0.7.0
archspec 0.2.3
async-timeout 4.0.3
attrs 23.2.0
bitsandbytes 0.43.2
boltons 24.0.0
Brotli 1.1.0
certifi 2024.7.4
cffi 1.16.0
charset-normalizer 3.3.2
click 8.1.7
cloudpickle 3.0.0
colorama 0.4.6
conda 24.5.0
conda-libmamba-solver 24.1.0
conda-package-handling 2.2.0
conda_package_streaming 0.9.0
datasets 2.20.0
Deprecated 1.2.14
dill 0.3.8
diskcache 5.6.3
distro 1.9.0
einops 0.6.1
filelock 3.15.4
frozendict 2.4.4
frozenlist 1.4.1
fsspec 2024.5.0
gmpy2 2.1.5
googleapis-common-protos 1.63.2
grpc-interceptor 0.15.4
grpcio 1.65.1
grpcio-reflection 1.62.2
grpcio-status 1.62.2
grpcio-tools 1.62.2
hf_transfer 0.1.8
huggingface-hub 0.23.5
idna 3.7
importlib_metadata 7.1.0
interegular 0.3.3
Jinja2 3.1.4
joblib 1.4.2
jsonpatch 1.33
jsonpointer 2.4
jsonschema 4.23.0
jsonschema-specifications 2023.12.1
lark 1.1.9
libmambapy 1.5.8
llvmlite 0.43.0
loguru 0.6.0
mamba 1.5.8
MarkupSafe 2.1.5
menuinst 2.0.2
mpmath 1.3.0
multidict 6.0.5
multiprocess 0.70.16
mypy-protobuf 3.6.0
nest-asyncio 1.6.0
networkx 3.3
numba 0.60.0
numpy 1.26.4
nvidia-nccl-cu12 2.22.3
opentelemetry-api 1.25.0
opentelemetry-exporter-otlp 1.25.0
opentelemetry-exporter-otlp-proto-common 1.25.0
opentelemetry-exporter-otlp-proto-grpc 1.25.0
opentelemetry-exporter-otlp-proto-http 1.25.0
opentelemetry-instrumentation 0.46b0
opentelemetry-instrumentation-grpc 0.46b0
opentelemetry-proto 1.25.0
opentelemetry-sdk 1.25.0
opentelemetry-semantic-conventions 0.46b0
outlines 0.0.34
packaging 24.1
pandas 2.2.2
peft 0.10.0
pillow 10.4.0
pip 24.0
platformdirs 4.2.0
pluggy 1.4.0
prometheus_client 0.20.0
protobuf 4.25.3
psutil 6.0.0
py-cpuinfo 9.0.0
pyarrow 17.0.0
pyarrow-hotfix 0.6
pycosat 0.6.6
pycparser 2.22
pydantic 2.8.2
pydantic_core 2.20.1
PySocks 1.7.1
python-dateutil 2.9.0.post0
pytz 2024.1
PyYAML 6.0.1
referencing 0.35.1
regex 2024.5.15
requests 2.32.3
rpds-py 0.19.1
ruamel.yaml 0.18.6
ruamel.yaml.clib 0.2.8
safetensors 0.4.3
scipy 1.13.1
sentencepiece 0.1.99
setuptools 71.1.0
six 1.16.0
sympy 1.13.0
text-generation-server 2.0.5.dev0
texttable 1.7.0
tokenizers 0.19.1
torch 2.4.0
tqdm 4.66.4
transformers 4.43.1
triton 3.0.0
truststore 0.8.0
typer 0.6.1
types-protobuf 5.27.0.20240626
typing_extensions 4.12.2
tzdata 2024.1
urllib3 2.2.2
wheel 0.43.0
wrapt 1.16.0
xxhash 3.4.1
yarl 1.9.4
zipp 3.19.2
zstandard 0.22.0

Information

  • [X] Docker
  • [ ] The CLI directly

Tasks

  • [X] An officially supported command
  • [ ] My own modifications

Reproduction

Run TGI like the following: --model-id meta-llama/Meta-Llama-3.1-70B-Instruct --huggingface-hub-cache /root/.cache/huggingface/hub --trust-remote-code --max-input-length 2047 --max-total-tokens 2048 --quantize bitsandbytes-nf4
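
For completeness, the full Docker invocation looks roughly like this (the image tag, shared-memory size, port/volume mapping, and token variable below are specific to my setup and may need adjusting; the launcher flags are exactly the ones above):

docker run --gpus all --shm-size 1g -p 8080:80 \
  -v $HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub \
  -e HF_TOKEN=$HF_TOKEN \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Meta-Llama-3.1-70B-Instruct \
  --huggingface-hub-cache /root/.cache/huggingface/hub \
  --trust-remote-code \
  --max-input-length 2047 \
  --max-total-tokens 2048 \
  --quantize bitsandbytes-nf4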

Query the model and notice that, a high percentage of the time (though not always), the output is a run of "!" characters like "!!!!".
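
For example, hitting TGI's standard /generate endpoint (the prompt, host, and port below are just placeholders from my setup) frequently returns a generated_text made up of repeated "!":

curl 127.0.0.1:8080/generate \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "What is deep learning?", "parameters": {"max_new_tokens": 64}}'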

Also, I used to be able to run Llama 3 70B models with a 5k context on dual 3090s. Now I cannot run the Llama 2 70B models without errors most of the time, and they won't even fully load unless I drop the context window to something like 4k.
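
For a rough sense of scale (my own back-of-the-envelope, assuming the published Llama 3 70B config of 80 layers, 8 KV heads, and head dim 128 with an fp16 KV cache; TGI's actual reservation for prefill and CUDA graphs will be larger):

# KV-cache bytes per token = 2 (K and V) * layers * kv_heads * head_dim * 2 bytes (fp16)
echo $(( 2 * 80 * 8 * 128 * 2 ))          # 327680 bytes, ~0.31 MiB per token
echo $(( 2 * 80 * 8 * 128 * 2 * 5120 ))   # ~1.6 GiB for a 5k context

So going from ~4k to ~5k of context should only cost a few hundred MiB of cache, which is why the smaller usable context looks like a regression rather than a memory limit I'm hitting.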

Expected behavior

I expect the model to produce normal output, and I expect the usable context window to grow or at least stay the same across updates, not shrink.

mallorbc · Jul 26 '24 03:07