Llama 3/3.1 70B Outputting "!!!!!!"; Shorter Context
System Info
```
text-generation-launcher --env
2024-07-26T03:39:42.960734Z  INFO text_generation_launcher: Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.79.0
Commit sha: 3905f854ed49b0bc50e6c983d3e6b254fcf02288
Docker label: sha-3905f85
nvidia-smi:
Fri Jul 26 03:39:42 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090        Off | 00000000:0E:00.0 Off |                  N/A |
| 49%   62C    P2            113W / 350W  |  22668MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090        Off | 00000000:0F:00.0 Off |                  N/A |
| 30%   53C    P2            102W / 350W  |  21924MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
xpu-smi: N/A
2024-07-26T03:39:42.960780Z  INFO text_generation_launcher: Args {
    model_id: "bigscience/bloom-560m",
    revision: None,
    validation_workers: 2,
    sharded: None,
    num_shard: None,
    quantize: None,
    speculate: None,
    dtype: None,
    trust_remote_code: false,
    max_concurrent_requests: 128,
    max_best_of: 2,
    max_stop_sequences: 4,
    max_top_n_tokens: 5,
    max_input_tokens: None,
    max_input_length: None,
    max_total_tokens: None,
    waiting_served_ratio: 0.3,
    max_batch_prefill_tokens: None,
    max_batch_total_tokens: None,
    max_waiting_tokens: 20,
    max_batch_size: None,
    cuda_graphs: None,
    hostname: "4e90e37e133c",
    port: 80,
    shard_uds_path: "/tmp/text-generation-server",
    master_addr: "localhost",
    master_port: 29500,
    huggingface_hub_cache: Some(
        "/root/.cache",
    ),
    weights_cache_override: None,
    disable_custom_kernels: false,
    cuda_memory_fraction: 1.0,
    rope_scaling: None,
    rope_factor: None,
    json_output: false,
    otlp_endpoint: None,
    otlp_service_name: "text-generation-inference.router",
    cors_allow_origin: [],
    watermark_gamma: None,
    watermark_delta: None,
    ngrok: false,
    ngrok_authtoken: None,
    ngrok_edge: None,
    tokenizer_config_path: None,
    disable_grammar_support: false,
    env: true,
    max_client_batch_size: 4,
    lora_adapters: None,
    disable_usage_stats: false,
    disable_crash_reports: false,
}
```
I am using Ubuntu 22.04
I have two RTX 3090s.
I am using the latest Docker images as of 7/25/24.
Output from `pip list`:

```
Package                                  Version
accelerate                               0.29.3
aiohttp                                  3.9.5
aiosignal                                1.3.1
annotated-types                          0.7.0
archspec                                 0.2.3
async-timeout                            4.0.3
attrs                                    23.2.0
bitsandbytes                             0.43.2
boltons                                  24.0.0
Brotli                                   1.1.0
certifi                                  2024.7.4
cffi                                     1.16.0
charset-normalizer                       3.3.2
click                                    8.1.7
cloudpickle                              3.0.0
colorama                                 0.4.6
conda                                    24.5.0
conda-libmamba-solver                    24.1.0
conda-package-handling                   2.2.0
conda_package_streaming                  0.9.0
datasets                                 2.20.0
Deprecated                               1.2.14
dill                                     0.3.8
diskcache                                5.6.3
distro                                   1.9.0
einops                                   0.6.1
filelock                                 3.15.4
frozendict                               2.4.4
frozenlist                               1.4.1
fsspec                                   2024.5.0
gmpy2                                    2.1.5
googleapis-common-protos                 1.63.2
grpc-interceptor                         0.15.4
grpcio                                   1.65.1
grpcio-reflection                        1.62.2
grpcio-status                            1.62.2
grpcio-tools                             1.62.2
hf_transfer                              0.1.8
huggingface-hub                          0.23.5
idna                                     3.7
importlib_metadata                       7.1.0
interegular                              0.3.3
Jinja2                                   3.1.4
joblib                                   1.4.2
jsonpatch                                1.33
jsonpointer                              2.4
jsonschema                               4.23.0
jsonschema-specifications                2023.12.1
lark                                     1.1.9
libmambapy                               1.5.8
llvmlite                                 0.43.0
loguru                                   0.6.0
mamba                                    1.5.8
MarkupSafe                               2.1.5
menuinst                                 2.0.2
mpmath                                   1.3.0
multidict                                6.0.5
multiprocess                             0.70.16
mypy-protobuf                            3.6.0
nest-asyncio                             1.6.0
networkx                                 3.3
numba                                    0.60.0
numpy                                    1.26.4
nvidia-nccl-cu12                         2.22.3
opentelemetry-api                        1.25.0
opentelemetry-exporter-otlp              1.25.0
opentelemetry-exporter-otlp-proto-common 1.25.0
opentelemetry-exporter-otlp-proto-grpc   1.25.0
opentelemetry-exporter-otlp-proto-http   1.25.0
opentelemetry-instrumentation            0.46b0
opentelemetry-instrumentation-grpc       0.46b0
opentelemetry-proto                      1.25.0
opentelemetry-sdk                        1.25.0
opentelemetry-semantic-conventions       0.46b0
outlines                                 0.0.34
packaging                                24.1
pandas                                   2.2.2
peft                                     0.10.0
pillow                                   10.4.0
pip                                      24.0
platformdirs                             4.2.0
pluggy                                   1.4.0
prometheus_client                        0.20.0
protobuf                                 4.25.3
psutil                                   6.0.0
py-cpuinfo                               9.0.0
pyarrow                                  17.0.0
pyarrow-hotfix                           0.6
pycosat                                  0.6.6
pycparser                                2.22
pydantic                                 2.8.2
pydantic_core                            2.20.1
PySocks                                  1.7.1
python-dateutil                          2.9.0.post0
pytz                                     2024.1
PyYAML                                   6.0.1
referencing                              0.35.1
regex                                    2024.5.15
requests                                 2.32.3
rpds-py                                  0.19.1
ruamel.yaml                              0.18.6
ruamel.yaml.clib                         0.2.8
safetensors                              0.4.3
scipy                                    1.13.1
sentencepiece                            0.1.99
setuptools                               71.1.0
six                                      1.16.0
sympy                                    1.13.0
text-generation-server                   2.0.5.dev0
texttable                                1.7.0
tokenizers                               0.19.1
torch                                    2.4.0
tqdm                                     4.66.4
transformers                             4.43.1
triton                                   3.0.0
truststore                               0.8.0
typer                                    0.6.1
types-protobuf                           5.27.0.20240626
typing_extensions                        4.12.2
tzdata                                   2024.1
urllib3                                  2.2.2
wheel                                    0.43.0
wrapt                                    1.16.0
xxhash                                   3.4.1
yarl                                     1.9.4
zipp                                     3.19.2
zstandard                                0.22.0
```
Information
- [X] Docker
- [ ] The CLI directly
Tasks
- [X] An officially supported command
- [ ] My own modifications
Reproduction
Run TGI with the following arguments:

```
--model-id meta-llama/Meta-Llama-3.1-70B-Instruct \
--huggingface-hub-cache /root/.cache/huggingface/hub \
--trust-remote-code \
--max-input-length 2047 \
--max-total-tokens 2048 \
--quantize bitsandbytes-nf4
```
Query the model and notice that, a high percentage of the time (but not always), the output is "!!!!".
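To quantify how often the bad outputs occur, here is a minimal sketch of a query loop against TGI's `/generate` REST endpoint. The URL, prompt, and the `looks_degenerate` helper are my own placeholders, not part of the original report:

```python
import json
import urllib.request

def looks_degenerate(text, threshold=0.5):
    """Heuristic: flag outputs that are mostly '!' characters."""
    if not text:
        return False
    return text.count("!") / len(text) >= threshold

def query_tgi(prompt, url="http://localhost:8080/generate"):
    """Send one request to a running TGI server and return the generated text."""
    payload = json.dumps({
        "inputs": prompt,
        "parameters": {"max_new_tokens": 64},
    }).encode()
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["generated_text"]

if __name__ == "__main__":
    # Count how many of 10 identical queries come back as '!!!!' spam.
    bad = sum(
        looks_degenerate(query_tgi("What is the capital of France?"))
        for _ in range(10)
    )
    print(f"{bad}/10 degenerate responses")
```

Running this a few times makes the pseudo-random nature of the failure easy to see, since the rate varies between runs.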
Also, I used to be able to run Llama 3 70B models with 5k context on dual 3090s. Now I cannot run the Llama 2 70B models without errors most of the time, and they won't even fully load unless I drop the context window to something like 4k.
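For context on why a few thousand tokens matter here, a back-of-the-envelope KV-cache calculation, assuming the published Llama-2/3 70B architecture values (80 layers, 8 KV heads via GQA, head dim 128) and fp16 cache entries; this is a sketch, not TGI's actual allocator:

```python
# Rough KV-cache sizing for a Llama-70B-class model (fp16 cache assumed).

def kv_bytes_per_token(n_layers=80, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    """Bytes of KV cache one token occupies: K and V, across all layers."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

def kv_gib(context_tokens):
    """Total KV-cache size in GiB for a given context length, batch size 1."""
    return kv_bytes_per_token() * context_tokens / 1024**3

if __name__ == "__main__":
    for ctx in (2048, 4096, 5000):
        print(f"{ctx:>5} tokens -> {kv_gib(ctx):.2f} GiB KV cache")
```

At roughly 0.31 MiB per token, 5k context costs about 1.5 GiB of cache on top of the quantized weights, so even a modest regression in weight or workspace memory usage can force the context window down on 24 GiB cards.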
Expected behavior
I expect to be able to use the model normally, and the usable context window should expand or stay the same across updates, not shrink.
This just happened with the 8B model too. I am thinking it may have something to do with bitsandbytes, but I am not sure.
This happens without quantization as well. Still pseudo-random.
Rebooting sometimes helps. Maybe it's a hardware issue.
Interesting. I haven't seen this issue with 8B on an A10G or 405B on H100s. Would be curious to know whether it's indeed a hardware issue.
It didn't happen on previous versions, so if it is hardware related, it's either a recent development or a newly introduced bug.
I am also having an issue with output quality; the results are not as good as Llama-3-70B.