thies1006
Hello! I'm generating texts with blender_3B like this (all options are default, except `model_parallel=False`):

```python
agent = create_agent(opt, requireModelExists=True)
agent_copies = []
agent_copies.append(agent.clone())
agent_copies.append(agent.clone())  # comment this out for 2nd try
```
...
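The cloning pattern above can be sketched generically. The `ToyAgent` class below is hypothetical, standing in for the ParlAI agent: the point it illustrates is that a `clone()` implemented as a shallow copy gives each copy its own bookkeeping while sharing the single heavyweight model in memory.

```python
import copy

class ToyAgent:
    """Hypothetical stand-in for a ParlAI agent whose clone() shares model weights."""

    def __init__(self, model):
        self.model = model   # heavyweight shared state (e.g. model weights)
        self.history = []    # per-copy conversational state

    def clone(self):
        # Shallow copy: the clone references the same self.model object,
        # but gets an independent history list.
        new = copy.copy(self)
        new.history = []
        return new

agent = ToyAgent(model={"weights": [0.1, 0.2]})
agent_copies = [agent.clone(), agent.clone()]

# All copies share one underlying model, so memory use stays flat.
assert all(c.model is agent.model for c in agent_copies)
```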
I'm running the inference script `bloom-ds-inference.py` by invoking `deepspeed --num_gpus 1 ~/Megatron-DeepSpeed/scripts/inference/bloom-ds-inference.py --name bigscience/bloom-1b3 --benchmark`, but I change the generation arguments to `generate_kwargs = dict(max_new_tokens=num_tokens, do_sample=False, use_cache=False)` (adding the `use_cache` option)...
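For reference, the modified generation arguments described above can be built as a plain dict (the value of `num_tokens` here is a placeholder; the benchmark script sets it):

```python
num_tokens = 100  # placeholder; set by the benchmark script

# Greedy decoding with the KV cache disabled, as described above
generate_kwargs = dict(
    max_new_tokens=num_tokens,  # how many tokens to generate beyond the prompt
    do_sample=False,            # greedy decoding, no sampling
    use_cache=False,            # recompute attention keys/values each step
)
```

Disabling `use_cache` forces the model to recompute past key/value states at every decoding step, which is much slower but changes the memory profile.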
### System Info

```shell
optimum @ git+https://github.com/huggingface/optimum.git@3347a0a75f18b854979dd7e9f78d4c3ebb92852a
transformers==4.21.3
onnxruntime==1.12.1
```

### Who can help?

@lewtun, @michaelbenayoun

### Information

- [ ] The official example scripts
- [ ] My own...
**Describe the bug** Following #2547 I tried to run the model gpt-neoxt-chat-base-20b, which I believe is a NeoX-20B derivative and should therefore work. Inference works if the model...
Hello! I noticed that when sending requests to my server via gRPC, the two counter metrics `ts_inference_latency_microseconds` and `ts_queue_latency_microseconds` are not reported via the metrics API (`curl http://IP:PORT/metrics`). However I do...
### System Info

NCCL version: 2.19.3+cuda12.0
TensorRT-LLM version: 0.11.0.dev2024052100
Ubuntu 22.04

### Who can help?

@byshiue

### Information

- [X] The official example scripts
- [ ] My own modified...
### Your current environment

```text
vllm==0.4.3
numpy==1.26.4
nvidia-nccl-cu12==2.20.5
torch==2.3.0
transformers==4.41.2
triton==2.3.0
```

### 🐛 Describe the bug

I don't know if this is a bug or if the model just doesn't support...
### Your current environment

```text
Collecting environment information...
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.1...
```