
`random_seed` seems to be ignored (or at least inconsistent) for inflight_batcher_llm

Open dyoshida-continua opened this issue 1 year ago • 4 comments

System Info

I've converted Llama 3 using TensorRT-LLM's convert_checkpoint script, and am serving it with the inflight_batcher_llm template. I'm trying to get diverse samples for a fixed input, but I've found that if I make several requests concurrently, some of them return identical outputs.

I'm setting top_p=1, top_k=1024, temperature=1.0, beam_width=1, and generating a unique random seed for each request. The requests are being made over the gRPC API, and I'm using v0.9.0 of TensorRT-LLM and tensorrtllm_backend.
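
Roughly, each request is built like this (a minimal sketch, not my exact script; the endpoint, model name, and tensor names/dtypes are assumptions based on the standard ensemble config shipped with tensorrtllm_backend v0.9.0):

```python
# Sketch of a single request with per-request sampling params and seed.
# Endpoint, model name, and tensor names/dtypes are assumptions from the
# standard v0.9.0 ensemble config -- adjust for your deployment.
import numpy as np
import tritonclient.grpc as grpcclient
from tritonclient.utils import np_to_triton_dtype


def _tensor(name: str, data: np.ndarray) -> grpcclient.InferInput:
    t = grpcclient.InferInput(name, data.shape, np_to_triton_dtype(data.dtype))
    t.set_data_from_numpy(data)
    return t


def generate(prompt: str, seed: int, url: str = "localhost:8001") -> str:
    client = grpcclient.InferenceServerClient(url)
    inputs = [
        _tensor("text_input", np.array([[prompt]], dtype=object)),
        _tensor("max_tokens", np.array([[128]], dtype=np.int32)),
        _tensor("top_k", np.array([[1024]], dtype=np.int32)),
        _tensor("top_p", np.array([[1.0]], dtype=np.float32)),
        _tensor("temperature", np.array([[1.0]], dtype=np.float32)),
        _tensor("beam_width", np.array([[1]], dtype=np.int32)),
        # A distinct seed per request; random_seed is uint64 in the config.
        _tensor("random_seed", np.array([[seed]], dtype=np.uint64)),
    ]
    result = client.infer("ensemble", inputs)
    out = result.as_numpy("text_output").flatten()[0]
    return out.decode("utf-8", errors="replace") if isinstance(out, bytes) else str(out)
```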

Who can help?

@byshiue

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [X] My own task or dataset (give details below)

Reproduction

  1. Serve a model (essentially following this guide, with a few settings changed: https://developer.nvidia.com/blog/turbocharging-meta-llama-3-performance-with-nvidia-tensorrt-llm-and-nvidia-triton-inference-server/)
  2. Make 5 gRPC requests concurrently (see the sketch below)
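
The concurrent part of step 2 looks roughly like this (a sketch assuming the hypothetical `generate` helper from the snippet above; each request gets its own seed, and `generate` opens a fresh client per call, so gRPC client thread-safety isn't a factor):

```python
# Fire 5 requests concurrently, each with a unique seed, then compare outputs.
# Assumes the `generate(prompt, seed)` helper sketched earlier in this issue.
import random
from concurrent.futures import ThreadPoolExecutor

prompt = "Hello, my name is"
seeds = random.sample(range(1, 1_000_000), 5)

with ThreadPoolExecutor(max_workers=5) as pool:
    outputs = list(pool.map(lambda s: generate(prompt, s), seeds))

for seed, text in zip(seeds, outputs):
    print(f"seed={seed}: {text[:80]!r}")
print(f"{len(set(outputs))} distinct outputs out of {len(outputs)}")
```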

Expected behavior

I expect each request with a different seed to yield a different response.

Actual behavior

Several of the 5 responses are consistently identical.

Additional notes

I changed my test script to wait for each response before sending the next request, and with that change all 5 outputs are distinct, so the concurrency/inflight batching really does seem to be the problem.
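
The serial variant of the same test (again assuming the `generate` helper and the `prompt`/`seeds` from the sketches above) is just:

```python
# One request at a time with the same seeds -- this gave 5 distinct outputs.
outputs = [generate(prompt, seed) for seed in seeds]
print(f"{len(set(outputs))} distinct outputs out of {len(outputs)}")
```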

dyoshida-continua avatar May 21 '24 23:05 dyoshida-continua

Another interesting detail is that the identical sequences I observe in the concurrent case are the same from run to run, even though I'm sampling the random seed from 1 to 1,000,000.

For example, with the input <|begin_of_text|>Hello, my name is, I saw the continuation "Ahmed, and I am an experienced Software Engineer with proficiency..." in 3/5 responses on one run, and then in 2/5 responses on the next run. I did not observe this prefix at all when making requests serially.

dyoshida-continua avatar May 21 '24 23:05 dyoshida-continua

@byshiue I incorrectly typed your name when opening this issue originally. Can you comment on whether there's a workaround for this? It's currently making batch inference effectively useless.

dyoshida-continua avatar Jun 05 '24 22:06 dyoshida-continua

@dyoshida-continua I applied the solution described in this pull request: NVIDIA/TensorRT-LLM#1742, and it resolved the issue for me.

chiendb97 avatar Jun 07 '24 06:06 chiendb97

Thank you for the reply, @chiendb97. Since https://github.com/NVIDIA/TensorRT-LLM/pull/1742 is related to a fix of the random seed setting, it might be related to your issue, @dyoshida-continua. Could you give it a try?

byshiue avatar Jun 07 '24 06:06 byshiue