Sid

13 comments by Sid

I've also gotten much worse accuracy using torch.compile; here's a repro (for a text-to-speech model). With the following changes, the final speech output is much worse. I get the same degraded results when...

StyleTTS2 is a model for generating speech from text. It doesn't currently use torch.compile in any capacity. I tried modifying it by adding a single torch.compile call to a small part...

@ezyang Hey, just wanted to follow up on this. Any ideas on what I could do on my end to help you guys investigate this further?

Any updates on this? It would be great to see the full speedup from this feature https://github.com/NVIDIA/TensorRT-LLM/issues/317#issuecomment-1810841752

The following builds, including `--enable_xqa disable`, all had the same issue. Is there an example that uses `--use_fp8_context_fmha enable` that I can reference to verify my build setup is correct?...
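For reference, a build invocation of the sort being discussed might look like the sketch below. Only the `--use_fp8_context_fmha` and `--enable_xqa` flags come from the comments here; the checkpoint/output paths and the presence of any other flags are illustrative assumptions, not a verified working configuration:

```shell
# Illustrative trtllm-build invocation; paths are placeholders, and only
# the --use_fp8_context_fmha / --enable_xqa flags are taken from the
# discussion above.
trtllm-build \
    --checkpoint_dir ./llama2_7b_fp8_ckpt \
    --output_dir ./engine_out \
    --use_fp8_context_fmha enable \
    --enable_xqa disable
```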

@PerkzZheng thanks for pointing out the tests. I got unrelated runtime errors with run.py, but the summarize.py output looks correct. For reference, I'm using this model in the following tests...

Thank you for the update! @PerkzZheng @kaiyux Unfortunately I'm still getting the same issue where outputs for concurrent requests are bad. The following info is from a Llama2 7B model...

I still get the same issue with that command. Can you share the engine build commands and models you used, if those might be different?

@PerkzZheng thanks, I got good outputs using the exact same commands you listed. But I got bad outputs when I tweaked the commands for tp=2. Tensor parallelism might be the...

It seems that some TP builds with certain inputs cause bad outputs. Below are different model and TP builds, each tested with 3 different inputs. I've also listed the outputs...