Sid

13 comments by Sid

I've also gotten much worse accuracy using torch.compile; here's a repro (for a text-to-speech model). With the following changes, the final speech output is much worse. I get the same degraded results when...

StyleTTS2 is a model for generating speech from text. It doesn't currently use torch.compile in any capacity. I tried modifying it by adding a single torch.compile call to a small part...

@ezyang Hey, just wanted to follow up on this. Any ideas on what I could do on my end to help you guys investigate this further?

Any updates on this? It would be great to see the full speedup from this feature https://github.com/NVIDIA/TensorRT-LLM/issues/317#issuecomment-1810841752

The following builds, including `--enable_xqa disable`, all had the same issue. Is there an example that uses `--use_fp8_context_fmha enable` that I can reference to verify my build setup is correct?...
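For reference, a build invocation of the sort being discussed might look like the sketch below. Only the `--use_fp8_context_fmha` and `--enable_xqa` flags come from the comments here; the checkpoint/output paths and the presence of any other flags are illustrative assumptions, not a verified working configuration:

```shell
# Illustrative trtllm-build invocation; paths are placeholders, and only
# the --use_fp8_context_fmha / --enable_xqa flags are taken from the
# discussion above.
trtllm-build \
    --checkpoint_dir ./llama2_7b_fp8_ckpt \
    --output_dir ./engine_out \
    --use_fp8_context_fmha enable \
    --enable_xqa disable
```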

@PerkzZheng thanks for pointing out the tests. I got unrelated runtime errors with run.py, but the summarize.py output looks correct. For reference, I'm using this model in the following tests...

Thank you for the update! @PerkzZheng @kaiyux Unfortunately I'm still getting the same issue where outputs for concurrent requests are bad. The following info is from a Llama2 7B model...

I still get the same issue with that command. Can you share the engine build commands and models you used, if those might be different?

@PerkzZheng thanks, I got good outputs using the exact same commands you listed. But I got bad outputs when I tweaked the commands for tp=2. Tensor parallelism might be the...

It seems that some TP builds with certain inputs cause bad outputs. Below are different model and TP builds, each tested with 3 different inputs. I've also listed the outputs...