[torchbench] `cm3leon_generate` inference running significantly slower than inductor.
## 🐛 Bug
Repro:

```bash
python xla/benchmarks/experiment_runner.py \
    --suite-name torchbench --accelerator cuda \
    --xla PJRT --dynamo None --test eval \
    --no-resume --print-subprocess \
    -k cm3leon_generate
```
| benchmark | compilation time (s) | iteration time (s) |
|---|---|---|
| cm3leon_generate | 868 (850) | ~19 (1.65) |

(Inductor times are shown in parentheses.)
## Environment
- Reproducible on XLA backend [CPU/TPU/CUDA]: CUDA
- torch_xla version: a5692c206
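For completeness, a quick way to confirm the builds under test; this is just the standard `__version__` attributes, which may be formatted differently in custom builds:

```python
# Print the PyTorch and torch_xla builds being benchmarked.
import torch
import torch_xla

print('torch:', torch.__version__)
print('torch_xla:', torch_xla.__version__)  # built from commit a5692c206 above
```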
cc @miladm
Here's an odd thing I've noticed: there's some compilation taking place in the second iteration:
| iteration | time (s) | CompileTime count |
|---|---|---|
| 1 | 887 | 1526 |
| 2 | 200 | 415 |
| 3 | 19 | 0 |
After the second iteration, I observe no more compilation. Inductor, in contrast, does not seem to have this issue: its second-iteration timing looks identical to all subsequent iterations.
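For reference, here is a minimal sketch of how per-iteration CompileTime counts like the ones above can be gathered with torch_xla's debug metrics API. The `model`/`inputs` names are placeholders standing in for whatever the benchmark harness builds (they are not from the harness itself), and the exact shape of `met.metric_data()`'s return value may vary between torch_xla versions:

```python
# Minimal sketch: time each inference iteration and count how many XLA
# compilations it triggered. Assumes `model` and `inputs` already live on
# an XLA device.
import time

import torch_xla.core.xla_model as xm
import torch_xla.debug.metrics as met

for i in range(3):
    met.clear_all()  # reset counters/metrics so each iteration is isolated
    start = time.perf_counter()
    output = model(*inputs)
    xm.mark_step()        # cut the lazy-tensor graph and dispatch it to XLA
    xm.wait_device_ops()  # block until the device finishes, for honest timing
    elapsed = time.perf_counter() - start

    data = met.metric_data('CompileTime')  # None if nothing was compiled
    compiles = data[0] if data is not None else 0
    print(f'iteration {i + 1}: {elapsed:.1f}s, CompileTime count = {compiles}')
```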