Issues by James Whedbee



James Whedbee

Speculative decoding slows model down, possibly from "skipping cudagraphs due to ['mutated inputs']"?

### Some context

I am using AMD MI100 GPUs, and I can get ~33 tokens/second for Llama 2 70B using:
- compile
- tensor parallelism of 8
- int8 quantization...