
[rfc][dont merge] Use the skip_guard_eval stance to remove torch.compile guard overhead

Open anijain2305 opened this issue 1 year ago • 3 comments

An example showing how to use the skip_guard_eval stance. It increases throughput from 160 tok/sec to 174 tok/sec on an A100.
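The workflow the PR proposes is: warm up the compiled function with guards enabled so every code path has been compiled, then switch to the stance to skip guard evaluation on subsequent calls. A minimal sketch, assuming the `torch.compiler.set_stance` API and the stance name `"skip_guard_eval_unsafe"` that later nightlies adopted (the exact name was still in flux in this issue); the workload is a placeholder:

```python
import importlib.util

def run_with_stance():
    import torch

    @torch.compile
    def f(x):
        return x * 2 + 1

    # Warmup: every dynamic shape / code path must be seen at least once,
    # so the right compiled artifacts exist before guards are skipped.
    f(torch.randn(4))

    # After warmup, skip guard evaluation to cut per-call dispatch overhead.
    with torch.compiler.set_stance("skip_guard_eval_unsafe"):
        return f(torch.randn(4))

# Only attempt the sketch when torch is importable; older builds reject
# the stance (as this issue's traceback shows), so failures are tolerated.
if importlib.util.find_spec("torch") is not None:
    try:
        run_with_stance()
    except Exception:
        pass  # sketch only; stance unavailable on this torch build
```

Note the warmup step is essential: skipping guard evaluation is only safe once all recompilation triggers have already fired.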

anijain2305 avatar Oct 28 '24 05:10 anijain2305

Thanks @anijain2305! How can I test it? I tried with the nightly (2.6.0.dev20241027+cu121) but I get

RuntimeError: invalid torch.compile stance 'DynamoStance(stance='skip_guard_eval', backend=None)'

On a separate note, I am getting a huge speed-up (171 tokens/sec -> 186 tokens/sec) by manually using CUDA graphs instead of relying on reduce-overhead.

mobicham avatar Oct 28 '24 08:10 mobicham
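The manual CUDA graph approach mentioned above follows the standard capture-then-replay pattern from the public `torch.cuda.CUDAGraph` / `torch.cuda.graph` API. A hedged sketch; the model, shapes, and warmup count are illustrative placeholders, not the actual hqq setup:

```python
import importlib.util

def capture_and_replay():
    import torch

    model = torch.nn.Linear(16, 16).cuda()
    static_in = torch.randn(8, 16, device="cuda")

    # Warm up on a side stream so capture sees steady-state allocator behavior.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        for _ in range(3):
            model(static_in)
    torch.cuda.current_stream().wait_stream(s)

    # Capture one forward pass into a graph; tensors used here become static.
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        static_out = model(static_in)

    # Replay: copy fresh data into the static input buffer, then replay.
    # This skips Python dispatch and kernel-launch overhead entirely.
    static_in.copy_(torch.randn(8, 16, device="cuda"))
    g.replay()
    return static_out

# Only run when torch with a CUDA device is available; otherwise the
# function above stands as a sketch of the pattern.
if importlib.util.find_spec("torch") is not None:
    try:
        import torch
        if torch.cuda.is_available():
            capture_and_replay()
    except Exception:
        pass  # sketch only; environment without a usable GPU
```

Because the whole forward pass replays as one recorded graph, this can beat `mode="reduce-overhead"` when the compiled region is fragmented by graph breaks.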


@mobicham The stance is not ready for use. I am trying to gather feedback from torch.compile developers on the stance. If I see positive signs, I will work on it. It will take some time before this is ready.

anijain2305 avatar Oct 28 '24 16:10 anijain2305

Understood @anijain2305, thank you!

mobicham avatar Oct 28 '24 17:10 mobicham