Update 4_compile.py
Haven't run this yet, but I included 2 important caveats that will have a big performance impact. Will run this next and see what else we can debug.
Awesome, thanks for jumping in here. Would love to get some insights on how to improve that. I should mention that I used CUDA 11.8.
Let me try the sample batch idea!
Ah, your batch size is also quite small, so it might be best to try torch.compile(m, mode="reduce-overhead"), which will automatically enable CUDA graphs for you.
I recently added some docs to make most of this clearer: https://pytorch.org/docs/master/compile/index.html
Thanks, I tried that originally and it didn't really help :(.
Yeah, it should be used in combination with tensor cores.
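For reference, here is a minimal sketch of what that combination might look like. It assumes "tensor cores" means allowing TF32 matmuls through torch.set_float32_matmul_precision("high"); the thread does not say which setting was meant. The helper names and the small stand-in model are hypothetical and are not taken from 4_compile.py.

```python
import torch
import torch.nn as nn


def compile_for_small_batches(model: nn.Module):
    """Hypothetical helper: pair the tensor-core setting with reduce-overhead."""
    # Assumption: let float32 matmuls use TF32 tensor cores where the GPU has them.
    torch.set_float32_matmul_precision("high")
    # "reduce-overhead" asks the compiler to use CUDA graphs where it can,
    # which mainly helps when batches are small and launch overhead dominates.
    return torch.compile(model, mode="reduce-overhead")


def run_demo():
    """Hypothetical demo: time nothing, just exercise the compiled model."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    # Stand-in model, only for illustration.
    model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 8)).to(device)
    compiled = compile_for_small_batches(model)
    x = torch.randn(4, 64, device=device)  # deliberately small batch
    with torch.no_grad():
        # The first few calls include compilation and warm-up,
        # so they should be excluded from any benchmark.
        for _ in range(3):
            out = compiled(x)
    return out.shape
```

Calling run_demo() triggers the actual compilation, so any speedup should be measured only after the warm-up iterations. Whether this helps for a given model is something to confirm by benchmarking.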