nanoGPT
Remove @torch.jit.script decorator when compiling the model?
Hey there, should we remove the @torch.jit.script decorator for the fused_gelu activation function if we compile the model? I benchmarked both and found the inference time to be substantially faster without the decorator. This was tested on an NVIDIA A6000 card. Thanks!
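For context, here is a minimal sketch of the pattern being discussed, i.e. the tanh-approximation GELU with the decorator applied (the exact code in model.py may differ slightly):

import math
import torch

@torch.jit.script  # fuses the elementwise ops when NOT using torch.compile
def fused_gelu(x):
    # GELU activation, tanh approximation (as used in GPT-2)
    return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * torch.pow(x, 3.0))))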
I benchmarked both and found the decorator to speed things up, but this was before I added torch.compile. Are you using torch.compile?
wow, with torch.compile I am seeing a big difference in the opposite direction, probably as you're seeing as well. with @torch.jit.script: 138ms / iter, without: 118ms / iter
torch.compile is a better compiler than the now legacy torch.jit.script.
yes, I'm using torch.compile! it's a huge difference!
got it, i'm struggling a bit with the best way to fix because i'd like to keep torch.compile optional for now. looks like we have to propagate whether compile=True through the config now and script the module conditionally. looking into...
Compiling is a training option, so passing that flag into the model code would make a huge mess (spaghetti code), but the model needs that flag to decide whether to script the activation function. Leaving the decorator in model.py is suboptimal based on @soumith's comment.
Unfortunately you have to make a tough decision...
Personally, I'm removing the decorator and expecting the compiler to improve over time. Who wants to work on legacy code anyway :)
Agree it seems really gross to propagate and keep track of a boolean for whether I intend to later torch.compile the model. I do think there are legitimate uses for not compiling the model because it does add latency. E.g. I compile for long training runs, but I skip compile for simple debugging runs, or even evaluations (e.g. you'll note sample.py doesn't compile by default). All that said yes leaning to just comment it out...
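For illustration only, the "compile is optional at the call site" pattern might look roughly like this in a training script; the stand-in model and flag name here are hypothetical:

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(64, 64), nn.GELU(), nn.Linear(64, 64))  # stand-in for the GPT model

compile_model = True  # e.g. True for long training runs, False for quick debugging or sampling/eval
if compile_model:
    # adds compile latency up front, faster iterations afterwards
    model = torch.compile(model)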
makes sense. One additional comment regarding your training code:
I'd suggest putting all the parameters inside an argparse.Namespace() object; it's just a lot cleaner than polluting the global namespace.
i.e.
import argparse
import multiprocessing

import torch

# keep all hyperparameters in one object instead of module-level globals
args = argparse.Namespace(
    seed=42,
    number_threads=multiprocessing.cpu_count(),
    use_cuda=torch.cuda.is_available(),
    cudnn_deterministic=False,
    device=torch.device("cuda:0" if torch.cuda.is_available() else "cpu"),
)
You're not wrong and it is very tempting but also I've been down this path before. Next thing you know I'll be using argparse, then I'll switch to yaml files, then I'll graduate to Hydra, and then I'll get so upset dealing with its internals, docs, errors and updates that I'll decide it's too complicated and come back to where we are :D
addressed in https://github.com/karpathy/nanoGPT/commit/177d5f7dc5f44d6f373cd7767c2a9259d740436e thanks @vgoklani , closing shortly.
Should also try:
torch.compile(model, mode="max-autotune")
There is a slight performance gain, which should help for long training runs...
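For reference, the two call forms being compared; the mode names are part of the public torch.compile API, but the actual speedup will vary by model and GPU:

import torch
import torch.nn as nn

model = nn.Linear(64, 64)  # stand-in module

# default mode: shorter compile time, good general speedup
fast_model = torch.compile(model)

# max-autotune: longer compile time, autotunes kernels for potentially higher throughput
fastest_model = torch.compile(model, mode="max-autotune")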
max-autotune will get a lot faster and memory pathologies will be fixed, once this PR goes in: https://github.com/pytorch/pytorch/pull/91575
for compile-times in general (and especially for max-autotune), we need to figure out internet-wide compile caches. That'd be pretty nice. It's TBD though because it's not hard to write the caching layer, but hosting it means taking on a bunch of liability for other people's graphs. We'll have to figure it out, maybe convince huggingface or someone to host it.