nanoGPT
Remove @torch.jit.script decorator when compiling the model?
Hey there, should we remove the @torch.jit.script decorator for the fused_gelu activation function if we compile the model? I benchmarked both and found the inference time to be substantially faster without the decorator. This was tested on an NVIDIA A6000 card. Thanks!
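For context, here is a minimal sketch of the pattern being discussed, i.e. the tanh-approximation GELU with the decorator applied (the exact code in model.py may differ slightly):

import math
import torch

@torch.jit.script  # fuses the elementwise ops when NOT using torch.compile
def fused_gelu(x):
    # GELU activation, tanh approximation (as used in GPT-2)
    return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * torch.pow(x, 3.0))))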
I benchmarked both and found the decorator to speed things up, but this was before I added torch.compile. Are you using torch.compile?
wow, with torch.compile I am seeing a big difference in the opposite direction, probably as you're seeing as well. with @torch.jit.script: 138ms / iter, without: 118ms / iter
torch.compile is a better compiler than the now legacy torch.jit.script.
yes, I'm using torch.compile! it's a huge difference!
got it, i'm struggling a bit with the best way to fix because i'd like to keep torch.compile optional for now. looks like we have to propagate whether compile=True through the config now and script the module conditionally. looking into...
Compiling is a training option, so passing that flag into the model code would make a huge mess (spaghetti code), but the model needs that flag to decide whether to script the activation function. Leaving the decorator in model.py is suboptimal based on @soumith's comment.
Unfortunately you have to make a tough decision...
Personally, I'm removing the decorator and expecting the compiler to improve over time. Who wants to work on legacy code anyway :)
Agree it seems really gross to propagate and keep track of a boolean for whether I intend to later torch.compile the model. I do think there are legitimate uses for not compiling the model because it does add latency. E.g. I compile for long training runs, but I skip compile for simple debugging runs, or even evaluations (e.g. you'll note sample.py doesn't compile by default). All that said yes leaning to just comment it out...
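For illustration only, the "compile is optional at the call site" pattern might look roughly like this in a training script; the stand-in model and flag name here are hypothetical:

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(64, 64), nn.GELU(), nn.Linear(64, 64))  # stand-in for the GPT model

compile_model = True  # e.g. True for long training runs, False for quick debugging or sampling/eval
if compile_model:
    # adds compile latency up front, faster iterations afterwards
    model = torch.compile(model)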
makes sense. One additional comment regarding your training code:
I'd suggest putting all the parameters inside an argparse.Namespace() object; it's just a lot cleaner than polluting the global namespace.
i.e.
import argparse
import multiprocessing

import torch

# keep all hyperparameters in one object instead of module-level globals
args = argparse.Namespace(
    seed=42,
    number_threads=multiprocessing.cpu_count(),
    use_cuda=torch.cuda.is_available(),
    cudnn_deterministic=False,
    device=torch.device("cuda:0" if torch.cuda.is_available() else "cpu"),
)
You're not wrong and it is very tempting but also I've been down this path before. Next thing you know I'll be using argparse, then I'll switch to yaml files, then I'll graduate to Hydra, and then I'll get so upset dealing with its internals, docs, errors and updates that I'll decide it's too complicated and come back to where we are :D
addressed in https://github.com/karpathy/nanoGPT/commit/177d5f7dc5f44d6f373cd7767c2a9259d740436e thanks @vgoklani , closing shortly.
Should also try:
torch.compile(model, mode="max-autotune")
There is a slight performance gain, which should help for long training runs...
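For reference, the two call forms being compared; the mode names are part of the public torch.compile API, but the actual speedup will vary by model and GPU:

import torch
import torch.nn as nn

model = nn.Linear(64, 64)  # stand-in module

# default mode: shorter compile time, good general speedup
fast_model = torch.compile(model)

# max-autotune: longer compile time, autotunes kernels for potentially higher throughput
fastest_model = torch.compile(model, mode="max-autotune")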
max-autotune will get a lot faster and memory pathologies will be fixed, once this PR goes in: https://github.com/pytorch/pytorch/pull/91575
for compile-times in general (and especially for max-autotune), we need to figure out internet-wide compile caches. That'd be pretty nice. It's TBD though because it's not hard to write the caching layer, but hosting it means taking on a bunch of liability for other people's graphs. We'll have to figure it out, maybe convince huggingface or someone to host it.