llm.c
A small speedup is possible with a simple modification.
In file train_gpt2.py:
You can replace the line return 0.5 * input * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (input + 0.044715 * torch.pow(input, 3.0)))) with return 0.5 * input * (1.0 + torch.tanh(0.7978845608028654 * (input + 0.044715 * torch.pow(input, 3.0))))
as math.sqrt(2.0 / math.pi) is approximately equal to 0.7978845608028654. Note that torch.tanh itself has to stay, since its argument depends on input; only the scalar sqrt(2/pi) can be folded into a constant.
More instances can be found if the code is scanned carefully. This line alone replaces a divide and a square root (many cycles each on x64 and ARM) with a single precomputed constant, though since these are scalar Python operations evaluated once per forward call rather than per tensor element, the saving is modest.
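A minimal sketch of the change, assuming the line comes from the tanh-approximation GELU forward in train_gpt2.py (the standalone function names here are illustrative, not the repo's own):

```python
import math
import torch

# Fold the scalar once at import time; tanh must remain a runtime call
# because its argument depends on the input tensor.
SQRT_2_OVER_PI = math.sqrt(2.0 / math.pi)  # ~0.7978845608028654

def gelu_original(input):
    # tanh-approximation GELU as written in train_gpt2.py
    return 0.5 * input * (1.0 + torch.tanh(
        math.sqrt(2.0 / math.pi) * (input + 0.044715 * torch.pow(input, 3.0))))

def gelu_folded(input):
    # identical math, with sqrt(2/pi) replaced by the precomputed constant
    return 0.5 * input * (1.0 + torch.tanh(
        SQRT_2_OVER_PI * (input + 0.044715 * torch.pow(input, 3.0))))

x = torch.randn(4, 8)
assert torch.allclose(gelu_original(x), gelu_folded(x))
```

The two versions agree to floating-point precision, so the substitution is safe; the win is avoiding the per-call divide and square root, not the tanh.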
The line in question is here. Any ideas on the speedup / reduction in the number of operations?