ng-video-lecture
Shouldn't we be dividing when normalizing QK^T, not multiplying?
In the code below, the query-key dot product is normalized by multiplying by the square root of the head size:
https://github.com/karpathy/ng-video-lecture/blob/52201428ed7b46804849dea0b3ccf0de9df1a5c3/gpt.py#L83
Shouldn't we be dividing instead, as in the original paper ("Attention Is All You Need")?

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
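One thing worth checking on the linked line is the sign of the exponent: multiplying by `head_size ** -0.5` is the same as dividing by `sqrt(head_size)`, which matches the paper's 1/sqrt(d_k) scaling. A minimal sketch in plain Python, assuming a hypothetical `head_size` of 16 and a made-up score value (neither is taken from the repository):

```python
import math

head_size = 16  # hypothetical head dimension, chosen only for illustration
score = 8.0     # made-up raw query-key dot product

# Multiplying by d_k^(-1/2): note the negative exponent.
scaled_by_multiplying = score * head_size ** -0.5

# Dividing by sqrt(d_k), as written in the paper.
scaled_by_dividing = score / math.sqrt(head_size)

# What multiplying by sqrt(d_k) (positive exponent) would give instead.
scaled_wrong_way = score * head_size ** 0.5

print(scaled_by_multiplying, scaled_by_dividing, scaled_wrong_way)
```

If the exponent in the code is negative, the two forms agree and only the notation differs from the paper. If it is positive, the scores would be scaled up rather than down.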