Tyler Kastner

Results: 1 issue by Tyler Kastner

In the code below, the query-key dot product is normalized by multiplying by the square root of the head size:

https://github.com/karpathy/ng-video-lecture/blob/52201428ed7b46804849dea0b3ccf0de9df1a5c3/gpt.py#L83

Should we not be dividing instead? As seen in...
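To make the question concrete, here is a minimal numerical sketch. It is not the repository's code. It assumes the query and key entries are independent with zero mean and unit variance, and the names `score_variance` and `head_size` are hypothetical, chosen only for this illustration. Under that assumption the raw dot product has variance roughly equal to the head size, so dividing by the square root of the head size should bring the variance back to about 1, while multiplying should inflate it further.

```python
import random
import statistics


def score_variance(head_size, scale, trials=2000, seed=0):
    """Empirical variance of scale * (q . k).

    Assumption for this sketch: q and k are vectors of independent
    standard-normal entries (zero mean, unit variance).
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(head_size)]
        k = [rng.gauss(0.0, 1.0) for _ in range(head_size)]
        scores.append(scale * sum(a * b for a, b in zip(q, k)))
    return statistics.pvariance(scores)


head_size = 16  # hypothetical value, chosen only for the demonstration

# Same seed each time, so the three runs see identical q and k samples.
raw = score_variance(head_size, 1.0)                      # no scaling
divided = score_variance(head_size, head_size ** -0.5)    # q.k / sqrt(head_size)
multiplied = score_variance(head_size, head_size ** 0.5)  # q.k * sqrt(head_size)

print(f"raw:        {raw:.2f}")
print(f"divided:    {divided:.2f}")
print(f"multiplied: {multiplied:.2f}")
```

If the assumption holds, the divided variant keeps the scores fed into the softmax at roughly unit variance regardless of head size, whereas the multiplied variant grows with the square of the head size and would push the softmax toward near one-hot outputs.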