Tyler Kastner
In the code below, the query-key dot product is normalized by multiplying by the square root of the head size: https://github.com/karpathy/ng-video-lecture/blob/52201428ed7b46804849dea0b3ccf0de9df1a5c3/gpt.py#L83 Should we not be dividing instead? As seen in...
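To make the question concrete, here is a minimal sketch of how the scaling direction affects the spread of the attention scores. It is not the code from the linked repository: the helper name `dot_variance`, the choice of `head_size = 16`, and the assumption that query and key components are independent with zero mean and unit variance are all illustrative. Under that assumption the raw dot product has variance of about `head_size`, so dividing by the square root of the head size brings it back to about 1, while multiplying pushes it to about `head_size ** 2`.

```python
import random


def dot_variance(head_size, scale, trials=2000, seed=0):
    """Estimate the variance of scale * (q . k) for random unit-variance q, k.

    Hypothetical helper for illustration only; q and k are drawn as
    independent standard normal vectors, which is an assumption.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(head_size)]
        k = [rng.gauss(0.0, 1.0) for _ in range(head_size)]
        samples.append(scale * sum(a * b for a, b in zip(q, k)))
    mean = sum(samples) / trials
    return sum((s - mean) ** 2 for s in samples) / trials


head_size = 16  # assumed value, chosen only for the demonstration

# No scaling: variance is roughly head_size.
var_raw = dot_variance(head_size, 1.0)
# Dividing by sqrt(head_size): variance is roughly 1.
var_divided = dot_variance(head_size, head_size ** -0.5)
# Multiplying by sqrt(head_size): variance is roughly head_size ** 2.
var_multiplied = dot_variance(head_size, head_size ** 0.5)

print(f"raw:        {var_raw:.2f}")
print(f"divided:    {var_divided:.2f}")
print(f"multiplied: {var_multiplied:.2f}")
```

Scores with a large variance push softmax toward a near one-hot distribution, which is the usual argument for dividing rather than multiplying.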