llama2.c
Implementing an Engine to serve the trained model by running inference
see #346
For each token, the resulting tensors are concatenated onto the cache tensors, which is a costly O(n) operation. Wouldn't it be better to preallocate each cache as [max_new_tokens, ...]? A sketch of what that could look like follows.
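To make the suggestion concrete, here is a minimal C sketch of a preallocated cache. Only `max_new_tokens` comes from the comment above; the `KVCache` struct, its functions, and the `dim` parameter are hypothetical names for illustration, not code from this repo.

```c
#include <stdlib.h>
#include <string.h>

// Hypothetical layout: one flat buffer per cache, sized up front for
// max_new_tokens rows of dim floats each, so no per-token concatenation.
typedef struct {
    float* key_cache;   // (max_new_tokens, dim)
    float* value_cache; // (max_new_tokens, dim)
    int dim;            // per-token vector size (assumed)
    int max_new_tokens; // capacity fixed at creation
    int len;            // number of tokens written so far
} KVCache;

// Allocate once; afterwards no realloc or tensor concat is needed.
KVCache* kv_cache_create(int max_new_tokens, int dim) {
    KVCache* c = malloc(sizeof(KVCache));
    c->key_cache   = calloc((size_t)max_new_tokens * dim, sizeof(float));
    c->value_cache = calloc((size_t)max_new_tokens * dim, sizeof(float));
    c->dim = dim;
    c->max_new_tokens = max_new_tokens;
    c->len = 0;
    return c;
}

// Appending a token becomes an O(dim) memcpy into the next free row,
// instead of an O(n) concatenation of whole cache tensors.
int kv_cache_append(KVCache* c, const float* k, const float* v) {
    if (c->len >= c->max_new_tokens) return -1; // cache full
    memcpy(c->key_cache   + (size_t)c->len * c->dim, k, c->dim * sizeof(float));
    memcpy(c->value_cache + (size_t)c->len * c->dim, v, c->dim * sizeof(float));
    c->len++;
    return 0;
}

void kv_cache_free(KVCache* c) {
    free(c->key_cache);
    free(c->value_cache);
    free(c);
}
```

The trade-off is memory: the buffers are always sized for the worst case (`max_new_tokens`), even for short generations, in exchange for constant-time appends.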
I've just added a kv_cache for RMSNorm to show how it works. @karpathy, how do you like it?