llama2.c

Implementing an engine to serve the trained model by inference

Open Majdoddin opened this issue 2 years ago • 2 comments

see #346

Majdoddin avatar Aug 26 '23 12:08 Majdoddin

For each token, the resulting tensors are concatenated onto the cache tensors, which is a costly O(n) operation. Would it not be better to preallocate each cache as [max_new_tokens, ...]?
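A minimal sketch of the contrast being made (not the repo's actual code; plain Python lists stand in for tensors, and `head_dim` is an illustrative size): concatenating on every step copies the whole cache, while a preallocated `[max_new_tokens, ...]` buffer is filled in place with an O(1) write per step.

```python
def concat_cache(max_new_tokens, head_dim=4):
    # Grow the cache by concatenation: each step copies all existing
    # entries (the analogue of torch.cat), so step t costs O(t).
    cache = []
    for t in range(max_new_tokens):
        new_kv = [float(t)] * head_dim  # stand-in for this step's K/V row
        cache = cache + [new_kv]        # full copy of the old cache
    return cache

def prealloc_cache(max_new_tokens, head_dim=4):
    # Allocate [max_new_tokens, head_dim] once, then index-assign:
    # each step is an O(1) slot write, no copying of earlier rows.
    cache = [[0.0] * head_dim for _ in range(max_new_tokens)]
    for t in range(max_new_tokens):
        cache[t] = [float(t)] * head_dim
    return cache
```

Both produce identical contents; only the cost per decoding step differs.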

Majdoddin avatar Aug 26 '23 12:08 Majdoddin

I have just added a kv_cache for RMSNorm, to show how it works. @karpathy, how do you like it?

Majdoddin avatar Aug 26 '23 19:08 Majdoddin
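The linked change isn't shown here, but the idea of caching per-position RMSNorm outputs during incremental decoding can be sketched as follows. This is a hypothetical illustration, not the actual patch: `CachedRMSNorm`, `max_new_tokens`, and the list-based vectors are all assumptions, and `rmsnorm` follows the usual formulation (scale by `1/sqrt(mean(x^2) + eps)`, then by a learned weight).

```python
import math

def rmsnorm(x, weight, eps=1e-5):
    # Standard RMSNorm: divide by the root-mean-square of x, multiply by weight.
    ss = sum(v * v for v in x) / len(x)
    inv = 1.0 / math.sqrt(ss + eps)
    return [w * v * inv for w, v in zip(weight, x)]

class CachedRMSNorm:
    # Hypothetical cache wrapper: store each token's normalized vector in a
    # preallocated buffer so earlier positions are not renormalized on
    # later decoding steps.
    def __init__(self, weight, max_new_tokens):
        self.weight = weight
        self.cache = [None] * max_new_tokens  # one slot per position
        self.pos = 0

    def step(self, x):
        # Normalize only the newest token and record it at its position.
        self.cache[self.pos] = rmsnorm(x, self.weight)
        self.pos += 1
        return self.cache[self.pos - 1]
```

Note the same preallocation point from the earlier comment applies here: the buffer is sized `max_new_tokens` up front rather than grown by concatenation.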