llama2.c
Implementing an Engine to serve the trained model by running inference
see #346
For each token, the resulting tensors are concatenated onto the cache tensors, which is a costly O(n) operation. Wouldn't it be better to preallocate each cache as [max_new_tokens, ...]? A sketch of what that could look like follows.
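To make the suggestion concrete, here is a minimal C sketch of a preallocated cache. Only `max_new_tokens` comes from the comment above; the `KVCache` struct, its functions, and the `dim` parameter are hypothetical names for illustration, not code from this repo.

```c
#include <stdlib.h>
#include <string.h>

// Hypothetical layout: one flat buffer per cache, sized up front for
// max_new_tokens rows of dim floats each, so no per-token concatenation.
typedef struct {
    float* key_cache;   // (max_new_tokens, dim)
    float* value_cache; // (max_new_tokens, dim)
    int dim;            // per-token vector size (assumed)
    int max_new_tokens; // capacity fixed at creation
    int len;            // number of tokens written so far
} KVCache;

// Allocate once; afterwards no realloc or tensor concat is needed.
KVCache* kv_cache_create(int max_new_tokens, int dim) {
    KVCache* c = malloc(sizeof(KVCache));
    c->key_cache   = calloc((size_t)max_new_tokens * dim, sizeof(float));
    c->value_cache = calloc((size_t)max_new_tokens * dim, sizeof(float));
    c->dim = dim;
    c->max_new_tokens = max_new_tokens;
    c->len = 0;
    return c;
}

// Appending a token becomes an O(dim) memcpy into the next free row,
// instead of an O(n) concatenation of whole cache tensors.
int kv_cache_append(KVCache* c, const float* k, const float* v) {
    if (c->len >= c->max_new_tokens) return -1; // cache full
    memcpy(c->key_cache   + (size_t)c->len * c->dim, k, c->dim * sizeof(float));
    memcpy(c->value_cache + (size_t)c->len * c->dim, v, c->dim * sizeof(float));
    c->len++;
    return 0;
}

void kv_cache_free(KVCache* c) {
    free(c->key_cache);
    free(c->value_cache);
    free(c);
}
```

The trade-off is memory: the buffers are always sized for the worst case (`max_new_tokens`), even for short generations, in exchange for constant-time appends.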
I've just added a kv_cache for RMSNorm to show how it works. @karpathy, how do you like it?