export model to fp16
Ok. I see you went for a much deeper change.
Did you manage to test it?
It is not tested yet. I am trying to implement loading of the model (version 0, and maybe 1).
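For reference, here is a minimal sketch of what a version-aware loader could look like on the Python side. The header layout (magic, version, 7-int config) and the version-to-dtype mapping are assumptions for illustration, not the actual file format:

```python
import struct
import numpy as np

def load_checkpoint(path):
    """Hypothetical version-aware loader; the header layout is assumed."""
    with open(path, "rb") as f:
        magic, version = struct.unpack("ii", f.read(8))
        # dim, hidden_dim, n_layers, n_heads, n_kv_heads, vocab_size, max_seq_len
        config = struct.unpack("7i", f.read(7 * 4))
        if version == 0:
            dtype = np.float32  # legacy full-precision export
        elif version == 1:
            dtype = np.float16  # the fp16 export proposed here
        else:
            raise ValueError(f"unsupported export version {version}")
        weights = np.frombuffer(f.read(), dtype=dtype)
    return config, weights
```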
Question: what is the benefit of fp16?
- As the Llama 2 models were trained in bf16, I find fp16 conversion sketchy (see the sketch after this list). For newly trained models this is less of a concern
- The file sizes are, of course, ~2x smaller
- The code is a little bit more bloated
Am I missing some considerations?
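To make the first point concrete: bf16 keeps fp32's 8-bit exponent, while fp16 has only 5 exponent bits (max ~65504, min normal ~6.1e-5), so an exact bf16 value can overflow to inf or underflow to zero after conversion. A minimal demonstration with PyTorch:

```python
import torch

# bf16 shares fp32's exponent range (max ~3.4e38);
# fp16 tops out at 65504, so large bf16 values overflow
x = torch.tensor(70000.0, dtype=torch.bfloat16)
print(x)                    # tensor(70144., dtype=torch.bfloat16)
print(x.to(torch.float16))  # tensor(inf, dtype=torch.float16)

# tiny bf16 values underflow: below ~3e-8 rounds to zero in fp16
z = torch.tensor(1e-8, dtype=torch.bfloat16)
print(z.to(torch.float16))  # tensor(0., dtype=torch.float16)
```

In practice trained weights tend to sit well inside fp16's range, which is why the conversion often works anyway; the concern above is about the lack of a guarantee.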
The point is that the weights can be loaded directly onto the GPU. Not needing on-the-fly conversion (and having a smaller file to read) significantly reduces the load time, which on my Tesla T4 is around 2 min for the llama2_7b models.
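As a sketch of that direct-load path (assuming the weights are a flat fp16 blob after a fixed-size header; `HEADER_BYTES` is a placeholder, not the real offset):

```python
import numpy as np
import torch

HEADER_BYTES = 256  # placeholder header size, not the real layout

def load_fp16_to_gpu(path):
    # read the fp16 weights straight from disk, no fp32 pass
    weights = np.fromfile(path, dtype=np.float16, offset=HEADER_BYTES)
    # the tensor is already fp16, so .cuda() is a plain byte copy:
    # half the bytes to transfer and no conversion kernel on the fly
    return torch.from_numpy(weights).cuda()
```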
Also, I tested llama2.c on an ARM machine using ARM's native fp16 support, and it works like a charm (and ARM CPUs are cheaper on AWS).