export model to fp16
Ok. I see you went for a much deeper change.
Did you manage to test it?
It is not tested yet. I am trying to implement loading of the model (version 0, and maybe 1).
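For reference, here is a minimal sketch of what a version-aware loader could look like on the Python side. The header layout (magic, version, 7-int config) and the version-to-dtype mapping are assumptions for illustration, not the actual file format:

```python
import struct
import numpy as np

def load_checkpoint(path):
    """Hypothetical version-aware loader; the header layout is assumed."""
    with open(path, "rb") as f:
        magic, version = struct.unpack("ii", f.read(8))
        # dim, hidden_dim, n_layers, n_heads, n_kv_heads, vocab_size, max_seq_len
        config = struct.unpack("7i", f.read(7 * 4))
        if version == 0:
            dtype = np.float32  # legacy full-precision export
        elif version == 1:
            dtype = np.float16  # the fp16 export proposed here
        else:
            raise ValueError(f"unsupported export version {version}")
        weights = np.frombuffer(f.read(), dtype=dtype)
    return config, weights
```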
Question: what is the benefit of fp16?
- As the Llama 2 models were trained in bf16, I find fp16 conversion sketchy (see the sketch after this list). For newly trained models this is less of a concern
- The file sizes are, of course, ~2x smaller
- The code is a little bit more bloated
Am I missing some considerations?
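To make the first point concrete: bf16 keeps fp32's 8-bit exponent, while fp16 has only 5 exponent bits (max ~65504, min normal ~6.1e-5), so an exact bf16 value can overflow to inf or underflow to zero after conversion. A minimal demonstration with PyTorch:

```python
import torch

# bf16 shares fp32's exponent range (max ~3.4e38);
# fp16 tops out at 65504, so large bf16 values overflow
x = torch.tensor(70000.0, dtype=torch.bfloat16)
print(x)                    # tensor(70144., dtype=torch.bfloat16)
print(x.to(torch.float16))  # tensor(inf, dtype=torch.float16)

# tiny bf16 values underflow: below ~3e-8 rounds to zero in fp16
z = torch.tensor(1e-8, dtype=torch.bfloat16)
print(z.to(torch.float16))  # tensor(0., dtype=torch.float16)
```

In practice trained weights tend to sit well inside fp16's range, which is why the conversion often works anyway; the concern above is about the lack of a guarantee.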
The point is that the weights can be loaded directly onto the GPU. Not needing on-the-fly conversion (and having a smaller file to read) significantly reduces the load time, which on my Tesla T4 is around 2 min for the llama2_7b models.
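As a sketch of that direct-load path (assuming the weights are a flat fp16 blob after a fixed-size header; `HEADER_BYTES` is a placeholder, not the real offset):

```python
import numpy as np
import torch

HEADER_BYTES = 256  # placeholder header size, not the real layout

def load_fp16_to_gpu(path):
    # read the fp16 weights straight from disk, no fp32 pass
    weights = np.fromfile(path, dtype=np.float16, offset=HEADER_BYTES)
    # the tensor is already fp16, so .cuda() is a plain byte copy:
    # half the bytes to transfer and no conversion kernel on the fly
    return torch.from_numpy(weights).cuda()
```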
Also, I tested llama2.c on an ARM machine using ARM's native fp16 support, and it works like a charm (and ARM CPUs are cheaper on AWS).