Document / support for using BFLOAT16 with (Xeon) TGI service
The model used for ChatQnA supports BFLOAT16 in addition to TGI's default 32-bit float type: https://huggingface.co/Intel/neural-chat-7b-v3-3
TGI's memory usage halves, from 30 GB to 15 GB, and its performance also improves somewhat, if it is told to use BFLOAT16 (model weights are then stored in 16 bits instead of 32):
```diff
--- a/ChatQnA/kubernetes/manifests/tgi_service.yaml
+++ b/ChatQnA/kubernetes/manifests/tgi_service.yaml
@@ -28,6 +29,8 @@ spec:
         args:
         - --model-id
         - $(LLM_MODEL_ID)
+        - --dtype
+        - bfloat16
         #- "/data/Llama-2-7b-hf"
         # - "/data/Mistral-7B-Instruct-v0.2"
         # - --quantize
```
However, only newer Xeons support BFLOAT16. Therefore, if the user's cluster has heterogeneous nodes, the TGI service needs a node selector that schedules it onto a node with BFLOAT16 support.
This can be automated by using node-feature-discovery and its CPU feature labeling: https://kubernetes-sigs.github.io/node-feature-discovery/stable/usage/features.html#cpu
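For example, assuming NFD is deployed on the cluster, a node selector along these lines in the TGI pod spec would restrict scheduling to BF16-capable nodes (the label below is NFD's CPUID feature label for the AVX512 BF16 instructions; the exact labels available depend on the NFD version and on which BF16 variant, AVX512 or AMX, the nodes provide):

```yaml
# Schedule TGI only onto nodes where node-feature-discovery has
# detected AVX512 BF16 support in the CPU.
nodeSelector:
  feature.node.kubernetes.io/cpu-cpuid.AVX512BF16: "true"
```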
It would be good to add some documentation and examples (e.g. comment lines in YAML) for this.
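As a sketch, the manifest could carry the option commented out, in the same style as the existing commented-out alternatives, so that users on BF16-capable Xeons only need to uncomment it:

```yaml
        # Uncomment to halve TGI memory usage by using 16-bit weights.
        # Requires BFLOAT16 support in the CPU; on heterogeneous clusters,
        # combine with a nodeSelector like the one above.
        #- --dtype
        #- bfloat16
```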