Stack from ghstack (oldest at bottom):
We should use this option when exporting the 1B/3B models in bf16, because the KV cache is always fp32. Without it, the bf16 1B/3B models show regressed performance.
Differential Revision: D63871048