CPUAdam fp16 and bf16 support
Hi. Please review the following changes. I added BF16 support to CPU Adam. BF16, FP16 and float are all supported at compilation time, and the correct template instantiation is called at runtime according to the dtype of the input params.
@BacharL, thanks for this incredible improvement to the offloading optimizers and op builders. I left a few comments and questions, but overall looks good to me.
@tjruwase Thanks for reviewing this change. I have made changes to address your comments. There is no longer any need to pass HALF_DTYPE as a compiler define. All functions are now templated on ds_device_precision_t, and I removed all half_precision parameters.
Added a templated invoker to help select the implementation.
The map stores function pointers to the templated functions, keyed by the type enum. At initialization, the template is instantiated for every supported dtype and each instantiation is inserted into the map.
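For readers following along, here is a minimal sketch of the dispatch pattern described above. It is not the code from this PR: DType, BF16, sgd_step, step_table and invoke_step are hypothetical names, the update rule is a toy p -= lr * g rather than Adam, and the bf16 conversion simply truncates. Float16 is deliberately left unregistered to show the unsupported-dtype path.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <map>
#include <stdexcept>

// Hypothetical dtype tag standing in for the enum used as the map key.
enum class DType { Float32, Float16, BFloat16 };

// Hypothetical bf16 storage type: the upper 16 bits of an IEEE float32.
struct BF16 {
    std::uint16_t bits;
};

inline float to_float(float v) { return v; }
inline float to_float(BF16 v) {
    std::uint32_t w = static_cast<std::uint32_t>(v.bits) << 16;
    float f;
    std::memcpy(&f, &w, sizeof(f));
    return f;
}

template <typename T>
T from_float(float f);
template <>
inline float from_float<float>(float f) { return f; }
template <>
inline BF16 from_float<BF16>(float f) {
    std::uint32_t w;
    std::memcpy(&w, &f, sizeof(w));
    return BF16{static_cast<std::uint16_t>(w >> 16)};  // truncation, no rounding
}

// Toy templated kernel (p -= lr * g), computed in float and stored as T.
template <typename T>
void sgd_step(void* params, const float* grads, std::size_t n, float lr) {
    assert(params != nullptr && grads != nullptr);
    T* p = static_cast<T*>(params);
    for (std::size_t i = 0; i < n; ++i) {
        p[i] = from_float<T>(to_float(p[i]) - lr * grads[i]);
    }
}

// All instantiations share one type-erased signature so they fit in one map.
using StepFn = void (*)(void*, const float*, std::size_t, float);

// Built once on first use: one entry per supported dtype.
const std::map<DType, StepFn>& step_table() {
    static const std::map<DType, StepFn> table = {
        {DType::Float32, &sgd_step<float>},
        {DType::BFloat16, &sgd_step<BF16>},
    };
    return table;
}

// Runtime dispatch: look up the instantiation matching the params' dtype.
void invoke_step(DType dtype, void* params, const float* grads, std::size_t n, float lr) {
    const auto& table = step_table();
    auto it = table.find(dtype);
    if (it == table.end()) throw std::invalid_argument("unsupported dtype");
    it->second(params, grads, n, lr);
}
```

With this layout, supporting another dtype means adding one storage type with its conversions and one map entry; the call sites only pass the dtype tag.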
I didn't clean up ds_adagrad_step_plus_copy and the related code under __ENABLE_CUDA__, and I also couldn't test it.