Pointwise kernels choose grid size based on the number of CUs
https://github.com/ROCmSoftwarePlatform/composable_kernel/pull/261#discussion_r883726267
The elementwise and maxpool backward kernels suffer from this issue. As discussed with @qianfengz, fixing it might require modifying StreamConfig.
I will submit a PR to solve the issue. Specifically, the PR will provide an interface like
`int getAvailableComputeUnitCount(StreamConfig & stream_cfg); `
The interface will call hipExtStreamGetCUMask() and count the set bits in the returned mask to get the number of available CUs.
The PR will also add a hipDevice_t device_id = 0; member to the definition of StreamConfig. This is needed in case more information has to be queried from the device underlying the current stream.
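
As a rough sketch of the mask-counting step only (not the final implementation), the snippet below counts the set bits in a CU mask laid out as an array of 32-bit words, which is the layout hipExtStreamGetCUMask() fills in. The helper names `countEnabledComputeUnits` and `chooseGridSize`, and the `blocks_per_cu` parameter, are hypothetical and only illustrate how the count could feed the grid size. The HIP call itself is left out so the snippet compiles without the runtime.

```cpp
#include <algorithm>
#include <bitset>
#include <cstdint>
#include <vector>

// Hypothetical helper: count the CUs enabled in a stream's CU mask.
// In the real interface the words would be filled by hipExtStreamGetCUMask();
// here the mask is passed in directly so the logic can be checked on its own.
inline int countEnabledComputeUnits(const std::vector<std::uint32_t>& cu_mask)
{
    int count = 0;
    for(std::uint32_t word : cu_mask)
    {
        // Each set bit marks one CU that the stream is allowed to use.
        count += static_cast<int>(std::bitset<32>(word).count());
    }
    return count;
}

// Hypothetical helper: cap the grid at what the available CUs can hold,
// assuming blocks_per_cu resident blocks per CU, and never return less than 1.
inline int chooseGridSize(int work_blocks, int available_cus, int blocks_per_cu)
{
    const int capacity = std::max(1, available_cus * blocks_per_cu);
    return std::max(1, std::min(work_blocks, capacity));
}
```

With a mask of `{0xFFFFFFFF, 0x0000000F}`, 36 CUs are counted, so a kernel needing 1000 blocks at an assumed 4 blocks per CU would launch a grid of 144.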