Ying Zhang
Thanks for your contribution! From the stack trace, it seems the input tensor's dtype is fp16 instead of fp32. Some earlier operator may have hard-coded fp16 as its output dtype...
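A minimal sketch of the kind of dtype guard that surfaces this error. This is illustrative only, not AITemplate's actual API: the function name and the dtype strings are assumptions.

```python
# Hypothetical dtype check performed before launching a kernel that was
# compiled for a specific precision. Names here are illustrative, not AIT's.
def check_input_dtype(actual_dtype: str, expected_dtype: str = "float32") -> None:
    if actual_dtype != expected_dtype:
        raise TypeError(
            f"expected {expected_dtype} input, got {actual_dtype}; "
            "an upstream operator may have hard-coded its output dtype"
        )

check_input_dtype("float32")  # matches, no error
```

Passing `"float16"` here raises `TypeError`, which is the shape of failure the stack trace suggests: the kernel expected fp32 but received an fp16 tensor produced upstream.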
cc @fsx950223
I think for individual gemm kernels, AIT should perform similarly to rocBLAS. AIT's perf gains mostly come from operator fusion.
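To illustrate why fusion (rather than individual kernels) is where the gains come from, here is a tiny pure-Python sketch of the idea: an unfused pipeline materializes an intermediate and makes two passes over the data, while the fused version does the same math in one pass. The function names are made up for illustration; in AIT the fused ops are GPU kernels (e.g. a gemm with its epilogue), not Python loops.

```python
# Unfused: two separate "kernels", with an intermediate written and re-read.
def scale_then_shift_unfused(xs):
    ys = [x * 2.0 for x in xs]       # pass 1: produce intermediate
    return [y + 1.0 for y in ys]     # pass 2: consume it

# Fused: one "kernel", no intermediate traffic.
def scale_then_shift_fused(xs):
    return [x * 2.0 + 1.0 for x in xs]
```

Both return the same result; the fused form simply avoids the extra memory round-trip, which on a GPU is often the dominant cost for memory-bound epilogues.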
btw AIT doesn't run well on ROCm even on Linux right now. There is an ongoing PR to support ROCm on Linux: https://github.com/facebookincubator/AITemplate/pull/146. cc @asroy, @fsx950223
Thanks @fsx950223 for your fix and for adding the AMD CI! For some reason the CircleCI pipeline fails; I'll manually merge the PR into our internal repo and run...
Also, the AMD CI doesn't seem to be triggered. Has it been enabled successfully? @fsx950223
Unfortunately, not all kernels run well on SM75 GPUs. Check this README: https://fburl.com/pimcs20r.
@carlushuang Please send a PR and we'll merge your fix into upstream, thanks!
The AIT runtime has two parts: a CPU part and a GPU part. The CPU part relies on num_runtimes for parallelization, while the GPU part relies on streams for parallelization. It's valid to...
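A conceptual sketch of the CPU-side half of this scheme, assuming nothing about AIT's real classes: `num_runtimes` bounds how many requests are prepared concurrently, and each runtime slot would own its own GPU stream so device work can overlap. The `RuntimePool` name and its methods are invented for illustration.

```python
import queue

class RuntimePool:
    """Illustrative only (not AIT's API): a pool of `num_runtimes` slots.
    Each slot stands in for one runtime instance plus its GPU stream."""

    def __init__(self, num_runtimes: int):
        self._free = queue.Queue()
        for rt_id in range(num_runtimes):
            self._free.put(rt_id)

    def run(self, fn):
        rt_id = self._free.get()   # blocks when all runtimes are busy,
        try:                       # which is how num_runtimes caps concurrency
            return fn(rt_id)       # fn would enqueue GPU work on rt_id's stream
        finally:
            self._free.put(rt_id)  # return the runtime for the next request
```

Callers (e.g. one thread per inference request) share the pool; CPU-side parallelism is capped by the pool size, while GPU-side overlap comes from each runtime enqueueing onto its own stream.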
I think in the multi-process case, each process may write to the same GPU memory, which causes errors. I need to check the detailed error message to confirm. Regarding dynamic batching support:...