Larry Meadows
I'm late to the party but I don't see any significant difference between SYCL and CUDA scores on A100 once you accept the PR I just submitted :) . SYCL...
I just eyeballed the PTX. I should look more carefully to see where the extra instructions are coming from. There were definitely a lot of parameters, maybe there's some dead...
I ran on AMD MI100 both "spock" at ORNL and one of the nodes at ANL JLSE. Aside, the JLSE nodes are a tiny bit faster on this and other...
I ran SYCL2020 on A100, just the vanilla version with no USM changes. Dot is only 1198 GB/sec; I was getting 1290 GB/sec with the SYCL version (not USM), and 1340...
SYCL2020 with a redone dot kernel (but not USM) doesn't do quite as well as the original SYCL version on dot on A100: 1235 GB/sec vs. 1292 GB/sec, and 1339...
Yes, I need to revisit SYCL-2020 vs. the previous version without the USM and be a little more rigorous. On the reduction, apparently it uses this: https://github.com/intel/llvm/blob/8213321ebb90110bf4f3d04fa0dc8e131a464a19/libclc/ptx-nvidiacl/libspirv/group/collectives.cl#L263 I note that...
Yes, I want per-call information. Similarly I'd like per-call launch information for all kernel launches. They exist for hipModuleLaunchKernel but are blank for hipLaunchKernelGGL. I'm happy to mine this out...
Yes, OK, fair enough, give me a few days.
Well, it took more than a few days. Sorry. I do see that data in the sqlite db for copies:
```
,BeginNs,EndNs,pid,tid,Name,args,Index,Data,__section,__lane,DurationNs
9164,7148818203319432,7148818203870232,19156,19156,hipMemcpyAsync,( dst(0xf89e20) src(0x7f3480e08000) sizeBytes(3840) kind(4) stream(2)),9165,,2,19156,550800
```
(Sorry...
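For anyone who wants to mine per-call copy information out of rows like the one quoted in this comment, here is a minimal Python sketch. It assumes the rows have been exported as CSV with exactly the header shown there, and that the `args` field always carries a `sizeBytes(...)` token. The helper name `parse_copies` and the output keys are hypothetical, not part of any profiler tool. Note that for `hipMemcpyAsync` the Begin/End timestamps probably bracket the host-side API call, so the derived rate is only a rough indicator, not a measured transfer bandwidth.

```python
import csv
import io
import re

# Header and row copied verbatim from the comment; a real export would be read from a file.
SAMPLE = """,BeginNs,EndNs,pid,tid,Name,args,Index,Data,__section,__lane,DurationNs
9164,7148818203319432,7148818203870232,19156,19156,hipMemcpyAsync,( dst(0xf89e20) src(0x7f3480e08000) sizeBytes(3840) kind(4) stream(2)),9165,,2,19156,550800
"""

SIZE_RE = re.compile(r"sizeBytes\((\d+)\)")


def parse_copies(text):
    """Return one dict per row whose args contain a sizeBytes(...) token (hypothetical helper)."""
    rows = []
    for rec in csv.DictReader(io.StringIO(text)):
        match = SIZE_RE.search(rec["args"])
        if match is None:
            continue  # not a copy call, or args were left blank
        size_bytes = int(match.group(1))
        duration_ns = int(rec["EndNs"]) - int(rec["BeginNs"])
        rows.append({
            "name": rec["Name"],
            "size_bytes": size_bytes,
            "duration_ns": duration_ns,
            # bytes per nanosecond is numerically equal to GB/s (1e9 bytes per second)
            "gb_per_s": size_bytes / duration_ns if duration_ns > 0 else None,
        })
    return rows


if __name__ == "__main__":
    for row in parse_copies(SAMPLE):
        print(row)
```

For the sample row, the computed `duration_ns` agrees with the `DurationNs` column (550800), which is a quick sanity check that the columns are being read in the right order.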
I really don't know, this was so long ago. And now I work for AMD :) I will close the issue.