phantaurus
@thakkarV Thank you so much for your reply! What about DRIVE Thor, which has compute capability 10.1? Will it be supported along with sm120a?
Thank you so much for your response! I recently read the FA3 paper. I guess GPUs like Orin do not have async MMA operations. Do you think the ping-pong structure...
I think the problem is here: `src/spconvlib/spconv/csrc/sparse/alloc/StaticAllocator/StaticAllocator_empty.cc`. Even though we've defined the StaticAllocator, Thrust still uses dynamic allocation. Could you help me understand why this is the case? I think...
Thank you so much for your reply! I think you're describing an ideal asynchronous scenario: If the system has sufficient parallelism, one pipeline—whether compute or memory—becomes saturated, while the other...
I have confirmed that memory copies, whether from global memory (GmemCopy) or shared memory (SmemCpy), do not significantly impact the 50% TensorCore Active %. I removed all data copying operations,...
Ah, I see. I am measuring based on Max FP16 TFLOPS. The numbers make a lot more sense now. I guess we have to use FP32 for softmax, so achieving...
Thank you so much for your reply! I have experimented with partitions:

```cpp
Tensor acc = make_tensor(make_layout(Shape{}, Stride{}));
Layout acc_layout = acc.layout();
Layout acc_tiler = make_layout(get(acc_layout), Layout{}, get(acc_layout));
Tensor partitioned_acc...
```
Another question is:

```cpp
Layout acc_tiler_new = make_layout(get(acc_layout), Layout{}, Layout{});
Tensor partitioned_acc_new = logical_divide(acc, acc_tiler_new);
```

`acc_tiler_new` is `((_2,_2),(_1),(_1)):((_1,_2),(_0),(_0))`, and `partitioned_acc_new` is `o (((_2,_2),(_1),(_1)),_32):(((_1,_2),(_0),(_0)),_4)`. However, I would expect `partitioned_acc_new` to be o...