phantaurus
@thakkarV Thank you so much for your reply! What about DRIVE Thor, which has compute capability 10.1? Will it be supported along with sm120a?
Thank you so much for your response! I recently read the FA3 paper. I guess GPUs like Orin do not have async MMA operations. Do you think the ping-pong structure...
I think the problem is here: `src/spconvlib/spconv/csrc/sparse/alloc/StaticAllocator/StaticAllocator_empty.cc`. Even though we've defined the StaticAllocator, Thrust still uses dynamic allocation. Could you help me understand why this is the case? I think...
Thank you so much for your reply! I think you're describing an ideal asynchronous scenario: If the system has sufficient parallelism, one pipeline—whether compute or memory—becomes saturated, while the other...
I have confirmed that memory copies, whether from global memory (GmemCopy) or shared memory (SmemCpy), do not significantly impact the 50% TensorCore Active %. I removed all data copying operations,...
Ah, I see. I am measuring based on Max FP16 TFLOPS. The numbers make a lot more sense now. I guess we have to use FP32 for softmax, so achieving...
Thank you so much for your reply! I have experimented with partitions:

```cpp
Tensor acc = make_tensor(make_layout(Shape{}, Stride{}));
Layout acc_layout = acc.layout();
Layout acc_tiler = make_layout(get(acc_layout), Layout{}, get(acc_layout));
Tensor partitioned_acc...
```
Another question is:

```cpp
Layout acc_tiler_new = make_layout(get(acc_layout), Layout{}, Layout{});
Tensor partitioned_acc_new = logical_divide(acc, acc_tiler_new);
```

`acc_tiler_new` is `((_2,_2),(_1),(_1)):((_1,_2),(_0),(_0))`, and `partitioned_acc_new` is `o (((_2,_2),(_1),(_1)),_32):(((_1,_2),(_0),(_0)),_4)`. However, I would expect `partitioned_acc_new` to be o...