Minh Quan Ho

45 comments by Minh Quan Ho

Ping. Is there any news on this? I recently ran into the same excessively long compile time with an [OpenCL kernel of OpenCV](https://github.com/opencv/opencv/blob/4.5.5/modules/objdetect/src/opencl/objdetect_hog.cl). It took ~20 s to compile and has...

To export to an image file, dd can do the job:
```
(sudo) dd if= of=/path/to/your/img bs=1M
```

> Some of this will naturally be addressed when @devinamatthews obviates the need for the bli_gemm_int() function, which is on his docket

+1 @devinamatthews As I can see, there is...

> @hominhquan thanks for your analysis. In practice, `BLIS_IR_NT` is always 1, as threading this loop just doesn't make sense on any architecture we've seen. Without diving into the details...

Here is an example of load imbalance: MC = NC = 256, MR = 8, NR = 16, M = N = 3000. The possible edge macro-block is (`3000 % 256`): 184-by-184...
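
The arithmetic behind this example can be checked with a small sketch. The helper names (`edge_size`, `num_iters`, `idle_slots`) are hypothetical and are not part of the BLIS API, and the thread count used in the note below is an assumption chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Size of the trailing (edge) block when a dimension of length `dim`
 * is partitioned into blocks of size `blk`. Returns 0 if `blk`
 * divides `dim` exactly, i.e. there is no partial edge block. */
size_t edge_size(size_t dim, size_t blk)
{
    assert(blk > 0);
    return dim % blk;
}

/* Number of loop iterations needed to cover `len` elements in steps
 * of `step`, counting a final partial step: ceil(len / step). */
size_t num_iters(size_t len, size_t step)
{
    assert(step > 0);
    return (len + step - 1) / step;
}

/* If `n_iter` iterations are spread over `nt` threads, the slowest
 * thread runs ceil(n_iter / nt) iterations. This returns how many
 * iteration slots stay empty across all threads in that time. */
size_t idle_slots(size_t n_iter, size_t nt)
{
    assert(nt > 0);
    return num_iters(n_iter, nt) * nt - n_iter;
}
```

With the values above, `edge_size(3000, 256)` gives 184, so the edge macro-block has `num_iters(184, 8)` = 23 iterations along MR and `num_iters(184, 16)` = 12 along NR, against 32 and 16 for a full 256-by-256 block. Assuming, for instance, 8 threads on the NR-direction loop, `idle_slots(12, 8)` = 4 slots go unused on the edge block, while a full block (`idle_slots(16, 8)` = 0) is perfectly balanced.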

> OK. Do you have any performance numbers? It sounds like we can improve general parallel performance then.

For example, on the Kalray MPPA3 processor, I get a speedup x1.3...

In fact, we can re-use the current slab/rr dispatch, but on the fused workspace:
```
/* construct a fused JR/IR thread_info */
thrinfo_t thread_jrir = ... ;
bli_thread_range_jrir( &thread_jrir, n_iter...
```

@devinamatthews Yes, but the full condition should be:
- `BLIS_JR_NT` should ideally be a divisor of `BLIS_NC/BLIS_NR`, and
- `BLIS_IR_NT` should ideally be a divisor of `BLIS_MC/BLIS_MR`

> BLIS_JR_NT also shouldn't be very large,...
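
The two divisibility conditions can be expressed as a small check. This is only a sketch: `nt_divides_evenly` is a hypothetical helper, not a BLIS function, and it assumes the cache block size is itself a multiple of the register block size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true when `nt` threads split the micro-tile loop of one full
 * cache block evenly, i.e. when `nt` divides (cache_blk / reg_blk).
 * For the JR loop pass (NC, NR, BLIS_JR_NT); for the IR loop pass
 * (MC, MR, BLIS_IR_NT).
 * Assumption: cache_blk is a multiple of reg_blk. */
bool nt_divides_evenly(size_t cache_blk, size_t reg_blk, size_t nt)
{
    assert(reg_blk > 0 && nt > 0);
    assert(cache_blk % reg_blk == 0);
    return (cache_blk / reg_blk) % nt == 0;
}
```

Taking NC = 256 and NR = 16 as an example, a full block has 16 JR iterations, so 4 threads satisfy the condition and 3 do not. With MC = 256 and MR = 8 there are 32 IR iterations, which 32 threads divide evenly and 5 threads do not.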

@fgvanzee You are right: functionally speaking, this is not a bug but a sub-optimality in micro-kernel dispatch. As you said, it is possible to have idle threads that do no computation.

> The main thread should check out a block from the existing pool for C, which is of size (nt_pc-1)*m*n and broadcast to the other threads. Since the main thread...