Kaushik Kulkarni
I see the following two approaches:
1. Rewrite the kernel launches in `pycuda.gpuarray` to guard against passing `None` as an argument (sketched below), or,
2. Modify `Function.prepared_call` to accept `None` as a...
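For approach 1, here is a minimal sketch of the kind of guard meant; this is illustrative only and not `pycuda.gpuarray`'s actual launch code (the wrapper name and error message are made up).

```python
# Illustrative guard in the spirit of approach 1; not pycuda.gpuarray's actual
# launch code. `func` is assumed to be a pycuda Function already prepared via
# `func.prepare(...)`.
def guarded_prepared_call(func, grid, block, *args):
    for i, arg in enumerate(args):
        if arg is None:
            raise ValueError(
                f"kernel argument #{i} is None; refusing to launch")
    return func.prepared_call(grid, block, *args)
```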
From the discussion at inducer/islpy#28, it was established that ISL's performance depends on whether it was linked against `GMP` or `imath` (shipped with islpy). Running the same kernel with islpy...
After #149 and #150, and taking the suggestions from inducer/islpy#28, the codegen profile looks better:
```
3501511 function calls (2873484 primitive calls) in 71.862 seconds

   Ordered by: internal time

   ncalls  tottime  percall...
```
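For reference, a profile in this format can be collected with `cProfile`/`pstats`; the profiled call below is a placeholder, not the actual codegen entry point used here.

```python
# Hedged sketch: collecting an "Ordered by: internal time" profile like the
# one above. `run_codegen` is a placeholder for the actual codegen call.
import cProfile
import pstats

def run_codegen():
    pass  # placeholder, e.g. loopy code generation for the kernel in question

with cProfile.Profile() as prof:
    run_codegen()

pstats.Stats(prof).sort_stats("tottime").print_stats(25)
```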
Yep, that's an ISL bug:
```python
>>> import islpy as isl
>>> isl.BasicSet("{[i]: 0
```
I did not look closely at the provided kernel, but that can happen in the following case:
```python
import loopy as lp

knl = lp.make_kernel(
    "{[i, j]: 0
```
Passing `default_tag=None` to `add_prefetch` and parallelizing the prefetch by explicitly calling `split_iname` might help.
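A hedged sketch of that recipe; the kernel, array names, and tile sizes below are illustrative assumptions, not the kernel in question.

```python
# Hedged sketch: prefetch without the default tag, then parallelize the fetch
# loop by hand. The kernel, names, and tile size are made up for illustration.
import loopy as lp

knl = lp.make_kernel(
    "{[i]: 0 <= i < n}",
    "out[i] = 2 * a[i]")

knl = lp.split_iname(knl, "i", 16, outer_tag="g.0", inner_tag="l.0")

# default_tag=None keeps add_prefetch from auto-tagging the fetch inames ...
knl = lp.add_prefetch(knl, "a", sweep_inames=["i_inner"], default_tag=None)

# ... so they can be split/tagged explicitly afterwards. The generated fetch
# iname's name depends on the loopy version; inspect the kernel to find it,
# then e.g.:
# knl = lp.split_iname(knl, "<fetch_iname>", 4, inner_tag="l.0")
```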
Also, if the workload is coming from Mirge-Com, it might be useful to evaluate if such big batched einsums are relevant. See https://github.com/illinois-ceesd/mirgecom/issues/777 for context.
On some more thought, I think the current way of summing the contributions is too heavy on global memory; instead, storing the mapping into a single array should be more efficient:...
The dtypes of `from_element_indices` are different; unsure yet what's causing it.
```diff
< from_element_indices: type: np:dtype('int32'), shape: (nelements), dim_tags: (N0:stride:1), offset: aspace: global
---
> from_element_indices: type: np:dtype('int64'), shape: (nelements),...
```
@majosm: Thanks for pointing out the potential bottlenecks. I memoized those routines.
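For completeness, a generic sketch of the kind of memoization meant here; the routine below is a placeholder, not one of the actual routines that were memoized.

```python
# Generic memoization sketch; `lookup_table` is a placeholder routine, not one
# of the actual routines referenced above.
from functools import lru_cache

@lru_cache(maxsize=None)
def lookup_table(order):
    # Expensive, side-effect-free computation whose result depends only on its
    # (hashable) arguments, so repeated calls can reuse the cached value.
    return tuple(i * order for i in range(1000))
```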