Kaushik Kulkarni
I see the following two approaches:
1. Rewrite the kernel launches in `pycuda.gpuarray` to guard against passing `None` as an argument (sketched below), or,
2. Modify `Function.prepared_call` to accept `None` as a...
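For approach 1, here is a minimal sketch of the kind of guard meant; this is illustrative only and not `pycuda.gpuarray`'s actual launch code (the wrapper name and error message are made up).

```python
# Illustrative guard in the spirit of approach 1; not pycuda.gpuarray's actual
# launch code. `func` is assumed to be a pycuda Function already prepared via
# `func.prepare(...)`.
def guarded_prepared_call(func, grid, block, *args):
    for i, arg in enumerate(args):
        if arg is None:
            raise ValueError(
                f"kernel argument #{i} is None; refusing to launch")
    return func.prepared_call(grid, block, *args)
```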
From the discussion at inducer/islpy#28, it was established that ISL's performance depends on whether it was linked against `GMP` or `imath` (shipped with islpy). Running the same kernel with islpy...
After #149 and #150, and taking the suggestions from inducer/islpy#28, the codegen profile looks better:
```
3501511 function calls (2873484 primitive calls) in 71.862 seconds

   Ordered by: internal time

   ncalls  tottime  percall...
```
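For reference, a profile in this format can be collected with `cProfile`/`pstats`; the profiled call below is a placeholder, not the actual codegen entry point used here.

```python
# Hedged sketch: collecting an "Ordered by: internal time" profile like the
# one above. `run_codegen` is a placeholder for the actual codegen call.
import cProfile
import pstats

def run_codegen():
    pass  # placeholder, e.g. loopy code generation for the kernel in question

with cProfile.Profile() as prof:
    run_codegen()

pstats.Stats(prof).sort_stats("tottime").print_stats(25)
```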
Yep, that's an ISL bug:
```python
>>> import islpy as isl
>>> isl.BasicSet("{[i]: 0
```
I did not look closely at the provided kernel, but that can happen in the following case:
```python
import loopy as lp

knl = lp.make_kernel(
    "{[i, j]: 0
```
Passing `default_tag=None` to `add_prefetch` and parallelizing the prefetch by explicitly calling `split_iname` might help.
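A hedged sketch of that recipe; the kernel, array names, and tile sizes below are illustrative assumptions, not the kernel in question.

```python
# Hedged sketch: prefetch without the default tag, then parallelize the fetch
# loop by hand. The kernel, names, and tile size are made up for illustration.
import loopy as lp

knl = lp.make_kernel(
    "{[i]: 0 <= i < n}",
    "out[i] = 2 * a[i]")

knl = lp.split_iname(knl, "i", 16, outer_tag="g.0", inner_tag="l.0")

# default_tag=None keeps add_prefetch from auto-tagging the fetch inames ...
knl = lp.add_prefetch(knl, "a", sweep_inames=["i_inner"], default_tag=None)

# ... so they can be split/tagged explicitly afterwards. The generated fetch
# iname's name depends on the loopy version; inspect the kernel to find it,
# then e.g.:
# knl = lp.split_iname(knl, "<fetch_iname>", 4, inner_tag="l.0")
```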
Also, if the workload is coming from Mirge-Com, it might be useful to evaluate if such big batched einsums are relevant. See https://github.com/illinois-ceesd/mirgecom/issues/777 for context.
On some more thought, I think the current way of summing the contributions is too heavy on global memory; instead, storing the mapping into a single array should be more efficient:...
The dtypes of `from_element_indices` are different; unsure yet what's causing it.
```diff
< from_element_indices: type: np:dtype('int32'), shape: (nelements), dim_tags: (N0:stride:1), offset: aspace: global
---
> from_element_indices: type: np:dtype('int64'), shape: (nelements),...
```
@majosm: Thanks for pointing out the potential bottlenecks. I memoized those routines.
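For completeness, a generic sketch of the kind of memoization meant here; the routine below is a placeholder, not one of the actual routines that were memoized.

```python
# Generic memoization sketch; `lookup_table` is a placeholder routine, not one
# of the actual routines referenced above.
from functools import lru_cache

@lru_cache(maxsize=None)
def lookup_table(order):
    # Expensive, side-effect-free computation whose result depends only on its
    # (hashable) arguments, so repeated calls can reuse the cached value.
    return tuple(i * order for i in range(1000))
```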