qazal comments

Results: 187 comments of qazal

Lowerer Multireduce Grouping

yea it was the compiler ![image](https://github.com/user-attachments/assets/f9f376cd-f2aa-4610-9549-31e5e3464361) ![image](https://github.com/user-attachments/assets/43a29411-f43b-4343-821c-7edb58178384)

Lowerer Multireduce Grouping

cool, I'll review that. @0xtimmy what are the other (correctness) blockers?

Lowerer lidx reuse

hmm, I'm getting:
```
test/test_linearizer.py::TestLinearizer::test_double_reduce_multireduce - RuntimeError: OpenCL Error -54: CL_INVALID_WORK_GROUP_SIZE
```
running on M1 with GPU=1

Lowerer lidx reuse

so to me this speedup doesn't look worth the complexity. If you really want this, can you think of the simplest way to express it? IIRC this doesn't impact correctness.

Lowerer lidx reuse

I'd continue iterating on a simpler fix. Maybe this can be a graph rewrite rule, like `fold_reduce_dims` - or even an OptOp if that level of abstraction makes sense. (ofc,...

gated store rewrite to UOps.IF

bounty locked, good luck! Your test output is correct but the implementation can be simpler. Can you rewrite all gated stores with IF? There shouldn't be two paths for gated...
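To make the "rewrite all gated stores with IF" idea concrete, here is a minimal sketch on a toy IR. Everything in it is hypothetical and for illustration only: `Node`, the op names, the 4-source gated `STORE` layout, and `rewrite_gated_store` are assumptions, not tinygrad's actual UOp classes or rewrite API.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class Node:
    # Hypothetical toy IR node, NOT tinygrad's UOp.
    op: str
    src: Tuple["Node", ...] = ()
    arg: Optional[object] = None

def rewrite_gated_store(n: Node) -> Node:
    """Bottom-up rewrite: STORE(buf, idx, val, gate) -> IF(gate, STORE(buf, idx, val))."""
    # Rewrite the sources first so nested gated stores are handled too.
    n = Node(n.op, tuple(rewrite_gated_store(s) for s in n.src), n.arg)
    # Assumption: a gated store is a STORE with a fourth source holding the gate.
    if n.op == "STORE" and len(n.src) == 4:
        buf, idx, val, gate = n.src
        # Single path: every gated store becomes an ungated store inside an IF.
        return Node("IF", (gate, Node("STORE", (buf, idx, val))))
    return n

# Example inputs (all names hypothetical).
buf = Node("DEFINE_GLOBAL", arg=0)
idx = Node("CONST", arg=0)
val = Node("LOAD", (Node("DEFINE_GLOBAL", arg=1), idx))
gate = Node("CMPLT", (Node("SPECIAL", arg="lidx0"), Node("CONST", arg=1)))

gated = Node("STORE", (buf, idx, val, gate))
plain = Node("STORE", (buf, idx, val))

rewritten = rewrite_gated_store(gated)
unchanged = rewrite_gated_store(plain)
```

In this sketch only the store is moved under the IF. Scheduling the loads that feed it inside the IF as well would be a separate step that the toy IR does not model.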

gated store rewrite to UOps.IF

> the prioritization/ordering of things could happen within a rewrite rule?

I think if you do the graph rewriting right you don't need to worry about the queue.

gated store rewrite to UOps.IF

I can't repro the AMD issue on a real chip, it might be an emulator bug. I'll look into that one.

gated store rewrite to UOps.IF

The diff looks very close! Some things I noticed:

1. This isn't wrapping the second reduceop in the IF for multi reduce. It should, LOAD is expensive. `TestLinearizer.test_double_reduce_multireduce` ![image](https://github.com/user-attachments/assets/4f47112f-8443-438f-8939-c3ab1dd44e9a)
2....

gated store rewrite to UOps.IF

@ianpaul10 you can access a green box for that, sent info on discord.