qazal

Results 187 comments of qazal

yea it was the compiler ![image](https://github.com/user-attachments/assets/f9f376cd-f2aa-4610-9549-31e5e3464361) ![image](https://github.com/user-attachments/assets/43a29411-f43b-4343-821c-7edb58178384)

cool, I'll review that. @0xtimmy what are the other (correctness) blockers?

hmm, I'm getting:
```
test/test_linearizer.py::TestLinearizer::test_double_reduce_multireduce - RuntimeError: OpenCL Error -54: CL_INVALID_WORK_GROUP_SIZE
```
running on M1 with GPU=1

so to me this speedup doesn't look worth the complexity. If you really want this, can you think of the simplest way to express it? IIRC this doesn't impact correctness.

I'd continue iterating on a simpler fix. Maybe this can be a graph rewrite rule, like `fold_reduce_dims` - or even an OptOp if that level of abstraction makes sense. (ofc,...

bounty locked, good luck! Your test output is correct but the implementation can be simpler. Can you rewrite all gated stores with IF? There shouldn't be two paths for gated...

> the prioritization/ordering of things could happen within a rewrite rule?

I think if you do the graph rewriting right you don't need to worry about the queue.

I can't repro the AMD issue on a real chip, it might be an emulator bug. I'll look into that one.

The diff looks very close! Some things I noticed:

1. This isn't wrapping the second reduceop in the IF for multi reduce. It should, LOAD is expensive. `TestLinearizer.test_double_reduce_multireduce` ![image](https://github.com/user-attachments/assets/4f47112f-8443-438f-8939-c3ab1dd44e9a)
2....

@ianpaul10 you can access a green box for that, sent info on discord.