21 comments by nicolov

Clang 9 now includes a `clang-ifso` tool to create such stubs. Would it be easy to use it in Bazel? https://www.phoronix.com/scan.php?page=news_item&px=Clang-Interface-Stubs

I've set it up in Bazel (it was fairly straightforward), but the problem is that the Clang tool produces an `.ifso` file for each `.o` file, not for each `.so`....

Sadly, the feature is not really useful right now. First of all, the interface stubs implementation (in Clang 9, haven't checked trunk yet) crashes even with the simplest C++ example:...

Hello, thank you very much for using this! Unfortunately the code is now quite old, so you might have to, e.g., google `logsumexp` and see where it was moved to...
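If the old import no longer resolves and you just want to get unblocked, a small stand-in is easy to write by hand. The sketch below is not the repo's code; it assumes the function was only used on a flat sequence of floats. If the original import came from SciPy, the function currently lives at `scipy.special.logsumexp`, which is probably the better fix.

```python
import math


def logsumexp(values):
    """Return log(sum(exp(v) for v in values)) without overflowing.

    Hypothetical stand-in for a relocated library function; it only
    handles a flat iterable of floats, not arrays or axis arguments.
    """
    values = list(values)
    if not values:
        return float("-inf")  # log of an empty sum, i.e. log(0)
    peak = max(values)
    if math.isinf(peak):
        # All entries are -inf (result -inf) or some entry is +inf (result +inf).
        return peak
    # Subtracting the maximum keeps every exponent <= 0, so exp() cannot overflow.
    return peak + math.log(sum(math.exp(v - peak) for v in values))
```

Subtracting the maximum first is the usual trick: a naive `math.log(sum(math.exp(v) ...))` raises `OverflowError` for inputs around 1000, while this version does not.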

Hey Micah, nice to hear from you. Yes, I'd be happy to merge one. Also keep in mind that this repo is substantially different from what you saw at Cruise....

@awni I tried to apply your comments and pushed https://github.com/ml-explore/mlx/pull/975/commits/501b889de904adf3b60f4b19d16b3ee46bdc5db2 to avoid creating a new command buffer for each kernel, but I get: ``` -[AGXG13XFamilyCommandBuffer tryCoalescingPreviousComputeCommandEncoderWithConfig:nextEncoderClass:]:1015: failed assertion `A command...

> Rather call the `device.get_command_encoder()` to get the active encoder. I also tried doing that in https://github.com/ml-explore/mlx/pull/975/commits/b979ccf3e4051b99320e5d69abe138359c9f0660 which just produces the wrong result.

I also tried tracing, and Xcode complains about redundant bindings. Should I somehow refactor how I bind buffers to the encoder?

I fixed the code (needed to introduce one more kernel to ensure the atomics were synchronized properly across different threadgroups). It's a bit slow, so I'll try to improve it...

I compared with PyTorch, which can use either MAGMA or cuSOLVER. I used a cloud machine with an A100 and a desktop with a 1050 (which should be in the ballpark...