
ROCm support

Open countradooku opened this issue 6 months ago • 17 comments

Hello @EricLBuehler

Maybe you could add ROCm support here by using our ROCm wrappers:

https://github.com/RustNSparks/rocm-rs

I could help you.

countradooku avatar May 17 '25 20:05 countradooku

Maybe you should add it to Candle first. We added a macro for writing kernels in Rust, if that's helpful.

countradooku avatar May 17 '25 20:05 countradooku

@EricLBuehler I can donate an AMD GPU if needed.

sonicrules1234 avatar Jun 02 '25 16:06 sonicrules1234

> Maybe you should add it to Candle first.

FWIW, Burn / CubeCL has implemented a backend for Candle (early stages), but they also provide their own alternatives for CUDA, ROCm, etc.

The ROCm support in Burn is delegated to their CubeCL project, which seems to just use a bindgen-generated sys crate directly, IIRC, so you may want to look into contributing rocm-rs integration there if it'd make sense?

Then for mistral.rs, there might be a larger benefit in shifting to Burn (or continuing to support Candle directly 🤷‍♂) if broader platform support is more viable that way?

I definitely would be reluctant to suggest mistral.rs manage platform support on its side when it's viable to delegate to a more suitable dependency like Candle / Burn?

polarathene avatar Jun 04 '25 23:06 polarathene

I was looking into burn yesterday, and according to the Burn Book, it doesn't really support quantization yet.

sonicrules1234 avatar Jun 05 '25 13:06 sonicrules1234

> I definitely would be reluctant to suggest mistral.rs manage platform support on its side when it's viable to delegate to a more suitable dependency like Candle / Burn?

Switching to burn is very attractive, but as @sonicrules1234 said, they don't have great quantization support yet. It would also require quite a rework of the types to support this, but it is theoretically possible.

EricLBuehler avatar Jun 05 '25 13:06 EricLBuehler

> I definitely would be reluctant to suggest mistral.rs manage platform support on its side when it's viable to delegate to a more suitable dependency like Candle / Burn?

> Switching to burn is very attractive, but as @sonicrules1234 said, they don't have great quantization support yet. It would also require quite a rework of the types to support this, but it is theoretically possible.

Did you manage to play with our rocm crate?

countradooku avatar Jun 05 '25 14:06 countradooku

> according to the Burn Book, it doesn't really support quantization yet.

It seems supported, but limited to Int8 presently.

They've been actively working on the support, with a notable refactor in May. I think their plan is to add Int4 support in the near future (IIRC, the refactor made improving the support much more flexible). That refactor's CubeCL-related PR has been merged, but the one for Burn is still in review.
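For context on what the Int8 support entails, here is a minimal, generic sketch of symmetric per-tensor Int8 quantization (illustration only; this is not Burn's API, and the function names are invented):

```rust
// Symmetric per-tensor Int8 quantization: map floats into [-127, 127]
// using a single scale derived from the largest absolute value.
fn quantize_i8(values: &[f32]) -> (Vec<i8>, f32) {
    let max_abs = values.iter().fold(0.0f32, |m, v| m.max(v.abs()));
    // Guard against an all-zero tensor.
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 };
    let q = values
        .iter()
        .map(|v| (v / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    (q, scale)
}

// Dequantize back to f32 by multiplying with the stored scale.
fn dequantize_i8(q: &[i8], scale: f32) -> Vec<f32> {
    q.iter().map(|&v| v as f32 * scale).collect()
}

fn main() {
    let input = [-1.0f32, 0.5, 1.0];
    let (q, scale) = quantize_i8(&input);
    let restored = dequantize_i8(&q, scale);
    // Round-trip error is bounded by half the quantization step.
    for (a, b) in input.iter().zip(&restored) {
        assert!((a - b).abs() <= scale * 0.5 + 1e-6);
    }
    println!("quantized: {:?}, scale: {}", q, scale);
}
```

Int4 follows the same idea with a 15-level range, which is why packing and accuracy trade-offs get harder at lower bit widths.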


> Switching to burn is very attractive
>
> Would also require quite a rework of the types to support this, but it is possible theoretically.

Understood. Just putting it out there if the additional backends support or other features Burn offers might justify the effort.

If there are any specific requirements mistral.rs needs that are blockers, perhaps those could be clarified? Depending on the state of their Candle backend, perhaps that'd ease a transition support-wise 🤷‍♂

For reference, the burn tracking issue for quantization support is here: https://github.com/tracel-ai/burn/issues/464#issuecomment-2946555500

polarathene avatar Jun 05 '25 22:06 polarathene

> Did you manage to play with our rocm crate?

@sonicrules1234 Not yet, unfortunately; I've been working through some bugs and preparing for 0.6.0 (hoping to release very soon). I'll look at this in the backend-expansion push we are planning for 0.7.0.

EricLBuehler avatar Jun 05 '25 23:06 EricLBuehler

> If there are any specific requirements mistral.rs needs that are blockers, perhaps those could be clarified? Depending on the state of their Candle backend, perhaps that'd ease a transition support-wise 🤷‍♂

@polarathene I think the main blocker is that the Candle and Burn APIs are so divergent it would take a long time to port everything. Also, for mistral.rs, we have several sub-crates like mistralrs-quant and mistralrs-paged-attn that house quantization and other specialized kernels, which would take a long time to port to burn.

Ideally burn could offer some sort of FFI capability (like Candle has through CustomOp) that would allow these kernels to be reused. If that were to happen, transitioning to burn would be an amazing step: the possibility of so many backends would really open up adoption opportunities!
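To show the shape of such an extension point, here is a toy, std-only sketch. Candle's real `CustomOp1` trait operates on tensor storage and layouts (with per-backend `cpu_fwd`/`cuda_fwd`/`metal_fwd` methods) rather than on a `Vec<f32>`, so treat the names and signatures below as illustrative only:

```rust
// Toy stand-in for a CustomOp-style escape hatch: implement a trait for a
// hand-written fused kernel, then apply it to tensor data generically.
trait CustomOp1 {
    fn name(&self) -> &'static str;
    fn cpu_fwd(&self, input: &[f32]) -> Vec<f32>;
}

/// Stand-in for a fused kernel (here: scale followed by ReLU in one pass).
struct ScaleRelu {
    scale: f32,
}

impl CustomOp1 for ScaleRelu {
    fn name(&self) -> &'static str {
        "scale-relu"
    }
    fn cpu_fwd(&self, input: &[f32]) -> Vec<f32> {
        input.iter().map(|x| (x * self.scale).max(0.0)).collect()
    }
}

/// Generic dispatch: the framework owns this, user code only supplies the op.
fn apply_op1(input: &[f32], op: &dyn CustomOp1) -> Vec<f32> {
    op.cpu_fwd(input)
}

fn main() {
    let op = ScaleRelu { scale: 3.0 };
    let out = apply_op1(&[-1.0, 2.0], &op);
    assert_eq!(out, vec![0.0, 6.0]);
    println!("{} ok: {:?}", op.name(), out);
}
```

The appeal of this pattern is exactly what's described above: the framework never needs to know about the kernel internals, so existing CUDA/Metal (or ROCm) kernels can be plugged in behind the trait.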

EricLBuehler avatar Jun 05 '25 23:06 EricLBuehler

> If there are any specific requirements mistral.rs needs that are blockers, perhaps those could be clarified? Depending on the state of their Candle backend, perhaps that'd ease a transition support-wise 🤷‍♂

> @polarathene I think the main blocker is that the Candle and Burn APIs are so divergent it would take a long time to port everything. Also, for mistral.rs, we have several sub-crates like mistralrs-quant and mistralrs-paged-attn that house quantization and other specialized kernels, which would take a long time to port to burn.

> Ideally burn could offer some sort of FFI capability (like Candle has through CustomOp) that would allow these kernels to be reused. If that were to happen, transitioning to burn would be an amazing step: the possibility of so many backends would really open up adoption opportunities!

It appears that a new ROCm backend is needed in Candle. Since the HIP/ROCm runtime interface is similar to the CUDA runtime, implementing something analogous to cudarc would be preferred to simplify integration. As far as I know, the CUDA kernels used in Candle can be reused by recompiling them with the ROCm compiler. For paged attention and other fused operations not currently supported by Candle, we could use CustomOp, similar to how Metal ops are handled.
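To make the "analogous to cudarc" point concrete: backends in these Rust GPU crates are typically selected at compile time via Cargo features. A minimal sketch (the feature names here are invented for illustration, not Candle's actual features):

```rust
// Compile-time backend selection via Cargo features. With no GPU feature
// enabled, the CPU fallback is compiled in; enabling `cuda` or `rocm`
// (hypothetical feature names) would swap in the corresponding runtime.

#[cfg(feature = "cuda")]
fn backend_name() -> &'static str {
    "cuda" // would link against cudarc / the CUDA runtime
}

#[cfg(feature = "rocm")]
fn backend_name() -> &'static str {
    "rocm" // would link against a cudarc-analogue over the HIP runtime
}

#[cfg(not(any(feature = "cuda", feature = "rocm")))]
fn backend_name() -> &'static str {
    "cpu" // fallback when no GPU feature is enabled
}

fn main() {
    println!("selected backend: {}", backend_name());
}
```

Because HIP deliberately mirrors the CUDA runtime API, a `rocm` arm like this could expose nearly the same surface as the CUDA one, which is what makes kernel reuse plausible.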

guoqingbao avatar Jun 06 '25 02:06 guoqingbao

> Also, for mistral.rs, we have several sub-crates like mistralrs-quant and mistralrs-paged-attn that house quantization and other specialized kernels, which would take a long time to port to burn.

While it probably won't help with the porting aspect, CubeCL seems similar to Triton in PyTorch, I think: a Rust API for writing platform-agnostic kernels that are JIT compiled for the target at runtime (not sure how much of a perf loss that incurs, however).

Presently you have hand-written/maintained kernels for CUDA and Metal, whereas CubeCL supports those along with ROCm, Vulkan, and more.

Burn delegates to CubeCL for all these targets AFAIK (the separate backends depend on cubecl + burn-cubecl), so if you can evaluate the perf impact and it isn't a notable regression, you could consider the pros/cons of leveraging CubeCL for the kernels portion and go from there? Porting just one of the kernels, like a model quantizer, would be a much smaller scope to begin with 😅
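As a toy illustration of the "one kernel definition, many targets" idea (everything here is invented; CubeCL's real codegen works from Rust kernel functions, not string templates):

```rust
// Toy JIT-style lowering: one abstract kernel description is turned into
// target-specific source text at runtime, then handed to the right compiler.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Target {
    Cuda,
    Rocm,
}

/// Emit per-target source for an element-wise scale kernel.
/// HIP deliberately mirrors CUDA, so the kernel body is identical and only
/// the toolchain note differs in this toy version.
fn lower_scale_kernel(target: Target) -> String {
    let header = match target {
        Target::Cuda => "// compiled with nvcc",
        Target::Rocm => "// compiled with hipcc",
    };
    format!(
        "{header}\n__global__ void scale(float* x, float s, int n) {{\n    \
         /* i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) x[i] *= s; */\n}}\n"
    )
}

fn main() {
    let src = lower_scale_kernel(Target::Rocm);
    assert!(src.contains("hipcc"));
    println!("{src}");
}
```

The perf question raised above is essentially whether this generated code can match hand-tuned kernels; that would need benchmarking per target.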

> Ideally burn could offer some sort of FFI capability (like Candle has through CustomOp) that would allow these kernels to be reused.

With the CubeCL suggestion, I'm not sure how well that would work with your existing use of Candle for such, or if it would be necessary.


EDIT: Oh, I see you're already quite aware of the project 😓 Glancing over the open issues, it might still be too early to try adopting it (quant support aside).

There is a Llama 3 model architecture example which may be useful as a reference; it looks similar to the equivalent in mistral.rs at a glance.

polarathene avatar Jun 06 '25 10:06 polarathene

I would still use my ROCm wrapper; it is quite complete.

countradooku avatar Jun 06 '25 10:06 countradooku

CubeCL basically just uses HIP bindings. I have wrappers for nearly all the libraries. We have now introduced a struct, ROCArray, that represents a GPU array; it is a work in progress. @EricLBuehler
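For readers unfamiliar with the shape of such a wrapper, here is a hypothetical, host-only stand-in (the type and methods below are invented for illustration; rocm-rs's real ROCArray manages HIP device memory through calls like hipMalloc/hipMemcpy rather than a `Vec`):

```rust
// Host-only sketch of a GPU array wrapper. The Vec stands in for a device
// allocation so the ownership/transfer pattern can be shown without a GPU.
struct RocArray<T> {
    data: Vec<T>, // stand-in for a device pointer + length
}

impl<T: Copy> RocArray<T> {
    /// "Upload" host data; a real wrapper would hipMalloc then hipMemcpy.
    fn from_host(host: &[T]) -> Self {
        RocArray { data: host.to_vec() }
    }

    /// "Download" the contents back to host memory.
    fn to_host(&self) -> Vec<T> {
        self.data.clone()
    }

    /// Number of elements in the (pretend) device allocation.
    fn len(&self) -> usize {
        self.data.len()
    }
}

fn main() {
    let a = RocArray::from_host(&[1u32, 2, 3]);
    assert_eq!(a.len(), 3);
    assert_eq!(a.to_host(), vec![1, 2, 3]);
    println!("round-tripped {} elements", a.len());
}
```

In the real crate, `Drop` would also free the device allocation, which is the main ergonomic win of an RAII wrapper over raw HIP bindings.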

countradooku avatar Jun 06 '25 10:06 countradooku

> CubeCL basically just uses HIP bindings. I have wrappers for nearly all the libraries. We have now introduced a struct, ROCArray, that represents a GPU array; it is a work in progress. @EricLBuehler

Thanks for the update, @radudiaconu0! I'll find time to take a look at it after I release v0.6.0. Like @guoqingbao said, ideally this could be as easy as reusing the current CUDA kernels.

> While it probably won't help with the porting aspect, CubeCL seems similar to Triton in PyTorch, I think: a Rust API for writing platform-agnostic kernels that are JIT compiled for the target at runtime (not sure how much of a perf loss that incurs, however).

@polarathene yeah, it's quite exciting! I've already been playing with it on constensor.

I think starting with the ROCm crate, followed by potentially porting to burn, could be good.

EricLBuehler avatar Jun 06 '25 11:06 EricLBuehler

@EricLBuehler yeah, the kernels can be reused. Maybe convert them with hipify.

countradooku avatar Jun 06 '25 11:06 countradooku

It looks like burn just released a new version, which supports quantization now.

sonicrules1234 avatar Oct 28 '25 18:10 sonicrules1234

I have a 7900 XTX and an AMD Ryzen AI Max+ 395. I can donate an RX 9070 XT, and I'm ready to help.

What's the status of this? How can I help?

JMLX42 avatar Nov 19 '25 14:11 JMLX42