
Add unit test to show cpu compile error from foundation models

CompRhys opened this issue 7 months ago · 9 comments

See https://github.com/mir-group/nequip/issues/553

CompRhys avatar Oct 04 '25 00:10 CompRhys

CI is erroring out because the ruff linter is failing, btw. Anyway, @kavanase, could you please try packaging on a login node that is CPU-only, to see if this fixes the problem?

cw-tan avatar Oct 07 '25 00:10 cw-tan

Thanks for the catch. I made some tweaks to the pre-commit hooks to make them consistent with the lint in CI, because I was running into a black autoformatting loop when I initially tried to run pre-commit to fix the lint issue I had missed.

CompRhys avatar Oct 07 '25 01:10 CompRhys

Thanks! I completely forgot to update our pre-commit hooks after our migration to ruff.

cw-tan avatar Oct 07 '25 01:10 cw-tan

> Anyway, @kavanase, could you please try packaging on a login node that is CPU-only, to see if this fixes the problem?

Sorry for the delay! Got held up with travel.

Here's the NequIP package file from a login node without GPU access (too large to upload directly here; 72 MB): https://drive.google.com/file/d/1CR2uxLanZdorZyhYuVUcKuf9Ku1rMPvR/view?usp=sharing

kavanase avatar Oct 08 '25 18:10 kavanase

Thanks! Can these more portable models be uploaded to the endpoint for `nequip.net:mir-group/NequIP-OAM-L:0.1`?

CompRhys avatar Oct 08 '25 20:10 CompRhys

I'm only guessing that packaging on CPU-only devices fixes this problem; we'd need to check first before updating the website links, etc.

cw-tan avatar Oct 08 '25 20:10 cw-tan

There was an issue before with (accelerated) GPU inference when models were packaged on CPU-only devices though, right? @cw-tan, is that avoided now?

kavanase avatar Oct 08 '25 20:10 kavanase

When there were similar issues with MACE in the past, the solution was to make the default behavior always cast the model to CPU before saving, regardless of device. I don't think you need to be on a machine that specifically lacks GPU access. I'm not sure whether the inductor stage changes or complicates any of this.

CompRhys avatar Oct 08 '25 20:10 CompRhys

> There was an issue before with (accelerated) GPU inference when models were packaged on CPU-only devices though, right? @cw-tan, is that avoided now?

@kavanase Yes, that's resolved. The problem was more of an Allegro issue; see https://github.com/mir-group/allegro/commit/1b1b2308e9ef2bdc8c84226c156ddc004f220491

> When there were similar issues with MACE in the past, the solution was to make the default behavior always cast the model to CPU before saving, regardless of device. I don't think you need to be on a machine that specifically lacks GPU access. I'm not sure whether the inductor stage changes or complicates any of this.

@CompRhys Good point, potentially worth sending all models to CPU before packaging in https://github.com/mir-group/nequip/blob/main/nequip/scripts/package.py. The checkpoint loading is a bit automagical since we just depend on Lightning (https://github.com/mir-group/nequip/blob/2ccb4878b7b0211d90b7cc041bf1bab4dbafc67d/nequip/model/saved_models/checkpoint.py#L78), so I'm not sure whether it does any magic device handling when loading. Regardless, hopefully a manual `.to("cpu")` in the package script is enough.

cw-tan avatar Oct 08 '25 21:10 cw-tan