Add unit test to show cpu compile error from foundation models
See https://github.com/mir-group/nequip/issues/553
CI is erroring out because the ruff linter is failing btw. Anyway, @kavanase could you please try packaging on a logic node that is CPU only to see if this fixes this problem.
Thanks for the catch. I made some tweaks to the pre-commit hooks to made it consistent with the lint in CI because I was having issues with a black autoformatting loop when I initially tried to call the pre-commit to fix the lint issue I missed.
Thanks! I completely forgot to update our pre-commit hooks since our migration to ruff
Anyway, @kavanase could you please try packaging on a logic node that is CPU only to see if this fixes this problem Sorry for the delay! Got held up with travel.
Here's the NequIP package file from a login node without access to GPU: (too large to directly upload here; 72 Mb)
https://drive.google.com/file/d/1CR2uxLanZdorZyhYuVUcKuf9Ku1rMPvR/view?usp=sharing
Thanks! can these more portable models be uploaded to the endpoint for "nequip.net:mir-group/NequIP-OAM-L:0.1"?
I'm only guessing that packaging on CPU-only devices can fix this problem, would need to check first before we update the website links, etc
There was an issue before with (accelerated) GPU inference when packaged on CPU-only devices though right? @cw-tan Is that avoided now?
When there were similar issues with MACE in the past the solution was to make the default that it always casts the model to CPU before saving regardless of device. I don't think you need to be on a machine that specifically doesn't have access to a GPU. I am not sure if the inductor stage might change or complicate any of this
There was an issue before with (accelerated) GPU inference when packaged on CPU-only devices though right? @cw-tan Is that avoided now?
@kavanase Yes, that's resolved. The problem was more for Allegro, see https://github.com/mir-group/allegro/commit/1b1b2308e9ef2bdc8c84226c156ddc004f220491
When there were similar issues with MACE in the past the solution was to make the default that it always casts the model to CPU before saving regardless of device. I don't think you need to be on a machine that specifically doesn't have access to a GPU. I am not sure if the inductor stage might change or complicate any of this
@CompRhys Good point, potentially worth sending all models to CPU before packaging in https://github.com/mir-group/nequip/blob/main/nequip/scripts/package.py . The checkpoint loading is a bit automagical since we just depend on Lightning (https://github.com/mir-group/nequip/blob/2ccb4878b7b0211d90b7cc041bf1bab4dbafc67d/nequip/model/saved_models/checkpoint.py#L78), so I'm unsure if it does some magic device handling when loading. Regardless, hopefully just manual .to("cpu") in the package script is enough.