Ray Gao

16 comments by Ray Gao

The W&B step count warning is normal because we only resume from the last checkpoint, i.e. if the latest checkpoint is at step 5000 and the job crashed at step 5100, then it should resume at step...
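
For illustration, a minimal sketch (not fairchem's actual trainer loop) of why W&B complains on resume: the run restarts from the checkpointed step and re-logs steps W&B already recorded before the crash. The project name and run id are placeholders.

```python
# Minimal sketch of the resume-overlap that triggers the warning.
import wandb

run = wandb.init(project="demo", id="my-run", resume="allow")
start_step = 5000  # step stored in the latest checkpoint

for step in range(start_step, start_step + 200):
    # Steps 5000..5100 overlap with what was logged before the crash,
    # so W&B warns that the step count did not increase and skips them.
    run.log({"loss": 1.0 / (step + 1)}, step=step)
```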

For CUDA_VISIBLE_DEVICES to work, you need to have it set in the Slurm environment. We don't use this right now and instead just assign the device for the rank using [torch.cuda.set_device](https://github.com/facebookresearch/fairchem/blob/abcdf661926ce13e9d8cf3fe1d4484e58780a1fd/src/fairchem/core/common/distutils.py#L244)...
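
A minimal sketch of that per-rank device assignment, assuming the launcher exports LOCAL_RANK (as torchrun does); this mirrors the spirit of the linked code rather than reproducing it:

```python
# Each process picks its own GPU from LOCAL_RANK instead of relying
# on CUDA_VISIBLE_DEVICES being set by Slurm.
import os
import torch

local_rank = int(os.environ.get("LOCAL_RANK", 0))
if torch.cuda.is_available():
    torch.cuda.set_device(local_rank)
    device = torch.device(f"cuda:{local_rank}")
else:
    device = torch.device("cpu")
```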

Hi, yes: in finetuning we remove all the heads and re-initialize them from scratch (the weights of the backbone are retained), hence the accuracy will be lower, but it should...
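
A hypothetical sketch of that setup: the pretrained backbone weights are loaded while the heads keep their fresh random init. The `backbone`/`energy_head` names and checkpoint layout are illustrative, not fairchem's actual classes.

```python
# Keep pretrained backbone weights; heads stay randomly initialized.
import torch
import torch.nn as nn

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(16, 64), nn.SiLU(), nn.Linear(64, 64))
        self.energy_head = nn.Linear(64, 1)  # re-initialized for finetuning

model = Model()
pretrained = torch.load("pretrained.pt")  # state dict from pretraining
# Load only the backbone entries; strict=False leaves the heads untouched.
backbone_state = {k: v for k, v in pretrained.items() if k.startswith("backbone.")}
missing, unexpected = model.load_state_dict(backbone_state, strict=False)
```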

You can; this is basically the same as continuing to train the same model. It would mean that your data needs to be at the same level of DFT theory...

To train the full model with all the heads, the easiest way is to train the original UMA itself with the following YAML (if you are using uma-s): https://github.com/facebookresearch/fairchem/blob/main/configs/uma/training_release/uma_sm_conserve_finetune.yaml, i.e.:...
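
If you need to point that released config at your own data before launching, a sketch along these lines could work; the key names below are assumptions about the YAML schema, so check the actual file:

```python
# Copy the release config and adjust it (keys are illustrative only).
import yaml

with open("uma_sm_conserve_finetune.yaml") as f:
    cfg = yaml.safe_load(f)

# Hypothetical edit; the real schema may use different keys.
cfg.setdefault("dataset", {})["train_path"] = "/path/to/your/dft/data"

with open("my_uma_finetune.yaml", "w") as f:
    yaml.safe_dump(cfg, f)
```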

We would have the model artifact and the exact commit (or version) of the source code to run. The issue right now is that we're not tying the model to the code...
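
A minimal sketch (not fairchem's mechanism) of one way to tie an artifact to the code: record the current commit hash inside the checkpoint at save time, so the matching source version can be checked out when the model is loaded later.

```python
# Save the git commit alongside the model weights.
import subprocess
import torch
import torch.nn as nn

model = nn.Linear(8, 1)  # stand-in for the real model
commit = subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()
torch.save({"state_dict": model.state_dict(), "code_commit": commit}, "ckpt.pt")

ckpt = torch.load("ckpt.pt")
print(f"checkout {ckpt['code_commit']} before loading this artifact")
```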