Ma, Guokai
The specific error below occurs because the container was not created with the CAP_SYS_NICE capability. I'll check the additional flags I use for the container and post them here. ``` set_mempolicy:...
On my system, the Docker container needs to be started with the SYS_NICE capability, using the following flag. ``` --cap-add SYS_NICE ``` I'm not sure how to turn this on for the DeepSpeed runner....
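To confirm from inside a container whether the capability was actually granted, one option is to read the effective capability mask from `/proc/self/status`. The following is a minimal sketch, not part of DeepSpeed: the helper name `has_cap_sys_nice` is hypothetical, and it assumes a Linux `/proc` filesystem where CAP_SYS_NICE is bit 23 of the `CapEff` mask.

```python
from pathlib import Path

# Bit index of CAP_SYS_NICE in Linux capability masks.
CAP_SYS_NICE = 23


def has_cap_sys_nice(status_text: str) -> bool:
    """Return True if the CapEff line of a /proc/<pid>/status dump has CAP_SYS_NICE set.

    Hypothetical helper for illustration; returns False when no CapEff line is found.
    """
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            # The mask is printed as a hexadecimal string after the colon.
            mask = int(line.split(":", 1)[1].strip(), 16)
            return bool(mask & (1 << CAP_SYS_NICE))
    return False


if __name__ == "__main__":
    status = Path("/proc/self/status")
    if status.exists():
        print("CAP_SYS_NICE:", has_cap_sys_nice(status.read_text()))
```

Running this inside a container started with and without `--cap-add SYS_NICE` should show whether the flag took effect.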
The proper behavior for DeepSpeed `--bind_cores_to_rank` is to bind memory to the NUMA node only if the system allows it. This makes DeepSpeed behave more gracefully in a Docker environment. The latest fix in...
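The idea of binding memory only when the system permits it can be sketched as follows. This is an illustration under stated assumptions, not DeepSpeed's actual implementation: the function names `membind_allowed` and `numactl_prefix` are hypothetical, and the probe assumes that `numactl -m <node> true` exits with a non-zero status when `set_mempolicy` is not permitted.

```python
import shutil
import subprocess
from typing import Callable, List


def membind_allowed(node: int) -> bool:
    """Probe whether memory binding works by running a no-op under `numactl -m`.

    Hypothetical helper; assumes numactl exits non-zero when set_mempolicy is denied.
    """
    if shutil.which("numactl") is None:
        return False
    try:
        result = subprocess.run(
            ["numactl", "-m", str(node), "true"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except OSError:
        return False
    return result.returncode == 0


def numactl_prefix(
    cores: str, node: int, probe: Callable[[int], bool] = membind_allowed
) -> List[str]:
    """Build a numactl command prefix: always bind cores, bind memory only if allowed."""
    prefix = ["numactl", "-C", cores]
    if probe(node):
        # Memory binding is optional; skip it when the container lacks permission.
        prefix += ["-m", str(node)]
    return prefix
```

For example, `numactl_prefix("0-3", 0)` would yield a prefix with `-m 0` on a host that permits memory binding, and a core-only prefix in a container started without `SYS_NICE`. Passing the probe as a parameter keeps the fallback logic testable without numactl installed.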
Hi @loadams, the blocking issue for this PR has been resolved. Can you help restart the workflow? Thanks!
@tjruwase Thanks! The autotp workflow currently passes. One thing I'm not sure about is whether the downloaded checkpoint will be preserved across different runs. This will be the most time-consuming part...
@mrwyattii @loadams it would be great if there were a link showing how persistence is handled on this runner. > @delock, it is great to see the CI now passing....
> > @mrwyattii @loadams it will be great if there is any link showing how persistency is done on this runner. > > > @delock, it is great to see...
Hi @loadams, can you help start the workflow? The model checkpoint path has been moved to the persistent storage as suggested.
> > Hi @loadams can you help start the workflow? The model checkpoint path had been moved to the persistent storage as suggested. > > Apologies, I was out but...
@loadams Intel Extension for PyTorch 2.2 was released today. Restarting the workflow should resolve the failure. https://pypi.org/project/intel-extension-for-pytorch/ > > > Hi @loadams can you help start the workflow? The...