Logan Adams
> Hi @loadams, can you help start the workflow? The model checkpoint path has been moved to the persistent storage as suggested. Apologies, I was out but it should be...
> Hi @loadams, I have added the gptj and baichuan7b models to the autotp workflow, can you help start the workflow? Thanks! Done. > Now this workflow is ready for testing autotp...
> Hi @loadams, I see the environment issue should now be fixed. Can you help restart the workflow? Thanks! @delock - yes, apologies that took so long.
> @loadams I ran these two tests in my local environment, and they didn't take long. Can you help run this workflow again to see whether it is reproducible? Thanks!...
> Hi @loadams, I tried running these UTs in my environment and didn't see this timeout. Since the CPU UTs are already covered by the `cpu-torch-latest` workflow, I removed the unit tests in...
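For reference, a minimal sketch of how a couple of unit tests could be timed locally before asking CI to re-run; the test file paths below are hypothetical placeholders, not the actual DeepSpeed test files:

```python
# Minimal sketch: run specific unit tests locally via pytest to check for timeouts.
# The test paths are hypothetical placeholders, not the actual DeepSpeed test files.
import sys
import pytest

if __name__ == "__main__":
    # -x stops at the first failure; -v prints each test as it runs.
    exit_code = pytest.main([
        "-x", "-v",
        "tests/unit/inference/test_autotp_example.py",      # hypothetical path
        "tests/unit/inference/test_checkpoint_example.py",  # hypothetical path
    ])
    sys.exit(exit_code)
```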
cc: @jithunnair-amd and @rraminen - new issue opened because we closed the previous one. Once we merge the ROCm 5.6 update PR, I believe there will still be failing tests,...
Hi @annopackage - can you share a full minimal repro script with us please?
@alvieirajr - were you able to validate that swapping these resolved your issues?
@liuhui0401 - this seems like a CUDA error, or the GPUs are in a bad state. If you power cycle the machine, does nvidia-smi work?
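If it helps, here is a minimal sketch of the kind of check that can confirm whether the GPUs came back in a usable state after a power cycle; it assumes PyTorch is installed and that `nvidia-smi` is on the PATH, and is not DeepSpeed-specific:

```python
# Minimal sketch of a GPU health check after a power cycle.
import subprocess

def check_gpus():
    # 1) Does the driver respond at all?
    result = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
    print(result.stdout or result.stderr)

    # 2) Can CUDA be initialized and used from PyTorch?
    import torch
    if not torch.cuda.is_available():
        print("torch.cuda.is_available() is False -- driver or CUDA problem")
        return
    for i in range(torch.cuda.device_count()):
        x = torch.ones(8, device=f"cuda:{i}")  # tiny allocation to exercise the device
        print(f"GPU {i}: {torch.cuda.get_device_name(i)}, test sum = {x.sum().item()}")

if __name__ == "__main__":
    check_gpus()
```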