DeepSpeed
Workflow for AutoTP
This PR adds a new extensible workflow for automatic tensor parallelism, or AutoTP (https://www.deepspeed.ai/tutorials/automatic-tensor-parallelism/). The workflow aims to provide a way to validate AutoTP for LLMs.
The specific error below occurs because the container was not created with the CAP_SYS_NICE capability. I'll check the additional flags I use for my container and post them here.
set_mempolicy: Operation not permitted
setting membind: Operation not permitted
On my system, the Docker container needs to be started with the SYS_NICE capability, using the following flag:
--cap-add SYS_NICE
I'm not sure how to turn this on for the DeepSpeed runner. Without this capability, we have to remove the --bind_cores_to_rank flag, but that would significantly slow down the test. @mrwyattii what do you think? We could remove --bind_cores_to_rank to get the workflow running first, then work on enabling the SYS_NICE capability. Would that work?
The proper behavior for DeepSpeed's --bind_cores_to_rank is to bind memory to a NUMA node only if the system allows it. This makes DeepSpeed behave more gracefully in Docker environments. The latest fix in DeepSpeed has been verified on my own runner, both with and without the SYS_NICE capability:
https://github.com/delock/DeepSpeedSYCLSupport/actions/runs/7581455004/job/20649083143
https://github.com/delock/DeepSpeedSYCLSupport/actions/runs/7581918228/job/20650446510
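The "bind memory only if the system allows it" behavior can be illustrated with a small sketch. This is not DeepSpeed's actual implementation; the function names `membind_allowed` and `numactl_prefix` are hypothetical, and probing by running a trivial command under `numactl --membind` is an assumption about how one might detect the missing SYS_NICE capability.

```python
import subprocess
from typing import List, Optional


def membind_allowed(probe_node: int = 0) -> bool:
    """Return True if `numactl --membind` succeeds in this environment.

    Assumption: running the trivial `true` command under numactl is a
    good enough probe for whether memory binding is permitted.
    """
    try:
        result = subprocess.run(
            ["numactl", f"--membind={probe_node}", "true"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
    except OSError:
        # numactl is not installed or cannot be executed.
        return False
    return result.returncode == 0


def numactl_prefix(cores: str, numa_node: Optional[int], can_membind: bool) -> List[str]:
    """Build a numactl prefix that always binds cores, and binds memory
    only when the environment permits it."""
    prefix = ["numactl", f"--physcpubind={cores}"]
    if can_membind and numa_node is not None:
        prefix.append(f"--membind={numa_node}")
    return prefix
```

In a container started without `--cap-add SYS_NICE`, the probe would be expected to fail, and the prefix falls back to core binding only instead of aborting the run.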
Hi @loadams, the blocking issue for this PR has been resolved. Can you help restart the workflow? Thanks!
@tjruwase Thanks! The autotp workflow now passes. One thing I'm not sure about is whether the downloaded checkpoint will be preserved across runs. Downloading it is the most time-consuming part of this workflow. I will need some guidance (e.g., which directory on the runner is preserved?) or will have to observe another run to see whether the checkpoint persists.
@delock, it is great to see the CI now passing.
I think @mrwyattii or @loadams would be the best to answer questions about the checkpoint.
@mrwyattii @loadams it would be great if there were a link showing how persistence is handled on this runner.
I know @mrwyattii and I still need to leave feedback on this PR, but there is an example of where things are kept on the blob storage here. I'm not sure that's the best example, but it shows persisting a larger download/install.
Thanks for the suggestion, @loadams. Looking at how '/blob' is used in DeepSpeed workflows, I found that I need to use the default value of TRANSFORMERS_CACHE. Let me make the change and see if the checkpoint persists.
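One way to sanity-check that the checkpoint cache lands on persistent storage is sketched below. The helper names, the example directory `/blob/transformers_cache`, and the fallback location used when TRANSFORMERS_CACHE is unset are all assumptions for illustration; the actual runner configuration is not shown in this thread.

```python
import os
from pathlib import PurePosixPath
from typing import Mapping


def cache_dir(env: Mapping[str, str]) -> PurePosixPath:
    """Resolve the directory where model checkpoints would be cached."""
    value = env.get("TRANSFORMERS_CACHE")
    if value:
        return PurePosixPath(value)
    # Assumption: fall back to the usual per-user Hugging Face cache.
    return PurePosixPath(os.path.expanduser("~/.cache/huggingface/hub"))


def is_persistent(path: PurePosixPath, persistent_root: str = "/blob") -> bool:
    """Return True if `path` lives under the persistent mount."""
    try:
        path.relative_to(persistent_root)
        return True
    except ValueError:
        # `path` is not located under `persistent_root`.
        return False
```

A workflow step could call `is_persistent(cache_dir(os.environ))` and print a warning when the result is false, which would make a non-persistent cache visible in the job log.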
Hi @loadams, can you help start the workflow? The model checkpoint path has been moved to the persistent storage as suggested.
Apologies, I was out but it should be running now.
Thanks! The workflow failure should be due to a version mismatch between PyTorch (2.2.0) and Intel Extension for PyTorch (2.1). The recent failure in the cpu-inference workflow should have the same cause. An upcoming release of Intel Extension for PyTorch should fix it. Let me ping you when the new version is released.
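A mismatch like this could be caught early with a check along the following lines. This is a minimal sketch, assuming that the two packages are compatible when their major.minor versions agree; the function names are hypothetical and not part of either library.

```python
import re
from typing import Tuple


def major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from a version string such as '2.2.0+cpu'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def versions_match(torch_version: str, ipex_version: str) -> bool:
    """Return True when both versions share the same major.minor pair."""
    return major_minor(torch_version) == major_minor(ipex_version)
```

With the versions mentioned above, `versions_match("2.2.0", "2.1")` is false, while a 2.2 release of the extension would pass the check.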
@loadams Intel Extension for PyTorch 2.2 was released today. Restarting the workflow should resolve the failure. https://pypi.org/project/intel-extension-for-pytorch/
The latest error is caused by a change to the command line of hf_compare.py in DeepSpeedExamples. The latest commit adapts to this change.
Hi @loadams, can you help start the workflow for this PR? Thanks!
@loadams thanks for starting the workflow. This workflow is now ready for testing autotp with various popular models. Should we continue adding new models in this PR? I plan to add models for validation one by one so that we have a steadily growing list.
Hi @loadams, I have added the gptj and baichuan7b models to the autotp workflow. Can you help start the workflow? Thanks!
Done.
"This workflow is now ready for testing autotp with various popular models. Should we continue adding new models in this PR? I plan to add models for validation one by one so that we have a steadily growing list."
For this, the concern that @mrwyattii and I had was whether there is any way to include this in the existing cpu-inference workflow, since the setup is similar. Perhaps add a step that runs only if the build is a scheduled build (non-PR weekly job) and that opens an issue if this workflow fails?
I think it's doable. We can test these steps in the cpu-inference workflow, then guard them with the schedule event.
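The guard itself would normally live in the workflow definition, but the same idea can be sketched from inside a test script. The sketch below relies on the `GITHUB_EVENT_NAME` environment variable that GitHub Actions sets for each run; the function name `should_run_autotp` is hypothetical.

```python
import os
from typing import Mapping


def should_run_autotp(env: Mapping[str, str] = os.environ) -> bool:
    """Return True only when the run was triggered by a schedule event."""
    return env.get("GITHUB_EVENT_NAME") == "schedule"
```

A script could skip the AutoTP steps when this returns false, so pull-request runs of cpu-inference stay short while the scheduled job exercises the full model list.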
Sounds good, thanks! Just tag me when you need the workflows started.
@loadams I have moved the autotp workflow to the end of the cpu-inference workflow. Can you help start the workflow? Thanks!
I couldn't find the reason for the GPT-J failure. This model is special because it ships as a single large FP32 checkpoint (24GB), which probably added too much memory pressure. I replaced it with Falcon-7b, whose checkpoint is split into much smaller chunks. Let's see whether this passes. I also added a probe for /blob to understand the capacity of the checkpoint cache.
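A capacity probe of this kind can be written with the standard library alone. The sketch below is an illustration, not the probe actually added to the workflow; the helper names and the GiB reporting unit are assumptions.

```python
import shutil
from typing import Dict

GIB = 1024 ** 3  # bytes per GiB


def capacity_gib(path: str) -> Dict[str, float]:
    """Report total, used, and free space (in GiB) for the filesystem
    that contains `path`."""
    usage = shutil.disk_usage(path)
    return {
        "total": usage.total / GIB,
        "used": usage.used / GIB,
        "free": usage.free / GIB,
    }


def fits(path: str, required_gib: float) -> bool:
    """Return True if the filesystem has at least `required_gib` free."""
    return capacity_gib(path)["free"] >= required_gib
```

On the runner, `capacity_gib("/blob")` would show how much room the checkpoint cache has, and `fits("/blob", 24)` would indicate whether a checkpoint the size of GPT-J's could be stored there. Note that this measures disk space only, not the memory pressure suspected above.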
@loadams the Falcon 7b model is not yet supported by DeepSpeed AutoTP. I updated the workflow to test Baichuan 7b instead. Can you help restart the workflow? Thanks!
Hi @loadams, the command line for the Baichuan model has been changed to fix the test error. The Baichuan model contains remote code, so trust_remote_code needs to be set to true. Can you help restart the workflow? Thanks!
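For context, this is roughly what the setting looks like when loading such a model with the Hugging Face transformers API. It is a sketch, not the code in hf_compare.py: the allow-list, the helper names, and the model ID `baichuan-inc/Baichuan-7B` are assumptions made for illustration.

```python
from typing import Any, Dict

# Hypothetical allow-list of models whose repositories ship custom code.
REMOTE_CODE_MODELS = {"baichuan-inc/Baichuan-7B"}


def load_kwargs(model_name: str) -> Dict[str, Any]:
    """Enable trust_remote_code only for models known to require it."""
    return {"trust_remote_code": model_name in REMOTE_CODE_MODELS}


def load_model(model_name: str):
    """Load the tokenizer and model, passing trust_remote_code as needed."""
    # Imported lazily so the helper above works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    kwargs = load_kwargs(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name, **kwargs)
    model = AutoModelForCausalLM.from_pretrained(model_name, **kwargs)
    return tokenizer, model
```

Keeping the flag off by default and enabling it per model limits execution of repository-supplied code to the models that actually need it.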
Hi @loadams @tjruwase, can you help start this workflow? Thanks!
Hi @loadams, it looks like the environment issue has been fixed. Can you help restart the workflow? Thanks!
@delock - yes, apologies that took so long.
@loadams I ran these two tests in my local environment, and they didn't take that long. Can you help run this workflow again to see whether the problem is reproducible? Thanks!
Re-running now
Hi @loadams, I tried running these UTs in my environment and didn't see this timeout. Since CPU UTs are already covered by the cpu-torch-latest workflow, I removed the unit tests from this workflow to focus on the AutoTP test only. I also removed the dependency on oneCCL and now use stock PyTorch to better focus on AutoTP functionality. Can you help start the workflow? Thanks!
Done