
Workflow for AutoTP

Open delock opened this issue 1 year ago • 34 comments

This PR adds a new, extendable workflow for automatic tensor parallelism (https://www.deepspeed.ai/tutorials/automatic-tensor-parallelism/). The workflow aims to provide a way to validate AutoTP for LLMs.

delock avatar Jan 16 '24 10:01 delock

The specific error below occurs because the container was not created with the CAP_SYS_NICE capability. I'll check the additional flags I use for the container and post them here.

set_mempolicy: Operation not permitted
setting membind: Operation not permitted

delock avatar Jan 17 '24 03:01 delock

On my system, the Docker container needs to be started with the SYS_NICE capability using the following flag.

  --cap-add SYS_NICE

I'm not sure how to enable this for the DeepSpeed runner.

Without this capability, we have to remove the --bind_cores_to_rank flag, but this would significantly slow down the test. @mrwyattii, what are your thoughts on this? We could remove --bind_cores_to_rank to let the workflow run first, then work on how to enable the SYS_NICE capability. Would that work?
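For reference, a minimal sketch of what this looks like end to end (the image and script names are placeholders, not the actual CI setup):

```bash
# Start the container with CAP_SYS_NICE so set_mempolicy/membind calls are permitted.
docker run --cap-add SYS_NICE -it deepspeed-ci-image bash

# Inside the container, core/memory binding can then be enabled on the launcher.
deepspeed --bind_cores_to_rank inference_test.py   # inference_test.py is a placeholder
```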

delock avatar Jan 17 '24 12:01 delock

The proper behavior of DeepSpeed's --bind_cores_to_rank is to bind memory to a NUMA node only if the system allows it. This makes DeepSpeed behave more gracefully in a Docker environment. The latest fix in DeepSpeed has been verified on my own runner, both with and without the SYS_NICE capability. https://github.com/delock/DeepSpeedSYCLSupport/actions/runs/7581455004/job/20649083143 https://github.com/delock/DeepSpeedSYCLSupport/actions/runs/7581918228/job/20650446510
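A shell-level sketch of the same fallback idea, assuming numactl is installed (this illustrates the behavior, it is not the launcher's actual code):

```bash
# Probe whether memory binding is allowed; inside a container without CAP_SYS_NICE
# this fails with "set_mempolicy: Operation not permitted".
if numactl --membind=0 true 2>/dev/null; then
    BIND_ARGS="--bind_cores_to_rank"
else
    echo "Memory binding not permitted; continuing without it"
    BIND_ARGS=""
fi
deepspeed $BIND_ARGS inference_test.py   # inference_test.py is a placeholder
```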

delock avatar Jan 19 '24 09:01 delock

Hi @loadams, the blocking issue for this PR has been resolved. Can you help restart the workflow? Thanks!

delock avatar Jan 22 '24 02:01 delock

@tjruwase Thanks! The AutoTP workflow currently passes. One thing I'm not sure about is whether the downloaded checkpoint will be preserved across runs; downloading it is the most time-consuming part of this workflow. I'll need some guidance (i.e. which directory on the runner persists?) or will observe another run to see whether the checkpoint is preserved.

delock avatar Jan 22 '24 07:01 delock

@delock, it is great to see the CI now passing.

I think @mrwyattii or @loadams would be the best to answer questions about the checkpoint.

tjruwase avatar Jan 22 '24 17:01 tjruwase

@mrwyattii @loadams it would be great if there is any link showing how persistence is handled on this runner.

> @delock, it is great to see the CI now passing.
>
> I think @mrwyattii or @loadams would be the best to answer questions about the checkpoint.

delock avatar Jan 24 '24 01:01 delock

> @mrwyattii @loadams it would be great if there is any link showing how persistence is handled on this runner.
>
> @delock, it is great to see the CI now passing. I think @mrwyattii or @loadams would be the best to answer questions about the checkpoint.

I know @mrwyattii and I still need to leave feedback on this PR, but there's an example of how things are stored on the blob storage here. I'm not sure it's the best example, but it shows persisting a larger download/install.

loadams avatar Jan 24 '24 16:01 loadams

> @mrwyattii @loadams it would be great if there is any link showing how persistence is handled on this runner.
>
> @delock, it is great to see the CI now passing. I think @mrwyattii or @loadams would be the best to answer questions about the checkpoint.
>
> I know @mrwyattii and I still need to leave feedback on this PR, but there's an example of how things are stored on the blob storage here. I'm not sure it's the best example, but it shows persisting a larger download/install.

Thanks for the suggestion @loadams. By looking at the usage of '/blob' in the DeepSpeed workflows, I found I need to use the default value of TRANSFORMERS_CACHE. Let me make the change and see if it persists.
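For illustration, one way to make the checkpoint persist is to point the Hugging Face cache at the persistent /blob mount (a sketch only; the exact path, and whether the runner already sets this by default, are assumptions):

```bash
# Keep Hugging Face downloads on the runner's persistent /blob volume so that
# model checkpoints survive across workflow runs. The path is an assumption.
export TRANSFORMERS_CACHE=/blob/transformers_cache
mkdir -p "$TRANSFORMERS_CACHE"
df -h /blob   # quick sanity check that the persistent volume is mounted
```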

delock avatar Jan 29 '24 05:01 delock

Hi @loadams, can you help start the workflow? The model checkpoint path has been moved to the persistent storage as suggested.

delock avatar Jan 31 '24 03:01 delock

> Hi @loadams, can you help start the workflow? The model checkpoint path has been moved to the persistent storage as suggested.

Apologies, I was out but it should be running now.

loadams avatar Feb 05 '24 18:02 loadams

> Hi @loadams, can you help start the workflow? The model checkpoint path has been moved to the persistent storage as suggested.
>
> Apologies, I was out but it should be running now.

Thanks! The failure in the workflow should be due to a version mismatch between PyTorch (2.2.0) and Intel Extension for PyTorch (2.1). The recent failure in the cpu-inference workflow should also be caused by the same issue. An upcoming release of Intel Extension for PyTorch should fix it. Let me ping you when the new version is released.

delock avatar Feb 06 '24 06:02 delock

@loadams Intel Extension for PyTorch 2.2 was released today. Restarting the workflow should resolve the failure. https://pypi.org/project/intel-extension-for-pytorch/
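A sketch of the corresponding environment fix, keeping both packages on the same release line (the exact pins are illustrative):

```bash
# PyTorch and Intel Extension for PyTorch must come from matching release lines;
# mixing a 2.2.x torch with a 2.1.x extension triggers the failure seen above.
pip install torch==2.2.0 --index-url https://download.pytorch.org/whl/cpu
pip install intel-extension-for-pytorch==2.2.0
```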

> Hi @loadams, can you help start the workflow? The model checkpoint path has been moved to the persistent storage as suggested.
>
> Apologies, I was out but it should be running now.
>
> Thanks! The failure in the workflow should be due to a version mismatch between PyTorch (2.2.0) and Intel Extension for PyTorch (2.1). The recent failure in the cpu-inference workflow should also be caused by the same issue. An upcoming release of Intel Extension for PyTorch should fix it. Let me ping you when the new version is released.

delock avatar Feb 06 '24 14:02 delock

The latest error is caused by a change to the command line of hf_compare.py in DeepSpeedExamples. The latest commit adapts to this change.

delock avatar Feb 07 '24 06:02 delock

Hi @loadams, can you help start the workflow for this PR? Thanks!

delock avatar Feb 15 '24 13:02 delock

@loadams thanks for starting the workflow. This workflow is now ready for testing AutoTP on various popular models. Should we continue adding new models in this PR? I plan to add models for validation one by one so we have a steadily growing list.

delock avatar Feb 19 '24 02:02 delock

Hi @loadams, I have added the GPT-J and Baichuan-7B models to the AutoTP workflow. Can you help start the workflow? Thanks!

delock avatar Feb 22 '24 05:02 delock

> Hi @loadams, I have added the GPT-J and Baichuan-7B models to the AutoTP workflow. Can you help start the workflow? Thanks!

Done.

> This workflow is now ready for testing AutoTP on various popular models. Should we continue adding new models in this PR? I plan to add models for validation one by one so we have a steadily growing list.

For this, I think the concern that @mrwyattii and I had was: is there any way to include this in the existing cpu-inference workflow, since the setup is similar? Perhaps add a step that only runs if the build is a scheduled build (non-PR weekly job) and that opens an issue if this workflow fails?

loadams avatar Feb 22 '24 16:02 loadams

> Hi @loadams, I have added the GPT-J and Baichuan-7B models to the AutoTP workflow. Can you help start the workflow? Thanks!
>
> Done.
>
> This workflow is now ready for testing AutoTP on various popular models. Should we continue adding new models in this PR? I plan to add models for validation one by one so we have a steadily growing list.
>
> For this, I think the concern that @mrwyattii and I had was: is there any way to include this in the existing cpu-inference workflow, since the setup is similar? Perhaps add a step that only runs if the build is a scheduled build (non-PR weekly job) and that opens an issue if this workflow fails?

I think it's doable. We can test these steps in the cpu-inference workflow and then guard them with the schedule event.
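A rough sketch of such a guard inside a step (names are placeholders; the same check is often written as an `if:` condition on the step, using the event name GitHub Actions exposes):

```bash
# Run the AutoTP model tests only for the weekly scheduled build, not for PRs.
# GITHUB_EVENT_NAME is provided by GitHub Actions for every run.
if [ "$GITHUB_EVENT_NAME" = "schedule" ]; then
    bash run_autotp_tests.sh   # placeholder for the AutoTP test steps
else
    echo "Not a scheduled run; skipping AutoTP model tests."
fi
```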

delock avatar Feb 23 '24 02:02 delock

> Hi @loadams, I have added the GPT-J and Baichuan-7B models to the AutoTP workflow. Can you help start the workflow? Thanks!
>
> Done.
>
> This workflow is now ready for testing AutoTP on various popular models. Should we continue adding new models in this PR? I plan to add models for validation one by one so we have a steadily growing list.
>
> For this, I think the concern that @mrwyattii and I had was: is there any way to include this in the existing cpu-inference workflow, since the setup is similar? Perhaps add a step that only runs if the build is a scheduled build (non-PR weekly job) and that opens an issue if this workflow fails?
>
> I think it's doable. We can test these steps in the cpu-inference workflow and then guard them with the schedule event.

Sounds good, thanks! Just tag me when you need the workflows started.

loadams avatar Feb 23 '24 16:02 loadams

@loadams I have moved the AutoTP workflow to the end of the cpu-inference workflow. Can you help start the workflow? Thanks!

I didn't find the reason for the GPT-J failure. This model is special because it has a single big FP32 checkpoint (24GB); probably this added too much memory pressure. I replaced it with Falcon-7B, which has much smaller checkpoint shards. Let's see whether this will pass. I also added a probe for /blob to understand the capacity of the checkpoint cache.
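The probe itself can be as simple as reporting free space and per-directory usage on the cache volume (a sketch; the directory layout under /blob is an assumption):

```bash
# Report capacity of the persistent volume and the size of cached checkpoints.
df -h /blob
du -sh /blob/* 2>/dev/null | sort -h
```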

delock avatar Feb 26 '24 03:02 delock

@loadams The Falcon-7B model is not supported by DeepSpeed AutoTP yet. I updated the workflow to test Baichuan-7B instead. Can you help restart the workflow? Thanks!

delock avatar Feb 28 '24 03:02 delock

Hi @loadams, the command line for the Baichuan model has been changed to fix the test error. The reason is that the Baichuan model contains remote code, so trust_remote_code needs to be set to true. Can you help restart the workflow? Thanks!
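For context, the change boils down to letting Transformers execute the model's custom modeling code when loading the checkpoint; a hypothetical invocation might look like the following (script name, model id, and flag spelling are assumptions about the test script, not its actual interface):

```bash
# Hypothetical command: --trust_remote_code here stands for whatever option the test
# script uses to pass trust_remote_code=True down to from_pretrained(); without it,
# Transformers refuses to load models that ship custom (remote) code, such as Baichuan.
deepspeed --bind_cores_to_rank inference_test.py \
    --model baichuan-inc/Baichuan-7B \
    --trust_remote_code
```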

delock avatar Mar 13 '24 14:03 delock

Hi @loadams @tjruwase, can you help start this workflow? Thanks!

delock avatar Mar 16 '24 02:03 delock

Hi @loadams, I see the environment issue should now be fixed. Can you help restart the workflow? Thanks!

delock avatar Mar 28 '24 03:03 delock

> Hi @loadams, I see the environment issue should now be fixed. Can you help restart the workflow? Thanks!

@delock - yes, apologies that took so long.

loadams avatar Mar 28 '24 16:03 loadams

@loadams I ran these two tests in my local environment and they didn't take that long. Can you help run this workflow again to see whether it is reproducible? Thanks!

delock avatar Apr 01 '24 06:04 delock

> @loadams I ran these two tests in my local environment and they didn't take that long. Can you help run this workflow again to see whether it is reproducible? Thanks!

Re-running now

loadams avatar Apr 01 '24 16:04 loadams

Hi @loadams, I tried running these UTs in my environment and didn't see this timeout. Since the CPU UTs are already covered by the cpu-torch-latest workflow, I removed the unit tests from this workflow to focus on the AutoTP test only. I also removed the dependency on oneCCL and use stock PyTorch to better focus on AutoTP functionality. Can you help start the workflow? Thanks!

delock avatar Apr 08 '24 03:04 delock

> Hi @loadams, I tried running these UTs in my environment and didn't see this timeout. Since the CPU UTs are already covered by the cpu-torch-latest workflow, I removed the unit tests from this workflow to focus on the AutoTP test only. I also removed the dependency on oneCCL and use stock PyTorch to better focus on AutoTP functionality. Can you help start the workflow? Thanks!

Done

loadams avatar Apr 08 '24 15:04 loadams