
Workflow for AutoTP

Open delock opened this issue 1 year ago • 34 comments

This PR adds a new, extendable workflow for automatic tensor parallelism (https://www.deepspeed.ai/tutorials/automatic-tensor-parallelism/). The workflow aims to provide a way to validate AutoTP for LLMs.

delock avatar Jan 16 '24 10:01 delock

The specific error below occurs because the container was not created with the CAP_SYS_NICE capability. I'll check the additional flags I use for the container and post them here.

set_mempolicy: Operation not permitted
setting membind: Operation not permitted

delock avatar Jan 17 '24 03:01 delock

On my system, the Docker container needs to be started with the SYS_NICE capability using the following flag.

  --cap-add SYS_NICE

I'm not sure how to enable this for the DeepSpeed runner.

Without this capability, we have to remove the --bind_cores_to_rank flag, but this would significantly slow down the test. @mrwyattii, what are your thoughts on this? We could remove --bind_cores_to_rank to let the workflow run first, then work on how to enable the SYS_NICE capability. Would that work?
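For reference, a minimal sketch of what this looks like end to end (the image and script names are placeholders, not the actual CI setup):

```bash
# Start the container with CAP_SYS_NICE so set_mempolicy/membind calls are permitted.
docker run --cap-add SYS_NICE -it deepspeed-ci-image bash

# Inside the container, core/memory binding can then be enabled on the launcher.
deepspeed --bind_cores_to_rank inference_test.py   # inference_test.py is a placeholder
```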

delock avatar Jan 17 '24 12:01 delock

The proper behavior of DeepSpeed's --bind_cores_to_rank is to bind memory to a NUMA node only if the system allows it. This makes DeepSpeed behave more gracefully in a Docker environment. The latest fix in DeepSpeed has been verified on my own runner, both with and without the SYS_NICE capability. https://github.com/delock/DeepSpeedSYCLSupport/actions/runs/7581455004/job/20649083143 https://github.com/delock/DeepSpeedSYCLSupport/actions/runs/7581918228/job/20650446510
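A shell-level sketch of the same fallback idea, assuming numactl is installed (this illustrates the behavior, it is not the launcher's actual code):

```bash
# Probe whether memory binding is allowed; inside a container without CAP_SYS_NICE
# this fails with "set_mempolicy: Operation not permitted".
if numactl --membind=0 true 2>/dev/null; then
    BIND_ARGS="--bind_cores_to_rank"
else
    echo "Memory binding not permitted; continuing without it"
    BIND_ARGS=""
fi
deepspeed $BIND_ARGS inference_test.py   # inference_test.py is a placeholder
```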

delock avatar Jan 19 '24 09:01 delock

Hi @loadams, the blocking issue for this PR has been resolved. Can you help restart the workflow? Thanks!

delock avatar Jan 22 '24 02:01 delock

@tjruwase Thanks! The AutoTP workflow currently passes. One thing I'm not sure about is whether the downloaded checkpoint will be preserved across runs; downloading it is the most time-consuming part of this workflow. I'll need some guidance (i.e. which directory on the runner persists?) or will observe another run to see whether the checkpoint is preserved.

delock avatar Jan 22 '24 07:01 delock

@delock, it is great to see the CI now passing.

I think @mrwyattii or @loadams would be the best to answer questions about the checkpoint.

tjruwase avatar Jan 22 '24 17:01 tjruwase

@mrwyattii @loadams it would be great if there is any link showing how persistence is handled on this runner.

> @delock, it is great to see the CI now passing.
>
> I think @mrwyattii or @loadams would be the best to answer questions about the checkpoint.

delock avatar Jan 24 '24 01:01 delock

> @mrwyattii @loadams it would be great if there is any link showing how persistence is handled on this runner.
>
> @delock, it is great to see the CI now passing. I think @mrwyattii or @loadams would be the best to answer questions about the checkpoint.

I know @mrwyattii and I still need to leave feedback on this PR, but there's an example of how things are stored on the blob storage here. I'm not sure it's the best example, but it shows persisting a larger download/install.

loadams avatar Jan 24 '24 16:01 loadams

> @mrwyattii @loadams it would be great if there is any link showing how persistence is handled on this runner.
>
> @delock, it is great to see the CI now passing. I think @mrwyattii or @loadams would be the best to answer questions about the checkpoint.
>
> I know @mrwyattii and I still need to leave feedback on this PR, but there's an example of how things are stored on the blob storage here. I'm not sure it's the best example, but it shows persisting a larger download/install.

Thanks for the suggestion @loadams. By looking at the usage of '/blob' in the DeepSpeed workflows, I found I need to use the default value of TRANSFORMERS_CACHE. Let me make the change and see if it persists.
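For illustration, one way to make the checkpoint persist is to point the Hugging Face cache at the persistent /blob mount (a sketch only; the exact path, and whether the runner already sets this by default, are assumptions):

```bash
# Keep Hugging Face downloads on the runner's persistent /blob volume so that
# model checkpoints survive across workflow runs. The path is an assumption.
export TRANSFORMERS_CACHE=/blob/transformers_cache
mkdir -p "$TRANSFORMERS_CACHE"
df -h /blob   # quick sanity check that the persistent volume is mounted
```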

delock avatar Jan 29 '24 05:01 delock

Hi @loadams, can you help start the workflow? The model checkpoint path has been moved to the persistent storage as suggested.

delock avatar Jan 31 '24 03:01 delock

> Hi @loadams, can you help start the workflow? The model checkpoint path has been moved to the persistent storage as suggested.

Apologies, I was out but it should be running now.

loadams avatar Feb 05 '24 18:02 loadams

> Hi @loadams, can you help start the workflow? The model checkpoint path has been moved to the persistent storage as suggested.
>
> Apologies, I was out but it should be running now.

Thanks! The failure in the workflow should be due to a version mismatch between PyTorch (2.2.0) and Intel Extension for PyTorch (2.1). The recent failure in the cpu-inference workflow should also be caused by the same issue. An upcoming release of Intel Extension for PyTorch should fix it. Let me ping you when the new version is released.

delock avatar Feb 06 '24 06:02 delock

@loadams Intel Extension for PyTorch 2.2 was released today. Restarting the workflow should resolve the failure. https://pypi.org/project/intel-extension-for-pytorch/
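A sketch of the corresponding environment fix, keeping both packages on the same release line (the exact pins are illustrative):

```bash
# PyTorch and Intel Extension for PyTorch must come from matching release lines;
# mixing a 2.2.x torch with a 2.1.x extension triggers the failure seen above.
pip install torch==2.2.0 --index-url https://download.pytorch.org/whl/cpu
pip install intel-extension-for-pytorch==2.2.0
```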

> Hi @loadams, can you help start the workflow? The model checkpoint path has been moved to the persistent storage as suggested.
>
> Apologies, I was out but it should be running now.
>
> Thanks! The failure in the workflow should be due to a version mismatch between PyTorch (2.2.0) and Intel Extension for PyTorch (2.1). The recent failure in the cpu-inference workflow should also be caused by the same issue. An upcoming release of Intel Extension for PyTorch should fix it. Let me ping you when the new version is released.

delock avatar Feb 06 '24 14:02 delock

The latest error is caused by a change to the command line of hf_compare.py in DeepSpeedExamples. The latest commit adapts to this change.

delock avatar Feb 07 '24 06:02 delock

Hi @loadams, can you help start the workflow for this PR? Thanks!

delock avatar Feb 15 '24 13:02 delock

@loadams thanks for starting the workflow. This workflow is now ready for testing AutoTP on various popular models. Should we continue adding new models in this PR? I plan to add models for validation one by one so we have a steadily growing list.

delock avatar Feb 19 '24 02:02 delock

Hi @loadams, I have added the GPT-J and Baichuan-7B models to the AutoTP workflow. Can you help start the workflow? Thanks!

delock avatar Feb 22 '24 05:02 delock

> Hi @loadams, I have added the GPT-J and Baichuan-7B models to the AutoTP workflow. Can you help start the workflow? Thanks!

Done.

> This workflow is now ready for testing AutoTP on various popular models. Should we continue adding new models in this PR? I plan to add models for validation one by one so we have a steadily growing list.

For this, I think the concern that @mrwyattii and I had was: is there any way to include this in the existing cpu-inference workflow, since the setup is similar? Perhaps add a step that only runs if the build is a scheduled build (non-PR weekly job) and that opens an issue if this workflow fails?

loadams avatar Feb 22 '24 16:02 loadams

> Hi @loadams, I have added the GPT-J and Baichuan-7B models to the AutoTP workflow. Can you help start the workflow? Thanks!
>
> Done.
>
> This workflow is now ready for testing AutoTP on various popular models. Should we continue adding new models in this PR? I plan to add models for validation one by one so we have a steadily growing list.
>
> For this, I think the concern that @mrwyattii and I had was: is there any way to include this in the existing cpu-inference workflow, since the setup is similar? Perhaps add a step that only runs if the build is a scheduled build (non-PR weekly job) and that opens an issue if this workflow fails?

I think it's doable. We can test these steps in the cpu-inference workflow and then guard them with the schedule event.
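A rough sketch of such a guard inside a step (names are placeholders; the same check is often written as an `if:` condition on the step, using the event name GitHub Actions exposes):

```bash
# Run the AutoTP model tests only for the weekly scheduled build, not for PRs.
# GITHUB_EVENT_NAME is provided by GitHub Actions for every run.
if [ "$GITHUB_EVENT_NAME" = "schedule" ]; then
    bash run_autotp_tests.sh   # placeholder for the AutoTP test steps
else
    echo "Not a scheduled run; skipping AutoTP model tests."
fi
```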

delock avatar Feb 23 '24 02:02 delock

> Hi @loadams, I have added the GPT-J and Baichuan-7B models to the AutoTP workflow. Can you help start the workflow? Thanks!
>
> Done.
>
> This workflow is now ready for testing AutoTP on various popular models. Should we continue adding new models in this PR? I plan to add models for validation one by one so we have a steadily growing list.
>
> For this, I think the concern that @mrwyattii and I had was: is there any way to include this in the existing cpu-inference workflow, since the setup is similar? Perhaps add a step that only runs if the build is a scheduled build (non-PR weekly job) and that opens an issue if this workflow fails?
>
> I think it's doable. We can test these steps in the cpu-inference workflow and then guard them with the schedule event.

Sounds good, thanks! Just tag me when you need the workflows started.

loadams avatar Feb 23 '24 16:02 loadams

@loadams I have moved the AutoTP workflow to the end of the cpu-inference workflow. Can you help start the workflow? Thanks!

I didn't find the reason for the GPT-J failure. This model is special because it has a single big FP32 checkpoint (24GB); probably this added too much memory pressure. I replaced it with Falcon-7B, which has much smaller checkpoint shards. Let's see whether this will pass. I also added a probe for /blob to understand the capacity of the checkpoint cache.
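The probe itself can be as simple as reporting free space and per-directory usage on the cache volume (a sketch; the directory layout under /blob is an assumption):

```bash
# Report capacity of the persistent volume and the size of cached checkpoints.
df -h /blob
du -sh /blob/* 2>/dev/null | sort -h
```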

delock avatar Feb 26 '24 03:02 delock

@loadams The Falcon-7B model is not supported by DeepSpeed AutoTP yet. I updated the workflow to test Baichuan-7B instead. Can you help restart the workflow? Thanks!

delock avatar Feb 28 '24 03:02 delock

Hi @loadams, the command line for the Baichuan model has been changed to fix the test error. The reason is that the Baichuan model contains remote code, so trust_remote_code needs to be set to true. Can you help restart the workflow? Thanks!
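For context, the change boils down to letting Transformers execute the model's custom modeling code when loading the checkpoint; a hypothetical invocation might look like the following (script name, model id, and flag spelling are assumptions about the test script, not its actual interface):

```bash
# Hypothetical command: --trust_remote_code here stands for whatever option the test
# script uses to pass trust_remote_code=True down to from_pretrained(); without it,
# Transformers refuses to load models that ship custom (remote) code, such as Baichuan.
deepspeed --bind_cores_to_rank inference_test.py \
    --model baichuan-inc/Baichuan-7B \
    --trust_remote_code
```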

delock avatar Mar 13 '24 14:03 delock

Hi @loadams @tjruwase, can you help start this workflow? Thanks!

delock avatar Mar 16 '24 02:03 delock

Hi @loadams, I see the environment issue should now be fixed. Can you help restart the workflow? Thanks!

delock avatar Mar 28 '24 03:03 delock

> Hi @loadams, I see the environment issue should now be fixed. Can you help restart the workflow? Thanks!

@delock - yes, apologies that took so long.

loadams avatar Mar 28 '24 16:03 loadams

@loadams I ran these two tests in my local environment and they didn't take that long. Can you help run this workflow again to see whether it is reproducible? Thanks!

delock avatar Apr 01 '24 06:04 delock

> @loadams I ran these two tests in my local environment and they didn't take that long. Can you help run this workflow again to see whether it is reproducible? Thanks!

Re-running now

loadams avatar Apr 01 '24 16:04 loadams

Hi @loadams, I tried running these UTs in my environment and didn't see this timeout. Since the CPU UTs are already covered by the cpu-torch-latest workflow, I removed the unit tests from this workflow to focus on the AutoTP test only. I also removed the dependency on oneCCL and use stock PyTorch to better focus on AutoTP functionality. Can you help start the workflow? Thanks!

delock avatar Apr 08 '24 03:04 delock

> Hi @loadams, I tried running these UTs in my environment and didn't see this timeout. Since the CPU UTs are already covered by the cpu-torch-latest workflow, I removed the unit tests from this workflow to focus on the AutoTP test only. I also removed the dependency on oneCCL and use stock PyTorch to better focus on AutoTP functionality. Can you help start the workflow? Thanks!

Done

loadams avatar Apr 08 '24 15:04 loadams