[Bug fix] Fix cpu inference UT failure

Open delock opened this issue 2 years ago • 43 comments

This PR fixes the UT errors described in the PR and test job linked below. It skips TestModelTask if the dtype is not supported by the accelerator, or if InferenceBuilder is not implemented for the accelerator. https://github.com/microsoft/DeepSpeed/pull/4419 https://github.com/microsoft/DeepSpeed/actions/runs/6341645987/job/17235544538
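A minimal sketch of the kind of dtype/op guard described above, assuming DeepSpeed's get_accelerator() abstraction; the helper name and the exact checks are illustrative, not the PR's actual code:

import pytest
import torch
from deepspeed.accelerator import get_accelerator

def skip_if_unsupported(dtype):
    # Skip when the accelerator cannot run the requested dtype (e.g. fp16 on CPU).
    if dtype == torch.float16 and not get_accelerator().is_fp16_supported():
        pytest.skip(f"Accelerator {get_accelerator().device_name()} does not support torch.float16.")
    # Skip when the fused-kernel InferenceBuilder cannot be loaded for this accelerator
    # (e.g. the CPU backend); this import guard is an illustrative stand-in for the real check.
    try:
        from deepspeed.ops.op_builder import InferenceBuilder  # noqa: F401
    except (ImportError, AttributeError):
        pytest.skip("InferenceBuilder is not implemented for this accelerator.")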

delock avatar Oct 01 '23 06:10 delock

FYI @delock - this looks like it only runs the same 4 tests that it did before? Just making sure that's intentional. Also, because of the new placement of the skips rather than skipping the whole file, the tests run much slower than before. We will look at fixing that, but wanted to check whether the total number of tests should be higher.

loadams avatar Oct 02 '23 18:10 loadams

Hi @loadams , currently we are seeking higher test coverage in the area of AutoTP. @Yejing-Lai is investigating whether more model coverage is possible. On the other hand, some tests are skipped because InferenceBuilder is not implemented on CPU. We will check whether any test is skipped for a different reason.

delock avatar Oct 03 '23 03:10 delock

That makes sense, thanks. Given that this increases the runtime of the cpu_inference tests drastically without increasing coverage, do you think it would make sense to either revert it or add the skip back to the top of the file, which seemed to run quicker? Then on subsequent PRs we can enable more tests and monitor the overall runtime.

loadams avatar Oct 03 '23 16:10 loadams

FYI @delock, we disabled the cpu_inference test, but set it up so we can run it on demand for PRs that we think will impact things while we fix the test time issue.

loadams avatar Oct 03 '23 18:10 loadams

Also, #4439 had a way to try to speed up the tests by not setting everything up before skipping, but that wasn't quicker either.

loadams avatar Oct 03 '23 18:10 loadams

I tried running pytest -m 'seq_inference' unit/ locally and see 3 passed tests with this branch. Yet the same branch gets these three tests skipped in the CI environment. This will need some further investigation of the CI environment.

unit/inference/test_inference.py::TestInjectionPolicy::test[fp32-t5] PASSED [ 66%]
unit/inference/test_inference.py::TestInjectionPolicy::test[fp32-roberta] PASSED [ 73%]
unit/inference/test_inference.py::TestAutoTensorParallelism::test[fp16-marian] SKIPPED (Acceleraor cpu does not support torch.float16.) [ 80%]
unit/inference/test_inference.py::TestAutoTensorParallelism::test[fp16-codegen] SKIPPED (Acceleraor cpu does not support torch.float16.) [ 86%]
unit/inference/test_inference.py::TestAutoTensorParallelism::test[bf16-marian] PASSED

delock avatar Oct 07 '23 02:10 delock

The slow test runs are caused by this line. It loads the model list from huggingface_hub, and doing that for every test is slow. A proper fix should make the list persistent for the whole test session. One way is to save it as a pickle in the first test and load the pickle in subsequent tests. Will test and update this PR. https://github.com/microsoft/DeepSpeed/blob/master/tests/unit/inference/test_inference.py#L68
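A minimal sketch of that caching idea (not the exact fix in this PR; the cache file name and helper are illustrative assumptions):

import os
import pickle
from huggingface_hub import HfApi

_HF_MODEL_CACHE = "hf_model_list.pkl"  # hypothetical per-session cache file

def get_hf_model_list():
    # Only the first test pays the huggingface_hub network cost;
    # subsequent tests in the session load the pickled list instead.
    if os.path.exists(_HF_MODEL_CACHE):
        with open(_HF_MODEL_CACHE, "rb") as f:
            return pickle.load(f)
    model_list = list(HfApi().list_models())
    with open(_HF_MODEL_CACHE, "wb") as f:
        pickle.dump(model_list, f)
    return model_list

A session-scoped pytest fixture caching the list in memory would achieve the same effect.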

delock avatar Oct 07 '23 03:10 delock

@loadams the slow tests have been fixed with the latest commit. Now the tests run in around 13 minutes in my local environment.

The strange thing is that in my local environment there are three tests that are not skipped, and the skip messages are displayed.

In the GitHub CI environment, however, these three tests are skipped and the skip messages are not printed in the test report. I made several attempts to turn on skip message printing, but it seems pytest in the CI environment is not affected by the pytest command in the workflow. Do you have any idea how I can turn on skip message printing in the GitHub CI env?

This is my local env.

unit/inference/test_inference.py::TestMPSize::test[fp32-gpt-neo] SKIPPED (This op had not been implemented on this...) [  6%]
unit/inference/test_inference.py::TestMPSize::test[fp32-gpt-neox] SKIPPED (Skipping gpt-neox-20b for now)              [ 13%]
unit/inference/test_inference.py::TestMPSize::test[fp32-bloom] SKIPPED (Bloom models only support half precision, ...) [ 20%]
unit/inference/test_inference.py::TestMPSize::test[fp32-gpt-j] SKIPPED (Test is currently broken)                      [ 26%]
unit/inference/test_inference.py::TestMPSize::test[fp16-gpt-neo] SKIPPED (This op had not been implemented on this...) [ 33%]
unit/inference/test_inference.py::TestMPSize::test[fp16-gpt-neox] SKIPPED (Skipping gpt-neox-20b for now)              [ 40%]
unit/inference/test_inference.py::TestMPSize::test[fp16-bloom] SKIPPED (This op had not been implemented on this s...) [ 46%]
unit/inference/test_inference.py::TestMPSize::test[fp16-gpt-j] SKIPPED (Test is currently broken)                      [ 53%]
unit/inference/test_inference.py::TestAutoTP::test[falcon] SKIPPED (Not enough GPU memory for this on V100 runners)    [ 60%]
unit/inference/test_inference.py::TestInjectionPolicy::test[fp32-t5] PASSED                                            [ 66%]
unit/inference/test_inference.py::TestInjectionPolicy::test[fp32-roberta] PASSED                                       [ 73%]
unit/inference/test_inference.py::TestAutoTensorParallelism::test[fp16-marian] SKIPPED (Acceleraor cpu does not su...) [ 80%]
unit/inference/test_inference.py::TestAutoTensorParallelism::test[fp16-codegen] SKIPPED (Acceleraor cpu does not s...) [ 86%]
unit/inference/test_inference.py::TestAutoTensorParallelism::test[bf16-marian] PASSED                                  [ 93%]
unit/inference/test_inference.py::TestAutoTensorParallelism::test[bf16-codegen] SKIPPED (Codegen model(bf16) need ...) [100%]

This is the GitHub CI printout; all tests are skipped and no skip messages are printed.

unit/inference/test_inference.py::TestAutoTensorParallelism::test[bf16-marian] SKIPPED
unit/inference/test_inference.py::TestAutoTensorParallelism::test[fp16-marian] SKIPPED
unit/inference/test_inference.py::TestAutoTensorParallelism::test[fp16-codegen] SKIPPED
unit/inference/test_inference.py::TestAutoTensorParallelism::test[bf16-codegen] SKIPPED
unit/inference/test_inference.py::TestInjectionPolicy::test[fp32-roberta] SKIPPED
unit/inference/test_inference.py::TestInjectionPolicy::test[fp32-t5] SKIPPED
unit/inference/test_inference.py::TestMPSize::test[fp16-bloom] SKIPPED
unit/inference/test_inference.py::TestMPSize::test[fp32-bloom] SKIPPED
unit/inference/test_inference.py::TestMPSize::test[fp16-gpt-j] SKIPPED
unit/inference/test_inference.py::TestMPSize::test[fp32-gpt-neox] SKIPPED
unit/inference/test_inference.py::TestMPSize::test[fp32-gpt-neo] SKIPPED
unit/inference/test_inference.py::TestMPSize::test[fp32-gpt-j] SKIPPED
unit/inference/test_inference.py::TestMPSize::test[fp16-gpt-neo] SKIPPED
unit/inference/test_inference.py::TestMPSize::test[fp16-gpt-neox] SKIPPED
unit/inference/test_inference.py::TestAutoTP::test[falcon] SKIPPED (...)

delock avatar Oct 07 '23 07:10 delock

The skip messages are truncated because of the 80-column limit when pytest runs. We can set the 'COLUMNS' env variable to 140 to see the skip messages. Another issue is that PyTorch was recently upgraded to 2.1.0, which caused an Intel Extension for PyTorch load error and subsequently caused all tests to be skipped. Will discuss with the Intel Extension for PyTorch team when they will release a wheel for PyTorch 2.1.0.
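For illustration, pytest sizes its report lines from the terminal width reported by the standard library, which checks the COLUMNS environment variable first; a small sketch of that behavior (not DeepSpeed code):

import os
import shutil

# shutil.get_terminal_size() consults the COLUMNS environment variable before
# querying the terminal, so widening COLUMNS widens pytest's report lines and
# the otherwise-truncated skip reasons.
os.environ["COLUMNS"] = "140"
print(shutil.get_terminal_size().columns)  # prints 140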

delock avatar Oct 08 '23 11:10 delock

Intel Extension for PyTorch 2.1 is released. Will update this PR to change the workflow to PyTorch 2.1 accordingly.

delock avatar Oct 20 '23 02:10 delock

Hi @loadams @tjruwase, we are adding the tensor parallel UT back into the CPU inference workflow. One issue we met is that the GitHub instance "ubuntu-20.04" has only one CPU with two cores. We need a stronger system if we want to run the AutoTP tests stably. We will use one of the self-hosted V100 runners to see whether it fits.

delock avatar Oct 25 '23 02:10 delock

Note there is currently a CPU test failure due to an incompatibility between https://github.com/microsoft/DeepSpeed/commit/4fc181b01077521ba42379013ce91a1c294e5d8e and PyTorch 2.1. We are working on a fix. Will get back to you when a PR is ready.

delock avatar Oct 26 '23 06:10 delock

This failure should be fixed by https://github.com/microsoft/DeepSpeed/pull/4578. We should merge #4578 into this PR so they can be tested together on the CPU host.

delock avatar Oct 30 '23 02:10 delock

#4578 is merged and CPU workflow ad-hoc running is turned off. @loadams can you help start the workflow to see whether cpu_inference workflow passes? Thanks!

delock avatar Oct 30 '23 05:10 delock

@delock - I tried cherrypicking these changes into a branch of mine so that I could run the workflow dispatch. It was run here, but I'm seeing errors still: https://github.com/microsoft/DeepSpeed/actions/runs/6709501151/job/18232686468

I was able to "resolve" the merge conflict and enable this back in your PR, so it should be running below and we can test just by pushing to this PR.

loadams avatar Oct 31 '23 18:10 loadams

Hi @loadams , thanks for the help on the workflow. From this line it seems gcc on the self-hosted CPU node is gcc-7, which has compatibility problems with the AVX intrinsics in the ccl.cpp kernels: https://github.com/microsoft/DeepSpeed/actions/runs/6709501151/job/18232686468#step:11:147 I can also see the same from this line; gcc is 7.5.0: https://github.com/microsoft/DeepSpeed/actions/runs/6710829047/job/18236865881?pr=4430#step:7:33

On a previous GitHub-hosted CPU system, the comm ops built successfully. The gcc version there is 9.4.0: https://github.com/delock/DeepSpeedSYCLSupport/actions/runs/6634978766/job/18025197269#step:7:33

So I think that to resolve this issue, gcc needs to be switched to 9.4.0 or above. What are the steps to switch the gcc version on the self-hosted CPU node?

delock avatar Nov 01 '23 13:11 delock

Hi @delock - thanks for the clarification. We don't have an easy way to update it; I will look into getting a newer image running on this, since the fact that it is running Ubuntu 18.04 is part of the problem here.

loadams avatar Nov 01 '23 15:11 loadams

Hi @loadams does it make sense to install gcc-9 in the workflow steps, following the link below?

https://askubuntu.com/questions/1140183/install-gcc-9-on-ubuntu-18-04

delock avatar Nov 02 '23 07:11 delock

That's a fine option to try for now. I'm not sure if that will work in these pipelines, and I've had some trouble with it manually, but it is worth a shot here since it might unblock us while we try to get newer images on these VMs.

loadams avatar Nov 02 '23 15:11 loadams

@loadams I added a section to install gcc-9, let's see if this workaround works.

delock avatar Nov 03 '23 08:11 delock

The install looks to have been successful; however, the following check is still returning 7.5.0 for some reason: https://github.com/microsoft/DeepSpeed/actions/runs/6742349658/job/18341364878?pr=4430

loadams avatar Nov 03 '23 14:11 loadams

Let me find a system with Ubuntu 18 and figure out how this works.

delock avatar Nov 04 '23 14:11 delock

I found an Ubuntu machine and did some experiments; it turns out you have to use the following commands to actually 'switch' to the new compiler:

sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-9 99
sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-9 99

delock avatar Nov 04 '23 16:11 delock

Looks like it's at least a new failure?

loadams avatar Nov 06 '23 23:11 loadams

Yes, I'm checking with the Intel Extension for PyTorch engineers on this new failure.

Looking in links: https://developer.intel.com/ipex-whl-stable-cpu
ERROR: Could not find a version that satisfies the requirement oneccl_bind_pt (from versions: none)
ERROR: No matching distribution found for oneccl_bind_pt
Error: Process completed with exit code 1.

I'm going to push an alternative way to install it, following the suggestion.

delock avatar Nov 07 '23 01:11 delock

@loadams We are now using an alternative way to install oneccl_bind_pt, following @jingxu10's suggestion. Also added a curl line to see if there are any connectivity issues.

delock avatar Nov 07 '23 03:11 delock

@loadams the cpu_inference workflow failed without any specific information; is it because the self-hosted CPU node is not available?

delock avatar Nov 11 '23 11:11 delock

That's what it appears to be - I re-merged the master branch to see if the test would run again, and it doesn't appear to be running yet; I will monitor it and see.

loadams avatar Nov 13 '23 17:11 loadams

It looks like the self-hosted CPU instances are not available for pickup. Should we try using one of the self-hosted GPU instances to validate the workflow first?

delock avatar Nov 15 '23 03:11 delock

@delock - I think this is fixed now, I see it running at least. So please push any other changes, but if we hit this again we can easily switch to testing on the GPU runners too.

loadams avatar Nov 16 '23 00:11 loadams