
Support cpu tensors without direct device invocation

Open abhilash1910 opened this issue 2 years ago • 34 comments

Motivation: fix for reproducible issue #3837 on CPU. On CPU, direct invocation of torch.cpu.tensor leads to a dtype mismatch. Another way would be to have something like ["torch.DoubleTensor" if device_type == 'cpu' else "torch.{}.DoubleTensor".format(device_type)] for all elements in the supported list, but that would eliminate "torch.cpu.DoubleTensor", etc. from the scope. @jeffra requesting review.
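For illustration, that alternative would look roughly like this (a sketch only; device_type is assumed to come from the accelerator abstraction's device_name()):

```python
from deepspeed.accelerator import get_accelerator

device_type = get_accelerator().device_name()  # e.g. 'cpu' or 'cuda'

# CPU tensor type strings carry no device prefix ("torch.DoubleTensor"),
# while accelerator tensor type strings do ("torch.cuda.DoubleTensor").
supported_types = [
    "torch.{}Tensor".format(t) if device_type == 'cpu'
    else "torch.{}.{}Tensor".format(device_type, t)
    for t in ('Half', 'Float', 'Double')
]
```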

CLA is signed

abhilash1910 avatar Jun 29 '23 04:06 abhilash1910

@abhilash1910, thanks for this PR. I think this PR needs some work that leverages PR #3633, for the following reasons.

  1. As you observed, strings like torch.cpu.DoubleTensor are invalid, and the underlying issue is that the existing code is broken by the support for accelerators other than cuda. So, it will be better to remove the code that generates these invalid strings.
  2. #3633 adds an API for querying the dtypes supported by the accelerator, and it seems to me that this entire logic could be rewritten to use that API. I think this API is less brittle since it does not rely on string formatting, and it is easier to maintain since the respective accelerator owners can update their own list of supported dtypes. A sketch follows below.
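For illustration, the dtype-based check in (2) might look roughly like this (a sketch only, assuming the supported_dtypes() accessor that #3633 adds to the accelerator interface):

```python
from deepspeed.accelerator import get_accelerator

# Sketch: query the accelerator for its supported dtypes instead of
# building device-specific type strings.
supported_types = get_accelerator().supported_dtypes()

def check_grad_dtype(t):
    # torch dtypes are canonical and device-independent, so no per-device
    # string formatting is needed for the lookup.
    assert t.dtype in supported_types, \
        f"attempting to reduce an unsupported grad type: {t.dtype}"
```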

Please share your thoughts. Thanks!

tjruwase avatar Jun 30 '23 11:06 tjruwase

@tjruwase yes, I think that would be a proper fix: instead of having separate dtypes, we can directly leverage the abstract accelerator interface. Let me go through the changes for this. Thanks. (Making this a draft for now.) I see the changes from my colleague Yejing; will discuss and work on this.

abhilash1910 avatar Jun 30 '23 16:06 abhilash1910

@abhilash1910, thanks for your alignment. I will push to get #3633 merged asap. I left some comments there.

tjruwase avatar Jun 30 '23 16:06 tjruwase

Yes sure, I will work with my colleague Yejing to make this work.

abhilash1910 avatar Jun 30 '23 16:06 abhilash1910

> @abhilash1910, thanks for this PR. I think this PR needs some work that leverages PR #3633, for the following reasons.
>
>   1. As you observed, strings like torch.cpu.DoubleTensor are invalid, and the underlying issue is that the existing code is broken by the support for accelerators other than cuda. So, it will be better to remove the code that generates these invalid strings.
>   2. [CPU] Skip CPU support unimplemented error #3633 adds an API for querying the dtypes supported by the accelerator, and it seems to me that this entire logic could be rewritten to use that API. I think this API is less brittle since it does not rely on string formatting, and it is easier to maintain since the respective accelerator owners can update their own list of supported dtypes.
>
> Please share your thoughts. Thanks!

@tjruwase I have a question. For the following check, is the check focused on the data type alone, or is data type + device type needed? If only the data type is important, then the proper way is to strip the device type from t.type() and compare against the data type list.
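For example, something like this (a rough sketch, not actual code from the PR):

```python
# Sketch: strip the device part from t.type() before the lookup.
# t.type() returns e.g. 'torch.cuda.DoubleTensor' on cuda but
# 'torch.DoubleTensor' on cpu; the last component is the data type.
def type_name(t):
    return t.type().split('.')[-1]   # 'DoubleTensor' on both devices

supported = {'HalfTensor', 'FloatTensor', 'DoubleTensor'}
# the lookup then becomes: type_name(t) in supported
```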

delock avatar Jul 17 '23 03:07 delock

> @tjruwase I have a question. For the following check, is the check focused on the data type alone, or is data type + device type needed? If only the data type is important, then the proper way is to strip the device type from t.type() and compare against the data type list.

Yes, the focus is on checking the data types supported by the device. My feedback is about avoiding any assumptions on string-format combinations of device and data type. For example, as shown in this list, the same dtype is formatted differently for cpu and cuda tensors. So, I suggested relying on the torch dtype, which is canonical and which has an accelerator API to retrieve it.
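To make the distinction concrete (an illustrative snippet, runnable on any torch build):

```python
import torch

t = torch.zeros(1, dtype=torch.float64)
print(t.type())   # 'torch.DoubleTensor' on cpu
print(t.dtype)    # torch.float64, regardless of device

# On a CUDA device the type string changes but the dtype does not:
#   torch.zeros(1, dtype=torch.float64, device='cuda').type()
#   returns 'torch.cuda.DoubleTensor'; .dtype is still torch.float64
```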

Please let me know your thoughts.

tjruwase avatar Jul 17 '23 20:07 tjruwase

Yes, dtype is better. Some additional changes in _reduce_non_expert_gradients and _reduce_expert_gradients will be needed accordingly.

delock avatar Jul 18 '23 10:07 delock

> Yes, dtype is better. Some additional changes in _reduce_non_expert_gradients and _reduce_expert_gradients will be needed accordingly.

@delock, thanks for the pointer. @abhilash1910, could you please help handle those changes in your PR?

tjruwase avatar Jul 18 '23 16:07 tjruwase

@tjruwase @delock _reduce_non_expert_gradients and _reduce_expert_gradients use the SparseTensor class; would it make sense to replace type with dtype there?

abhilash1910 avatar Jul 19 '23 03:07 abhilash1910

@abhilash1910, you raise an important issue with my proposal that I had overlooked: SparseTensor.type() is a string while torch.dtype is not. My proposal would result in populating supported_types with objects of different types, which creates all kinds of complications.

One idea is to populate the list with [f'{t}' for t in accelerator.supported_dtypes()] to harmonize the object types. But the problem is that while SparseTensor.type() would work for lookups, neither torch.Tensor.type() nor torch.Tensor.dtype would. In other words, we would always need different lookup logic for torch tensors on one hand and deepspeed sparse tensors on the other.

Another potential issue with existing code is that we don't check whether the dtype of the underlying tensor of a SparseTensor is supported by the accelerator. I need to double check this concern with my teammates.

Therefore, I am now wondering whether it would be better to take a new approach that makes this difference explicit and uses isinstance(t, torch.Tensor) and isinstance(t, SparseTensor) appropriately to guard the lookups. I think this will make the code easier to maintain since the logic is more explicit. Also, I think it is more intuitive for the split_half_float_double_sparse() function to return separate buckets for sparse and dense tensors for the callers to handle appropriately.
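A rough sketch of that shape (a hypothetical helper, assuming SparseTensor gains a dtype attribute as discussed below; not the actual implementation):

```python
import torch
from deepspeed.runtime.sparse_tensor import SparseTensor

# Sketch: split tensors into dense and sparse buckets keyed by dtype,
# guarded by explicit isinstance checks so the two cases never mix.
def split_dense_and_sparse(tensors):
    dense_buckets, sparse_buckets = {}, {}
    for t in tensors:
        if isinstance(t, SparseTensor):
            sparse_buckets.setdefault(t.dtype, []).append(t)
        elif isinstance(t, torch.Tensor):
            dense_buckets.setdefault(t.dtype, []).append(t)
        else:
            raise TypeError(f"unexpected tensor type: {type(t)}")
    return dense_buckets, sparse_buckets
```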

These are just some thoughts. Please let me know what you think. Thanks!

tjruwase avatar Jul 20 '23 15:07 tjruwase

> @abhilash1910, you raise an important issue with my proposal that I had overlooked: SparseTensor.type() is a string while torch.dtype is not. My proposal would result in populating supported_types with objects of different types, which creates all kinds of complications.
>
> One idea is to populate the list with [f'{t}' for t in accelerator.supported_dtypes()] to harmonize the object types. But the problem is that while SparseTensor.type() would work for lookups, neither torch.Tensor.type() nor torch.Tensor.dtype would. In other words, we would always need different lookup logic for torch tensors on one hand and deepspeed sparse tensors on the other.
>
> Another potential issue with existing code is that we don't check whether the dtype of the underlying tensor of a SparseTensor is supported by the accelerator. I need to double check this concern with my teammates.
>
> Therefore, I am now wondering whether it would be better to take a new approach that makes this difference explicit and uses isinstance(t, torch.Tensor) and isinstance(t, SparseTensor) appropriately to guard the lookups. I think this will make the code easier to maintain since the logic is more explicit. Also, I think it is more intuitive for the split_half_float_double_sparse() function to return separate buckets for sparse and dense tensors for the callers to handle appropriately.
>
> These are just some thoughts. Please let me know what you think. Thanks!

Yes, my thoughts exactly. I was thinking of adding a dtype getter inside SparseTensor to make it consistent. But the accelerator device-type check while gathering gradients for reduction should remain present. @tjruwase @jeffra, let me make the additions.
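Something along these lines (a hypothetical sketch, assuming the underlying dense values tensor is exposed as self.values):

```python
# Hypothetical sketch of a dtype getter on deepspeed's SparseTensor,
# delegating to the underlying dense values tensor.
class SparseTensor:
    ...

    @property
    def dtype(self):
        return self.values.dtype
```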

abhilash1910 avatar Jul 21 '23 02:07 abhilash1910

@tjruwase @jeffra do _reduce_non_expert_gradients and _reduce_expert_gradients also need dtype changes, since we are adding the attribute? Also, I am not sure why the CLA check is appearing. @microsoft-github-policy-service agree company="Intel"

abhilash1910 avatar Jul 22 '23 11:07 abhilash1910

> @tjruwase @jeffra do _reduce_non_expert_gradients and _reduce_expert_gradients also need dtype changes, since we are adding the attribute? Also, I am not sure why the CLA check is appearing. @microsoft-github-policy-service agree company="Intel"

My proposal is for split_half_float_double_sparse() to return separate buckets for sparse and dense tensors. In that case, _reduce_non_expert_gradients() and _reduce_expert_gradients() would no longer need type tests in order to process the buckets appropriately. Please let me know if clarification is needed on this.

I am not sure what is going on with the CLA.

tjruwase avatar Jul 25 '23 13:07 tjruwase

@tjruwase could you please re-review? Thanks

abhilash1910 avatar Aug 07 '23 18:08 abhilash1910

@tjruwase could you re-trigger CI and re-review? Thanks

abhilash1910 avatar Aug 10 '23 05:08 abhilash1910

@abhilash1910, thanks for persevering to deliver this awesome contribution.

tjruwase avatar Aug 10 '23 20:08 tjruwase

@abhilash1910, can you please use this guide to address the formatting issues? Thanks!

tjruwase avatar Aug 11 '23 13:08 tjruwase

@tjruwase could you help re-run CI? Thanks

abhilash1910 avatar Aug 14 '23 13:08 abhilash1910

@abhilash1910, did you notice this CI failure? [screenshot]

tjruwase avatar Aug 25 '23 13:08 tjruwase

@tjruwase thanks for highlighting; I have now made a fix for this. Let me know your suggestions. Thanks.

abhilash1910 avatar Aug 25 '23 14:08 abhilash1910

@tjruwase could you review once? I think this might resolve the issue; I was not able to pinpoint the exact cause locally, so any pointers would be helpful.

abhilash1910 avatar Sep 03 '23 19:09 abhilash1910

> @tjruwase could you review once? I think this might resolve the issue; I was not able to pinpoint the exact cause locally, so any pointers would be helpful.

Were you able to repro the issue locally?

tjruwase avatar Sep 06 '23 12:09 tjruwase

@tjruwase could you help re-trigger CI? Thanks

abhilash1910 avatar Nov 22 '23 09:11 abhilash1910

Hi @abhilash1910, can you check and fix the following error? https://github.com/microsoft/DeepSpeed/actions/runs/6952457019/job/18941598524?pr=3842#step:8:4568

  File "/tmp/actions-runner/_work/DeepSpeed/DeepSpeed/unit-test-venv/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 126, in split_half_float_double_sparse
    assert t.dtype in supported_types, f"attempting to reduce an unsupported grad type: {t.dtype}"
AttributeError: 'str' object has no attribute 'dtype'
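The trace suggests a plain string reached the dtype lookup. A guard along these lines would surface the real culprit (a sketch, not the actual fix):

```python
import torch

def check_reducible(t, supported_types):
    # Only torch tensors carry .dtype; a stray string (e.g. from
    # SparseTensor.type()) would otherwise raise AttributeError here.
    if not isinstance(t, torch.Tensor):
        raise TypeError(f"unexpected object in grad bucket: {type(t)}")
    assert t.dtype in supported_types, \
        f"attempting to reduce an unsupported grad type: {t.dtype}"
```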

delock avatar Nov 23 '23 03:11 delock

@tjruwase could you retrigger CI (the issue seems to be fixed now)? Thanks.

abhilash1910 avatar Nov 24 '23 19:11 abhilash1910

@tjruwase could you help re-trigger the CI and re-review? Much appreciated.

abhilash1910 avatar Nov 29 '23 10:11 abhilash1910

Hi @abhilash1910, can you clarify whether the current CI failures are related to your PR or are just a test issue? Thanks!

delock avatar Dec 08 '23 00:12 delock

@delock I think it might be a test issue, as I am able to run the CI for the sparse test locally. I changed the code path and still see the same allclose issue. @tjruwase could you suggest any modifications for this? This is strange, as I tested in an isolated env and did not hit the issue.

abhilash1910 avatar Dec 08 '23 04:12 abhilash1910

Hi @abhilash1910, some suggestions:

  1. Provide more details (hw, sw, logs, ...) of your local run, so there might be a hint of the difference.
  2. Try turning the test into a standalone workload (not using pytest) so debugging is possible.
  3. Separate the asserts for dense and sparse grads so you will know which kind of tensor differs; this might narrow down the location of the possible bug. See the sketch below.
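A rough illustration of suggestion 3 (the names and structure are hypothetical, not the actual test):

```python
import torch

# Hypothetical helper: assert dense and sparse gradients separately so a
# failure immediately identifies which kind of tensor diverged.
def compare_grads(dense_grads, sparse_grads, expected_dense, expected_sparse):
    for name, grad in dense_grads.items():
        assert torch.allclose(grad, expected_dense[name]), \
            f"dense grad mismatch: {name}"
    for name, grad in sparse_grads.items():
        assert torch.allclose(grad, expected_sparse[name]), \
            f"sparse grad mismatch: {name}"
```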

delock avatar Dec 08 '23 08:12 delock

Thanks @inkcherry for highlighting the boundary issue; it seems it will pass CI now. @tjruwase could you help retrigger the CI? Thanks

abhilash1910 avatar Dec 14 '23 09:12 abhilash1910