Support CPU tensors without direct device invocation
Motivation: Fix for reproducible issue #3837 on CPU. On CPU, direct invocation of torch.cpu.tensor leads to a dtype mismatch. Another option would be something like ["torch.DoubleTensor" if device_type == 'cpu' else "torch.{}.DoubleTensor".format(device_type)] for all elements in the supported list, but that would eliminate "torch.cpu.DoubleTensor", etc. from the scope. @jeffra requesting review.
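For concreteness, a rough sketch of that alternative (reconstructed from the description above, not taken from the repo; the get_accelerator().device_name() call is the accelerator-abstraction helper assumed here):

```python
# Build device-qualified type strings, falling back to the plain names on CPU,
# because strings like "torch.cpu.DoubleTensor" are not valid torch type names.
from deepspeed.accelerator import get_accelerator

device_type = get_accelerator().device_name()  # e.g. 'cuda' or 'cpu'
base_types = ["HalfTensor", "FloatTensor", "DoubleTensor", "BFloat16Tensor"]
supported_types = [
    f"torch.{t}" if device_type == "cpu" else f"torch.{device_type}.{t}"
    for t in base_types
]
```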
CLA is signed
@abhilash1910, thanks for this PR. I think this PR needs some work that leverages PR #3633 for the following reasons.
- As you observed, strings like torch.cpu.DoubleTensor are invalid, and the underlying issue is that the existing code is broken by the support for accelerators other than cuda. So, it will be better to remove the code that generates these invalid strings.
- #3633 is adding an API for querying the dtypes supported by the accelerator, and it seems to me that this entire logic could be rewritten to use that API (a minimal sketch follows below). I think this API is less brittle since it does not rely on string formatting, and is easier to maintain since the respective accelerator owners can update their list of supported dtypes.
Please share your thoughts. Thanks!
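A minimal sketch of what the dtype-based check could look like, assuming the accelerator interface from #3633 exposes a supported_dtypes() method returning torch.dtype objects (the method name is an assumption based on this discussion):

```python
# Query supported dtypes from the accelerator instead of formatting type strings.
import torch
from deepspeed.accelerator import get_accelerator

supported_dtypes = get_accelerator().supported_dtypes()

def check_grad_dtype(t: torch.Tensor):
    # torch.dtype is device-agnostic, so no device/dtype string formatting is needed
    assert t.dtype in supported_dtypes, \
        f"attempting to reduce an unsupported grad type: {t.dtype}"
```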
@tjruwase yes I think that would be a proper fix , instead of having separate dtypes, we can directly leverage abstract accelerator interface. Let me go through the changes for this . Thanks (Making this draft for now). I see the changes from my colleague Yejing, will discuss and work on this .
@abhilash1910, thanks for your alignment. I will push to get #3633 merged asap. I left some comments on there.
Yes sure, I will work with my colleague Yejing to make this work.
@tjruwase I have a question. For the following check, is the check focused on the data type alone, or is data type + device type needed? If only the data type is important, then the proper way is to strip the device type from t.type() and compare against the data type list.
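A tiny illustration of the "strip the device type" option, purely to make the question concrete (not code from the PR):

```python
# Drop the device segment from torch.Tensor.type() strings so only the data
# type name is compared, e.g. 'torch.cuda.DoubleTensor' -> 'torch.DoubleTensor'.
def strip_device(type_str: str) -> str:
    parts = type_str.split('.')
    return f"{parts[0]}.{parts[-1]}"

assert strip_device('torch.cuda.DoubleTensor') == 'torch.DoubleTensor'
assert strip_device('torch.DoubleTensor') == 'torch.DoubleTensor'
```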
Yes, the focus is on checking the data type supported by the device. My feedback is based on avoiding any assumptions about string-format combinations of device and data type. For example, as shown in this list, the same dtype is formatted differently for cpu and cuda tensors. So, I suggested relying on torch dtype, which is canonical and which has an accelerator API to retrieve.
Please let me know your thoughts.
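As a quick illustration (not code from the PR), the same dtype surfaces under different type() strings depending on the device, while .dtype stays the same:

```python
import torch

t_cpu = torch.ones(1, dtype=torch.float64)
print(t_cpu.type())   # 'torch.DoubleTensor'
print(t_cpu.dtype)    # torch.float64

# On a CUDA device the string gains a device qualifier, but the dtype does not:
#   torch.ones(1, dtype=torch.float64, device='cuda').type()  -> 'torch.cuda.DoubleTensor'
#   torch.ones(1, dtype=torch.float64, device='cuda').dtype   -> torch.float64
```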
Yes, dtype is better. Some additional changes in _reduce_non_expert_gradients and _reduce_expert_gradients will be needed accordingly.
@delock, thanks for the pointer. @abhilash1910, could you please help handle those changes in your PR?
@tjruwase @delock _reduce_non_expert_gradients and _reduce_expert_gradients use the SparseTensor class; would it make sense to replace type with dtype there?
@abhilash1910, you raise an important issue with my proposal that I had overlooked. This is that SparseTensor.type() is a string while torch.dtype is not. My proposal would result in populating supported_types with objects of different types, which creates all kinds of complications.
One idea is to populate the list with [f'{t}' for t in accelerator.supported_dtypes()] to harmonize the object types. But the problem is that while SparseTensor.type() will work for lookups, neither torch.Tensor.type() nor torch.Tensor.dtype will work. In other words, we will always need different lookup logic for torch tensors on one hand, and deepspeed sparse_tensors on the other hand.
Another potential issue with existing code is that we don't check whether the dtype of the underlying tensor of a SparseTensor is supported by the accelerator. I need to double check this concern with my teammates.
Therefore, I am now wondering whether it would be better to take a new approach that makes this difference explicit, and uses isinstance(t, torch.Tensor) and isinstance(t, SparseTensor) appropriately to guard the lookups. I think this will make the code easier to maintain since the logic is more explicit. Also, I think it is more intuitive for split_half_float_double_sparse() to return separate buckets for sparse and dense tensors for the callers to handle appropriately.
These are just some thoughts. Please let me know what you think. Thanks!
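A rough sketch of that proposal, just to make it concrete (the function name, the supported_dtypes() call, and the SparseTensor import path are assumptions for illustration, not the final implementation):

```python
# Guard the lookups with isinstance() and return separate buckets for dense
# and sparse gradients so callers can handle each kind explicitly.
import torch
from deepspeed.runtime.sparse_tensor import SparseTensor
from deepspeed.accelerator import get_accelerator

def split_dense_and_sparse(tensors):
    supported_dtypes = get_accelerator().supported_dtypes()
    dense_buckets = {dtype: [] for dtype in supported_dtypes}
    sparse_bucket = []
    for t in tensors:
        if isinstance(t, SparseTensor):
            sparse_bucket.append(t)
        elif isinstance(t, torch.Tensor):
            assert t.dtype in dense_buckets, \
                f"attempting to reduce an unsupported grad type: {t.dtype}"
            dense_buckets[t.dtype].append(t)
    return dense_buckets, sparse_bucket
```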
Yes, my thoughts exactly. I was thinking of adding a dtype getter inside SparseTensor to make it consistent, but the check of the accelerator device type when collecting gradients for reduction should still be present. @tjruwase @jeffra, let me make the additions.
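A hypothetical sketch of such a dtype getter, assuming the sparse tensor keeps its non-zero entries in a values tensor (the toy class below only stands in for the real SparseTensor):

```python
import torch

class ToySparseTensor:
    def __init__(self, indices: torch.Tensor, values: torch.Tensor):
        self.indices = indices
        self.values = values

    @property
    def dtype(self) -> torch.dtype:
        # mirror torch.Tensor.dtype so sparse and dense gradients can be
        # checked against the same list of torch dtypes
        return self.values.dtype
```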
@tjruwase @jeffra do _reduce_non_expert_gradients and _reduce_expert_gradients also need dtype changes, since we are adding the attribute? Also, I am not sure why the CLA check is appearing. @microsoft-github-policy-service agree company="Intel"
My proposal is for split_half_float_double_sparse() to return separate buckets for sparse and dense tensors. In that case, _reduce_non_expert_gradients() and _reduce_expert_gradients() would no longer need type tests in order to process the buckets appropriately. Please let me know if clarification is needed on this.
I am not sure what is going on with the CLA.
@tjruwase could you please re-review? Thanks
@tjruwase could you re-trigger CI and re-review? Thanks
@abhilash1910, thanks for persevering to deliver this awesome contribution.
@abhilash1910, can you please use this guide to address the formatting issues? Thanks!
@tjruwase could you help re-run the CI? Thanks
@abhilash1910, did you notice this CI failure?
@tjruwase Thanks for highlighting. I have made a fix for this; please let me know your suggestions. Thanks
@tjruwase could you review once? I think this might resolve the issue; I was not able to pinpoint the exact cause locally, so any pointers would be helpful.
Were you able to repro the issue locally?
@tjruwase could you help re-trigger CI? Thanks
Hi @abhilash1910 can you check and fix the following error? https://github.com/microsoft/DeepSpeed/actions/runs/6952457019/job/18941598524?pr=3842#step:8:4568
File "/tmp/actions-runner/_work/DeepSpeed/DeepSpeed/unit-test-venv/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 126, in split_half_float_double_sparse
assert t.dtype in supported_types, f"attempting to reduce an unsupported grad type: {t.dtype}"
AttributeError: 'str' object has no attribute 'dtype'
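For reference, one possible shape of a guard against this failure (purely illustrative, not necessarily the fix that landed):

```python
# Only objects that actually carry a .dtype are validated here; string entries
# (e.g. a type string such as SparseTensor.type()) are left to the sparse path.
import torch

def check_supported(t, supported_dtypes):
    if isinstance(t, torch.Tensor):
        assert t.dtype in supported_dtypes, \
            f"attempting to reduce an unsupported grad type: {t.dtype}"
        return True
    return False
```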
@tjruwase could you re-trigger CI (the issue seems to be fixed now)? Thanks.
@tjruwase could you help re-trigger the CI and re-review? Much appreciated.
Hi @abhilash1910, can you clarify whether the current failures in CI are related to your PR or are just a test issue? Thanks!
@delock I think it might be a test issue, as I am able to run the CI sparse test locally. I changed the code path and still see the same allclose issue. @tjruwase could you suggest any modifications for this? This is strange, as I tested in an isolated env and did not hit the issue.
Hi @abhilash1910, some suggestions:
- provide more details (hw, sw, log ...) of your local run so there might be a hint of the difference.
- try to modify the test as a standalone workload (not using pytest) so debugging becomes possible.
- separate the asserts for dense and sparse grads so you will know which type of tensor differs; this might narrow down the location of the possible bug (see the sketch after this list).
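A sketch of that third suggestion with placeholder names (the real test objects will differ):

```python
# Separate asserts so a failure points at either the dense or the sparse path.
import torch

def compare_grads(dense_expected, dense_actual, sparse_expected, sparse_actual,
                  rtol=1e-5, atol=1e-8):
    assert torch.allclose(dense_expected, dense_actual, rtol=rtol, atol=atol), \
        "dense gradients differ"
    assert torch.allclose(sparse_expected, sparse_actual, rtol=rtol, atol=atol), \
        "sparse gradients differ"
```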
Thanks @inkcherry for highlighting the boundary issue; it seems it will pass the CI now. @tjruwase could you help re-trigger the CI? Thanks
