DeepSpeed icon indicating copy to clipboard operation
DeepSpeed copied to clipboard

Adding DS Feature API in accelerator

Open duli2012 opened this issue 1 year ago • 3 comments

This PR is a prototype of adding API for capabilities in accelerators including:

  1. define capabilities in abstract_accelerator
  2. set capabilities in cuda_accelerator

Welcome hardware vendors to define capabilities for their own hardware.

duli2012 avatar Apr 16 '24 20:04 duli2012

Hi @duli2012 thanks for adding this interface. I have always been worring accelerator interface# may grow too big when we propose more and more capabilities into it, this interface is a good way to put all capabilities into one interface. My comments below:

  1. For zero1/2/3, it sounds like accelerator agnostic features. Basically it should be supported if accelerator runtime and communication collectives is well implemented.
  2. For sparse_attn, it is in the category of whether certain OpBuilder is implemented. Previous way of telling whether ops is implemented is through compatible ops as the following link. I think its better to move this capability group into capability interface since the new interfacve look more intuitive to use. Is it possible to make it auto reflect accelerator OpBuilder implementation so there will be less maintentance work? A minor suggestion is name capability in this group with same prefix i.e. op.sparse_attn https://github.com/microsoft/DeepSpeed/blob/b22706a7211366abf2df98a0d118ea1d3a837e21/tests/unit/inference/test_inference.py#L336
  3. For 1-bit adam, I agree this is a case that could be covered by this interface.

What should we do with already existing accelerator interface that falls into category of capabilities? Should they be added to the new inferface or just keep them that way? I think that's an open to discuss.

delock avatar Apr 18 '24 00:04 delock

Hi @duli2012 thanks for adding this interface. I have always been worring accelerator interface# may grow too big when we propose more and more capabilities into it, this interface is a good way to put all capabilities into one interface. My comments below:

  1. For zero1/2/3, it sounds like accelerator agnostic features. Basically it should be supported if accelerator runtime and communication collectives is well implemented.
  2. For sparse_attn, it is in the category of whether certain OpBuilder is implemented. Previous way of telling whether ops is implemented is through compatible ops as the following link. I think its better to move this capability group into capability interface since the new interfacve look more intuitive to use. Is it possible to make it auto reflect accelerator OpBuilder implementation so there will be less maintentance work? A minor suggestion is name capability in this group with same prefix i.e. op.sparse_attn https://github.com/microsoft/DeepSpeed/blob/b22706a7211366abf2df98a0d118ea1d3a837e21/tests/unit/inference/test_inference.py#L336
  3. For 1-bit adam, I agree this is a case that could be covered by this interface.

What should we do with already existing accelerator interface that falls into category of capabilities? Should they be added to the new inferface or just keep them that way? I think that's an open to discuss.

Thanks @delock for your comments. I integrated most of them. please take a look.

duli2012 avatar Apr 20 '24 00:04 duli2012

Thanks @duli2012 , my intuition is zero 1/2/3 should not among accelerator feature list. Zero stage code is shared between different accelerators and there is no interface specific to zero stage, so I wonder what makes an accelerator not support zero features?

For OP related features i.e. OP_ASYNC_IO, etc. I think there needs to be an mechanism sync them with opbuilder implementation state automatically, otherwise there will be manual maintenance cost each time a new op is introduced.

delock avatar Apr 24 '24 07:04 delock