Adding DS Feature API in accelerator
This PR is a prototype of adding API for capabilities in accelerators including:
- define capabilities in abstract_accelerator
- set capabilities in cuda_accelerator
Welcome hardware vendors to define capabilities for their own hardware.
Hi @duli2012 thanks for adding this interface. I have always been worring accelerator interface# may grow too big when we propose more and more capabilities into it, this interface is a good way to put all capabilities into one interface. My comments below:
- For zero1/2/3, it sounds like accelerator agnostic features. Basically it should be supported if accelerator runtime and communication collectives is well implemented.
- For sparse_attn, it is in the category of whether certain OpBuilder is implemented. Previous way of telling whether ops is implemented is through compatible ops as the following link. I think its better to move this capability group into capability interface since the new interfacve look more intuitive to use. Is it possible to make it auto reflect accelerator OpBuilder implementation so there will be less maintentance work? A minor suggestion is name capability in this group with same prefix i.e.
op.sparse_attnhttps://github.com/microsoft/DeepSpeed/blob/b22706a7211366abf2df98a0d118ea1d3a837e21/tests/unit/inference/test_inference.py#L336 - For 1-bit adam, I agree this is a case that could be covered by this interface.
What should we do with already existing accelerator interface that falls into category of capabilities? Should they be added to the new inferface or just keep them that way? I think that's an open to discuss.
Hi @duli2012 thanks for adding this interface. I have always been worring accelerator interface# may grow too big when we propose more and more capabilities into it, this interface is a good way to put all capabilities into one interface. My comments below:
- For zero1/2/3, it sounds like accelerator agnostic features. Basically it should be supported if accelerator runtime and communication collectives is well implemented.
- For sparse_attn, it is in the category of whether certain OpBuilder is implemented. Previous way of telling whether ops is implemented is through compatible ops as the following link. I think its better to move this capability group into capability interface since the new interfacve look more intuitive to use. Is it possible to make it auto reflect accelerator OpBuilder implementation so there will be less maintentance work? A minor suggestion is name capability in this group with same prefix i.e.
op.sparse_attnhttps://github.com/microsoft/DeepSpeed/blob/b22706a7211366abf2df98a0d118ea1d3a837e21/tests/unit/inference/test_inference.py#L336- For 1-bit adam, I agree this is a case that could be covered by this interface.
What should we do with already existing accelerator interface that falls into category of capabilities? Should they be added to the new inferface or just keep them that way? I think that's an open to discuss.
Thanks @delock for your comments. I integrated most of them. please take a look.
Thanks @duli2012 , my intuition is zero 1/2/3 should not among accelerator feature list. Zero stage code is shared between different accelerators and there is no interface specific to zero stage, so I wonder what makes an accelerator not support zero features?
For OP related features i.e. OP_ASYNC_IO, etc. I think there needs to be an mechanism sync them with opbuilder implementation state automatically, otherwise there will be manual maintenance cost each time a new op is introduced.