fix opt-350m shard loading issue in AutoTP
@delock @tjruwase please help review
@tjruwase @jeffra could you assign a reviewer for this PR? This PR fixes sharded checkpoint loading for OPT with AutoTP and improves OPT+AutoTP usability. It is needed when running OPT models on a CPU server with limited memory.
@RezaYazdaniAminabadi can you review this PR? This PR fixes OPT sharded loading for AutoTP. Previously, only OPT-125m had sharded checkpoint loading; with this fix, OPT-350m and larger models will have sharded checkpoint loading as well.
@RezaYazdaniAminabadi Hi, a quick check on whether this PR is still under consideration. We have verified this PR on the CPU accelerator and would like to know whether it can be merged into the master branch. Thanks!