Jiarui Fang(方佳瑞)

220 comments by Jiarui Fang(方佳瑞)

A similar method was proposed in the [TurboTransformer Paper](https://dl.acm.org/doi/10.1145/3437801.3441578) to reduce sync times in CUDA programming...

I believe the feature depends on #256

@wohaocaiji @zixiliuUSC I think the import error has been fixed in the latest main branch.

> I also don't think it makes sense for Colossal AI to use the name of `ShardedModel`. Because for ZeRO1 and 2 we don't actually split the model. This name...

We are planning to provide this feature this week (28th May).

I agree that 3D parallelism can shrink the peak activation footprint on one GPU at the cost of more communication. The method definitely works in some special cases. Maybe a simple searching...

@1SAA The communication profiling results may support some of my assumptions in the discussion.

I think ZeRO does not support `pack_padded_sequence` right now. RNNs usually do not have many parameters, so DP is often enough for them, and we do not test RNN...