Andrew Gu
cc: @Chillee would be curious to hear your thoughts
We do not have a unit test that can capture the difference between fp32/bf16 vs. fp16 division factors yet. It might be simpler to test this when we do implement...
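As a rough standalone illustration (not the eventual unit test, and independent of any FSDP internals), the sketch below shows why the fp16 factorization matters: summing many fp16 gradients and dividing once by the world size can overflow, whereas pre-dividing by roughly `sqrt(world_size)` before the reduction and post-dividing by the same factor afterwards keeps the intermediate values in range. The values, shapes, and world size here are made up for the example.

```python
import torch

world_size = 64
# Hypothetical per-rank fp16 gradient values chosen so that a plain sum
# across ranks exceeds the fp16 max (~65504) and overflows to inf.
local_grads = [torch.full((4,), 2000.0, dtype=torch.float16) for _ in range(world_size)]

# Post-divide only (fp32/bf16-style factorization): sum first, divide once.
post_only = sum(local_grads) / world_size  # intermediate sum is 128000 -> inf in fp16

# Pre- and post-divide (fp16-style factorization): divide by sqrt(world_size)
# on each rank before the reduction, then by sqrt(world_size) again after it.
factor = world_size ** 0.5
pre_post = sum(g / factor for g in local_grads) / factor

print(post_only)  # inf in fp16 because the un-divided sum overflows
print(pre_post)   # ~2000.0, the correct average
```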
> > * Upon the second microbatch, rank 0 after its reduce-scatter will additionally have its shard of $\frac{1}{8}\sum_{i \in S(0)} g_i^{(2)}$. If we only all-reduce this, then this second microbatch's gradients become...
> > Sorry, the point of what we are trying to do is to _not_ all-reduce the first microbatch's gradients. This is to save communication. Just reduce-scattering is enough to...
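For concreteness, here is one way to write out the accumulation the exchange above refers to, under the reading that $g_i^{(k)}$ is rank $i$'s local gradient for microbatch $k$, $S(0)$ is the set of ranks whose reduce-scatter output lands on rank 0, and $\tfrac{1}{8}$ is the division factor for an 8-rank reduce-scatter group (this reading of the notation is my assumption, not stated explicitly above). Because each microbatch's reduce-scatter output is added into the same sharded gradient buffer, rank 0's shard goes from

$$\frac{1}{8}\sum_{i \in S(0)} g_i^{(1)} \quad\longrightarrow\quad \frac{1}{8}\sum_{i \in S(0)} \left(g_i^{(1)} + g_i^{(2)}\right)$$

after the second microbatch, so any later collective that operates on the accumulated shard sees both microbatches' contributions at once.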
@pytorchbot merge
This seems like an important / high(er) priority issue since FSDP + PP generally wants `reshard_after_forward=False`.
- `reshard_after_forward=True` == `ShardingStrategy.FULL_SHARD` == ZeRO-3
- `reshard_after_forward=False` == `ShardingStrategy.SHARD_GRAD_OP` == ZeRO-2

It is still the same (cannot be automatically figured out -- only the root module auto changes to `reshard_after_forward=False` since...
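For anyone mapping between the two APIs, a rough sketch of how each spelling looks (the import path for `fully_shard` has moved across releases, and the model/wrapping here is purely illustrative; run under `torchrun` with one GPU per rank):

```python
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy
# In newer releases fully_shard is exported from torch.distributed.fsdp;
# in older ones it lived under torch.distributed._composable.fsdp.
from torch.distributed.fsdp import fully_shard

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

def make_model():
    return nn.Sequential(nn.Linear(1024, 1024), nn.Linear(1024, 1024)).cuda()

# FSDP1: ZeRO-2-style behavior (parameters stay unsharded after forward).
zero2_fsdp1 = FSDP(make_model(), sharding_strategy=ShardingStrategy.SHARD_GRAD_OP)

# FSDP2: the same behavior expressed per module.
zero2_fsdp2 = make_model()
fully_shard(zero2_fsdp2, reshard_after_forward=False)
```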
I see. I think since we do not know the execution order in general, we cannot do it easily in the FSDP API itself, which is a building block. Maybe...
The advantage of meta-device init is that it is as fast as possible: the _sharded_ parameters are directly initialized on _GPU_. Any other flow requires something more, e.g. (1) initializing...
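A minimal sketch of that meta-device flow with the per-module API, assuming an initialized process group and one CUDA device per rank; `TinyModel` and its `init_weights` are stand-ins I made up to keep the example self-contained, and the `fully_shard` import path may differ by release:

```python
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.distributed.fsdp import fully_shard  # torch.distributed._composable.fsdp in older releases

class TinyModel(nn.Module):
    """Stand-in for a real model; only here to make the sketch self-contained."""
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(1024, 1024) for _ in range(4))

    def init_weights(self):
        for layer in self.layers:
            nn.init.trunc_normal_(layer.weight, std=0.02)
            nn.init.zeros_(layer.bias)

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# (1) Construct on meta: no real memory is allocated, so construction is instant.
with torch.device("meta"):
    model = TinyModel()

# (2) Apply FSDP per block and then to the root; parameters become sharded (still on meta).
for layer in model.layers:
    fully_shard(layer)
fully_shard(model)

# (3) Materialize only this rank's shard directly on GPU, then run the real init on it.
model.to_empty(device="cuda")
model.init_weights()
```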
@XinDongol A few clarifications:
```
model = model_cls.from_model_args(model_config)
# (1) If the `model_cls.__init__` did not already call `init_weights()` or similar
model.init_weights()
# (2) Apply FSDP with multiple FSDP calls, e.g....
```
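Purely as an illustration of what step (2) could look like (the per-block loop, `model.layers`, and the import path are my guesses for this sketch, not the actual continuation of the snippet above):

```python
from torch.distributed.fsdp import fully_shard  # torch.distributed._composable.fsdp in older releases

# (2) continued (hypothetical): one fully_shard call per transformer block,
# then one call on the root module so the remaining parameters are also sharded.
for block in model.layers:
    fully_shard(block)
fully_shard(model)
```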