torchrec
PyTorch domain library for recommendation systems
Summary: PP requires non-contiguous DMP sharding. In today's torchrec planner, there are various locations where ranks are assumed to be contiguous; this prevents intra-host pipeline parallelism from utilizing...
Reviewed By: PaulZhang12 Differential Revision: D55389988
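A minimal sketch of the kind of rank layout this refers to, using only `torch.distributed`; the helper name and the stage-striding scheme are assumptions for illustration, not torchrec planner code:

```python
import torch.distributed as dist

def build_stage_sharding_group(world_size: int, num_stages: int, stage: int):
    # Hypothetical helper: with world_size=8 and num_stages=2, stage 0 would
    # shard over ranks [0, 2, 4, 6] -- a non-contiguous set, which a planner
    # that assumes contiguous ranks cannot express.
    ranks = list(range(stage, world_size, num_stages))
    # Assumes torch.distributed.init_process_group() has already been called.
    return dist.new_group(ranks=ranks), ranks
```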
Hi team, in `ShardedEmbeddingBagCollection` I found that torchrec explicitly makes the data-parallel lookup a `DistributedDataParallel` ([code here](https://github.com/pytorch/torchrec/blob/main/torchrec/distributed/embeddingbag.py#L503)). I also know that inside [DistributedModelParallel](https://github.com/pytorch/torchrec/blob/main/torchrec/distributed/model_parallel.py#L216) there is a DDP wrapper to wrap the non-sharded part...
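For context, a minimal sketch of the usage being discussed (assumes an initialized process group and a CUDA device; table names and sizes are placeholders): `DistributedModelParallel` shards the `EmbeddingBagCollection` and, by default, wraps the remaining non-sharded parameters with `DistributedDataParallel`.

```python
import torch
import torchrec
from torchrec.distributed.model_parallel import DistributedModelParallel

device = torch.device("cuda", torch.cuda.current_device())

# Build the module on the meta device; DMP materializes and shards it.
ebc = torchrec.EmbeddingBagCollection(
    device="meta",
    tables=[
        torchrec.EmbeddingBagConfig(
            name="t1",
            embedding_dim=64,
            num_embeddings=1000,
            feature_names=["f0"],
            pooling=torchrec.PoolingType.SUM,
        )
    ],
)

model = DistributedModelParallel(module=ebc, device=device)
```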
Summary: Sort group keys in embedding_sharding so that keys (lookups) with has_feature_processor=True execute first. Differential Revision: D55045404
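A hypothetical sketch of the ordering described above (the dict layout is illustrative, not the embedding_sharding data structure): sort grouping keys so entries with has_feature_processor=True come before those without.

```python
groups = [
    {"name": "plain", "has_feature_processor": False},
    {"name": "weighted", "has_feature_processor": True},
]
# False sorts before True, so negate the flag to put True-groups first.
groups.sort(key=lambda g: not g["has_feature_processor"])
assert groups[0]["name"] == "weighted"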
When I try to walk through the steps of the TorchRec Colab demo, I get an error. Here is the link to the demo: https://colab.research.google.com/github/pytorch/torchrec/blob/main/Torchrec_Introduction.ipynb#scrollTo=4-v17rxkopQw
Summary: When the available scale-up budget is larger than the amount of memory needed to promote all eligible scale-up tables to HBM, limit the search space to this ceiling; otherwise we'll consume...
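A hypothetical sketch of the capping logic (the function name and units are assumptions): never search beyond the HBM needed to promote every eligible scale-up table, even if the configured budget is larger.

```python
def effective_scaleup_budget(budget_bytes: int, eligible_table_sizes: list[int]) -> int:
    # Cap the search space at the total memory the eligible tables could consume.
    total_needed = sum(eligible_table_sizes)
    return min(budget_bytes, total_needed)

# e.g. a 64 GiB budget but only 10 GiB of eligible tables -> search up to 10 GiB.
assert effective_scaleup_budget(64 * 2**30, [4 * 2**30, 6 * 2**30]) == 10 * 2**30
```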
Hello, I generated a KJT whose lengths tensor is full of 1s, and then I get an error: keys = ['f0', 'f1', 'f2', 'f3', 'f4', 'f5', ...], stride...
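A minimal sketch of the shape being described (feature names and values are placeholders): a KeyedJaggedTensor whose lengths tensor is all 1s, i.e. one value per feature per sample. Here there are 3 features and a batch size of 2, so lengths has 3 * 2 = 6 entries and values has 6 ids.

```python
import torch
from torchrec.sparse.jagged_tensor import KeyedJaggedTensor

kjt = KeyedJaggedTensor(
    keys=["f0", "f1", "f2"],
    values=torch.tensor([1, 2, 3, 4, 5, 6]),
    lengths=torch.ones(6, dtype=torch.int64),
)
print(kjt.stride())  # batch size inferred from lengths and keys: 2
```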
Summary: As titled. Created from CodeHub with https://fburl.com/edit-in-codehub Reviewed By: sarckk Differential Revision: D54489049