Use Torchrec OSS Planner (2/2)
Summary:
Incorporate TorchRec's OSS planner into train_module. The OSS planner is enabled by setting the "use_torchrec_oss_planner" flag to true in SharderOptions (https://fburl.com/code/xjlbrscm). When the flag is true, run_planner calls run_oss_planner and returns a plan if one can be found. If no plan is found, a ShardingError is raised, which matches the current behavior of run_planner. This use of the planner does not yet support UVM hybrid mode, because the plan search can take more than 20 min, which is too long for a dry run.
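The flag-controlled dispatch described above can be sketched roughly as follows. This is a minimal illustration and not the code in this diff: the SharderOptions and ShardingError classes here are local stand-ins, the run_legacy_planner callable is a hypothetical name for the existing planner path, and the convention that a planner returns None when no plan is found is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class ShardingError(Exception):
    """Local stand-in for the ShardingError mentioned in the summary."""


@dataclass
class SharderOptions:
    """Local stand-in; only the flag named in the summary is modeled."""
    use_torchrec_oss_planner: bool = False


def run_planner(
    options: SharderOptions,
    run_oss_planner: Callable[[], Optional[dict]],
    run_legacy_planner: Callable[[], Optional[dict]],  # hypothetical name
) -> dict:
    """Pick a planner based on the flag and fail the same way for both."""
    if options.use_torchrec_oss_planner:
        plan = run_oss_planner()
    else:
        plan = run_legacy_planner()
    # Assumption: a planner signals "no feasible plan" by returning None.
    if plan is None:
        raise ShardingError("no feasible sharding plan found")
    return plan
```

Passing the two planners in as callables keeps the sketch self-contained; in the real module they would be ordinary functions that close over the model and topology.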
The components for the OSS planner are as follows:
(1) Topology: both hbm_cap and ddr_cap come from planner_storage_in_gb. Because planner_storage_in_gb["hbm"] already has reserved_hbm_size subtracted when it is constructed, reserved_hbm_size is added back so that the planner sees the full HBM capacity.
(2) reserved_storage: HeuristicalStorageReservation, with the reserved percentage computed as reserved_hbm_size / total_hbm_storage.
(3) perf_estimator: not specified, so the default EmbeddingPerfEstimator is used.
(4) storage_estimator: not specified, so the default EmbeddingStorageEstimator is used.
(5) constraints: currently sharding_types, compute_kernels, pooling factors and min_partition are specified. In the future, caching_ratio (derived from the user-specified reserved_hbm_size_for_cache) can be added.
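The arithmetic behind items (1) and (2) can be illustrated with a small sketch. It assumes that reserved_hbm_size is expressed in GB like planner_storage_in_gb, that the DDR entry is stored under the key "ddr", and that total_hbm_storage equals the HBM entry plus the reserved size. The PlannerStorageInputs and derive_planner_storage names are hypothetical and do not appear in this diff.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class PlannerStorageInputs:
    """Hypothetical container for the values handed to the planner."""
    hbm_cap_gb: float           # full HBM capacity the planner should see
    ddr_cap_gb: float           # DDR capacity, passed through unchanged
    reserved_percentage: float  # fraction of HBM to reserve


def derive_planner_storage(
    planner_storage_in_gb: Dict[str, float],
    reserved_hbm_size_gb: float,  # assumed to be in GB
) -> PlannerStorageInputs:
    if reserved_hbm_size_gb < 0:
        raise ValueError("reserved_hbm_size must be non-negative")
    # planner_storage_in_gb["hbm"] excludes the reserved part, so add it back.
    total_hbm_gb = planner_storage_in_gb["hbm"] + reserved_hbm_size_gb
    if total_hbm_gb <= 0:
        raise ValueError("total HBM storage must be positive")
    return PlannerStorageInputs(
        hbm_cap_gb=total_hbm_gb,
        ddr_cap_gb=planner_storage_in_gb["ddr"],  # assumed key name
        reserved_percentage=reserved_hbm_size_gb / total_hbm_gb,
    )
```

For example, with 28 GB of usable HBM and 4 GB reserved, the planner would see 32 GB of HBM with a reserved fraction of 0.125, so the usable budget after reservation is again 28 GB.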
Differential Revision: D37948466
This pull request was exported from Phabricator. Differential Revision: D37948466