
[Full DTensor] Initial skeleton for full_dtensor mode

fegin opened this issue 1 month ago · 0 comments

Stack from ghstack (oldest at bottom):

  • -> #2049
  • #2029


This PR introduces an initial prototype and skeleton for fully DTensor-based training. The current code builds on SimpleFSDP, but we anticipate developing our own parameterization to better serve our specific use case, as SimpleFSDP's parameterization is insufficient in several ways. For instance, the `parallelize_buffers()` implementation in this PR will not function correctly when additional parallelization strategies are applied. Despite these limitations, this PR provides a starting point for experimenting with a full DTensor trainer.
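To make the buffer-parallelization concern concrete, below is a minimal sketch (not the PR's actual implementation) of what a helper in the spirit of `parallelize_buffers()` might do: convert each plain-tensor buffer of a module into a DTensor replicated over a data-parallel mesh. The helper name and the fixed `Replicate()` placement are illustrative assumptions; a hard-coded placement like this is exactly what breaks once other parallelisms (e.g., tensor-parallel sharding) also need a say in buffer placement.

```python
# Sketch only: assumes torch.distributed is already initialized (e.g. via torchrun).
import torch
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Replicate, distribute_tensor


def parallelize_buffers_sketch(module: nn.Module, mesh) -> None:
    """Replicate every buffer in `module` as a DTensor over `mesh`."""
    for submodule in module.modules():
        for name, buf in list(submodule.named_buffers(recurse=False)):
            # Hard-coding Replicate() is the limitation discussed above:
            # it ignores any other parallelism's desired placement.
            dbuf = distribute_tensor(buf, mesh, [Replicate()])
            submodule.register_buffer(name, dbuf)


if __name__ == "__main__":
    mesh = init_device_mesh("cuda", (torch.distributed.get_world_size(),))
    model = nn.BatchNorm1d(8)  # has running_mean / running_var buffers
    parallelize_buffers_sketch(model, mesh)
```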

Accuracy verification: HSDP SimpleFSDP vs. FSDP2

```
python3 scripts/loss_compare.py . . \
--baseline-options='--activation_checkpoint.mode="none" --parallelism.data_parallel_replicate_degree=2' \
--test-options='--model.name full_dtensor.llama3 --activation_checkpoint.mode="none" --parallelism.data_parallel_replicate_degree=2' \
--test-train-file=torchtitan.experiments.full_dtensor.train \
--steps=10 --assert-equal --no-seed-checkpoint
[LOSS_COMPARE]
[LOSS_COMPARE] Asserting losses are equal...
[LOSS_COMPARE] Baseline log: /tmp/baseline_training.log
[LOSS_COMPARE] Test log: /tmp/test_training.log
[LOSS_COMPARE] Extracted 100 steps from baseline log
[LOSS_COMPARE] Extracted 100 steps from test log
test_losses_equal
(__main__.assert_losses_equal.<locals>.LossEqualityTest.test_losses_equal)
... ok

----------------------------------------------------------------------
Ran 1 test in 0.000s

OK
```
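For readers unfamiliar with the comparison flow, here is a minimal sketch of how a loss-equality assertion in the spirit of `loss_compare.py`'s `assert_losses_equal` could work: extract per-step losses from the two training logs and compare them exactly (matching the dynamically built `LossEqualityTest.test_losses_equal` seen in the output above). The log format, regex, and exact-match policy are illustrative assumptions, not the script's actual implementation.

```python
import re
import unittest

# Assumed log line format, e.g. "step: 10  loss: 7.5461"
LOSS_RE = re.compile(r"step:\s*(\d+).*?loss:\s*([0-9.]+)")


def extract_losses(log_path: str) -> dict[int, float]:
    """Map each training step found in the log to its loss value."""
    losses: dict[int, float] = {}
    with open(log_path) as f:
        for line in f:
            m = LOSS_RE.search(line)
            if m:
                losses[int(m.group(1))] = float(m.group(2))
    return losses


def assert_losses_equal(baseline_log: str, test_log: str) -> None:
    baseline, test = extract_losses(baseline_log), extract_losses(test_log)

    class LossEqualityTest(unittest.TestCase):
        def test_losses_equal(self):
            self.assertEqual(baseline.keys(), test.keys())
            for step in sorted(baseline):
                # --assert-equal implies identical losses, so compare
                # exactly rather than with a floating-point tolerance.
                self.assertEqual(baseline[step], test[step], f"step {step}")

    suite = unittest.TestLoader().loadTestsFromTestCase(LossEqualityTest)
    unittest.TextTestRunner(verbosity=2).run(suite)
```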

Note that --no-seed-checkpoint is used because we observed an accuracy mismatch when a seed checkpoint was used.

fegin · Nov 17 '25