Chien-Chin Huang

Results: 119 comments of Chien-Chin Huang

Link to landed trunk PR (if applicable):
* https://github.com/pytorch/pytorch/pull/128446
* https://github.com/pytorch/pytorch/pull/128755 (test)

Link to release branch PR:
* https://github.com/pytorch/pytorch/pull/129254
* https://github.com/pytorch/pytorch/pull/129255 (test)

Criteria Category:
* Critical fixes to crashes when...

Just a random thought: do we need `torch.pipelining.Schedule` to do step 2, option 2? Can we first traverse and bookkeep the module initialization order (before the pipeline is applied) and then replay...
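A minimal sketch of the traverse-and-replay idea, assuming each submodule exposes a per-module `init_weights` hook (the hook name and both helpers are assumptions for illustration, not an existing API):

```python
import torch.nn as nn

def record_init_order(model: nn.Module) -> list[str]:
    # Bookkeep the FQN order of submodules before pipelining is applied.
    return [name for name, _ in model.named_modules() if name]

def replay_init(stage_module: nn.Module, order: list[str]) -> None:
    # Re-initialize only the submodules this pipeline stage still owns,
    # following the recorded pre-split order.
    owned = dict(stage_module.named_modules())
    for name in order:
        module = owned.get(name)
        if module is not None and hasattr(module, "init_weights"):
            module.init_weights()  # hypothetical per-module init hook
```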

> I'm not even considering the ordering of modules being nontrivial. I just assume it's a straightforward ordering (iterating the layers dict). But I suppose we do not have to keep...

@tianyu-l I changed the `data_parallel_replicate_degree` default to 1. But I don't see why the `ParallelDims` logic would be simplified; are we not allowing `data_parallel_replicate_degree` to be -1? That will be a different...
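For reference, a minimal sketch of the -1 convention, assuming it means "infer this degree from whatever is left of the world size" (the function is illustrative, not the actual `ParallelDims` code):

```python
def infer_degree(world_size: int, degrees: dict[str, int]) -> dict[str, int]:
    # At most one dimension may be -1; it absorbs the remaining world size.
    explicit, infer_key = 1, None
    for name, d in degrees.items():
        if d == -1:
            assert infer_key is None, "only one dimension may be -1"
            infer_key = name
        else:
            explicit *= d
    if infer_key is not None:
        assert world_size % explicit == 0, "explicit degrees must divide world size"
        degrees = {**degrees, infer_key: world_size // explicit}
    return degrees

# e.g. world_size=8, {"dp_replicate": -1, "dp_shard": 2, "tp": 2} -> dp_replicate=2
```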

dcp.save() works with both DTensor and Tensor. Rank 0 will determine what to save on each rank. If tensors are not duplicated (i.e., the FQNs are different), all the tensors will be saved...
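A minimal usage sketch with `torch.distributed.checkpoint`, assuming `model` already exists and its state dict mixes DTensors and plain Tensors (the checkpoint path is a placeholder):

```python
import torch.distributed.checkpoint as dcp

# Entries with distinct FQNs are all written out; duplicated (replicated)
# entries are deduplicated across ranks as described above.
state_dict = {"model": model.state_dict()}
dcp.save(state_dict, checkpoint_id="/tmp/checkpoint")  # path is a placeholder
```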

3) looks suspicious, and just as you mentioned, there are key conflicts. We have tested the non-virtual pipeline and there is no key conflict. Any insight about this, @wconstab, @H-Huang...

Is it possible that the tokenizer is corrupted? Can you re-download the tokenizer and try again?

For the second method mentioned by @yifuwang, both `ColwiseParallel` and `RowwiseParallel` have options to convert the output to local tensors. Both also convert the input tensor to a DTensor if...
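A minimal sketch of that option, assuming a 1-D tensor-parallel mesh and a model with two linears; the module names in the plan and the mesh size are placeholders:

```python
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)

tp_mesh = init_device_mesh("cuda", (8,))  # 8-way TP; size is a placeholder

plan = {
    # keep the intermediate activation as a DTensor between the two linears
    "w1": ColwiseParallel(use_local_output=False),
    # convert the final output back to a plain local tensor
    "w2": RowwiseParallel(use_local_output=True),
}
model = parallelize_module(model, tp_mesh, plan)
```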

This looks like a checkpointing problem. Can you confirm: 1) does this only happen after resuming from a checkpoint? 2) does this only happen when the vocab size is not divisible by the world...

Added some experiments.

1. I added the following code after the checkpoint save in the trainer and set the saving frequency to 10; the loss curve is fine.

```
if self.step...
```
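The snippet above is truncated; a hypothetical reconstruction of such a save-then-immediately-reload sanity check might look like this (`trainer.checkpointer` and `trainer.model_parts` are made-up names, not the trainer's actual API):

```python
import torch

def verify_checkpoint_roundtrip(trainer, step: int) -> None:
    # Hypothetical sketch: save, reload right away, and compare weights.
    trainer.checkpointer.save(step)
    before = {
        k: v.detach().clone()
        for k, v in trainer.model_parts[0].state_dict().items()
    }
    trainer.checkpointer.load(step=step)
    for k, v in trainer.model_parts[0].state_dict().items():
        torch.testing.assert_close(
            v, before[k], msg=f"weight changed after reload: {k}"
        )
```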