Chien-Chin Huang

Results: 119 comments of Chien-Chin Huang

Link to landed trunk PR (if applicable):
* https://github.com/pytorch/pytorch/pull/128446
* https://github.com/pytorch/pytorch/pull/128755 (test)

Link to release branch PR:
* https://github.com/pytorch/pytorch/pull/129254
* https://github.com/pytorch/pytorch/pull/129255 (test)

Criteria Category:
* Critical fixes to crashes when...

Just a random thought: do we need `torch.pipelining.Schedule` to do step 2, option 2? Can we first traverse and bookkeep the module initialization order (before the pipeline is applied) and then replay...
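A minimal sketch of the traverse-and-replay idea, assuming each submodule exposes a per-module `init_weights` hook (the hook name and both helpers are assumptions for illustration, not an existing API):

```python
import torch.nn as nn

def record_init_order(model: nn.Module) -> list[str]:
    # Bookkeep the FQN order of submodules before pipelining is applied.
    return [name for name, _ in model.named_modules() if name]

def replay_init(stage_module: nn.Module, order: list[str]) -> None:
    # Re-initialize only the submodules this pipeline stage still owns,
    # following the recorded pre-split order.
    owned = dict(stage_module.named_modules())
    for name in order:
        module = owned.get(name)
        if module is not None and hasattr(module, "init_weights"):
            module.init_weights()  # hypothetical per-module init hook
```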

> I'm not even considering the ordering of modules being nontrivial. I just assume it's a straightforward ordering (iterating the layers dict). But I suppose we do not have to keep...

@tianyu-l I changed the `data_parallel_replicate_degree` default to 1. But I don't see why the `ParallelDims` logic would be simplified; are we not allowing `data_parallel_replicate_degree` to be -1? That will be a different...
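For reference, a minimal sketch of the -1 convention, assuming it means "infer this degree from whatever is left of the world size" (the function is illustrative, not the actual `ParallelDims` code):

```python
def infer_degree(world_size: int, degrees: dict[str, int]) -> dict[str, int]:
    # At most one dimension may be -1; it absorbs the remaining world size.
    explicit, infer_key = 1, None
    for name, d in degrees.items():
        if d == -1:
            assert infer_key is None, "only one dimension may be -1"
            infer_key = name
        else:
            explicit *= d
    if infer_key is not None:
        assert world_size % explicit == 0, "explicit degrees must divide world size"
        degrees = {**degrees, infer_key: world_size // explicit}
    return degrees

# e.g. world_size=8, {"dp_replicate": -1, "dp_shard": 2, "tp": 2} -> dp_replicate=2
```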

dcp.save() works with both DTensor and Tensor. Rank 0 will determine what to save on each rank. If tensors are not duplicated (i.e., the FQNs are different), all the tensors will be saved...
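A minimal usage sketch with `torch.distributed.checkpoint`, assuming `model` already exists and its state dict mixes DTensors and plain Tensors (the checkpoint path is a placeholder):

```python
import torch.distributed.checkpoint as dcp

# Entries with distinct FQNs are all written out; duplicated (replicated)
# entries are deduplicated across ranks as described above.
state_dict = {"model": model.state_dict()}
dcp.save(state_dict, checkpoint_id="/tmp/checkpoint")  # path is a placeholder
```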

3) looks suspicious, and just as you mentioned, there are key conflicts. We have tested the non-virtual pipeline and there is no key conflict. Any insight about this, @wconstab, @H-Huang...

Is it possible that the tokenizer is corrupted? Can you re-download the tokenizer and try again?

For the second method mentioned by @yifuwang, both `ColwiseParallel` and `RowwiseParallel` have options to convert the output to local tensors. Both also convert the input tensor to a DTensor if...
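A minimal sketch of that option, assuming a 1-D tensor-parallel mesh and a model with two linears; the module names in the plan and the mesh size are placeholders:

```python
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)

tp_mesh = init_device_mesh("cuda", (8,))  # 8-way TP; size is a placeholder

plan = {
    # keep the intermediate activation as a DTensor between the two linears
    "w1": ColwiseParallel(use_local_output=False),
    # convert the final output back to a plain local tensor
    "w2": RowwiseParallel(use_local_output=True),
}
model = parallelize_module(model, tp_mesh, plan)
```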

This looks like a checkpointing problem. Can you confirm: 1) does this only happen after resuming from a checkpoint? 2) does this only happen when the vocab size is not divisible by the world...

Added some experiments.

1. I added the following code after the checkpoint save in the trainer and set the saving frequency to 10; the loss curve is fine.

```
if self.step...
```
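The snippet above is truncated; a hypothetical reconstruction of such a save-then-immediately-reload sanity check might look like this (`trainer.checkpointer` and `trainer.model_parts` are made-up names, not the trainer's actual API):

```python
import torch

def verify_checkpoint_roundtrip(trainer, step: int) -> None:
    # Hypothetical sketch: save, reload right away, and compare weights.
    trainer.checkpointer.save(step)
    before = {
        k: v.detach().clone()
        for k, v in trainer.model_parts[0].state_dict().items()
    }
    trainer.checkpointer.load(step=step)
    for k, v in trainer.model_parts[0].state_dict().items():
        torch.testing.assert_close(
            v, before[k], msg=f"weight changed after reload: {k}"
        )
```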