[Compiler Toolkit] Enable nested_compile_region on TransformerBlock
Needs to run with the fix in https://github.com/pytorch/pytorch/pull/166702:
```
NGPU=8 CONFIG_FILE=./torchtitan/models/llama3/train_configs/debug_model.toml ./run_train.sh --model.name compiler_toolkit.llama3 --parallelism.data_parallel_shard_degree=2 --parallelism.tensor_parallel_degree=4
```
Current output: P2016557983
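For reference, the change boils down to marking each TransformerBlock's `forward` with `torch.compiler.nested_compile_region`, so Dynamo captures each block call as an `invoke_subgraph` HOP instead of inlining it. A minimal sketch on a toy model (the `TransformerBlock`/`ToyModel` classes below are illustrative stand-ins, not the actual torchtitan code):

```python
import torch
import torch.nn as nn


class TransformerBlock(nn.Module):
    """Illustrative stand-in for torchtitan's llama3 TransformerBlock."""

    def __init__(self, dim: int):
        super().__init__()
        self.attn = nn.Linear(dim, dim)
        self.mlp = nn.Linear(dim, dim)

    # Marking forward as a nested compile region tells Dynamo to capture
    # each call as an invoke_subgraph HOP rather than inlining the block.
    @torch.compiler.nested_compile_region
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.mlp(torch.relu(self.attn(x)))


class ToyModel(nn.Module):
    def __init__(self, dim: int, n_layers: int):
        super().__init__()
        self.layers = nn.ModuleList(TransformerBlock(dim) for _ in range(n_layers))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            x = layer(x)
        return x


model = ToyModel(dim=16, n_layers=4)
compiled = torch.compile(model, backend="eager", fullgraph=True)
out = compiled(torch.randn(2, 16))
```

Since every block runs the same `forward` code, the intended capture is one shared subgraph with one `invoke_subgraph` call site per layer, each receiving that layer's weights as inputs.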
Observations
- Each TransformerBlock becomes its own subgraph (look for `subgraph_0`, `subgraph_2`, ...). This is not what we want: we should see a single instance of `subgraph_0`, with multiple `invoke_subgraph` nodes all calling that same `subgraph_0` with different layer weights.
- Due to AC, we also get `hop.tag_activation_checkpoint(subgraph_1)`, where `subgraph_1` internally calls `invoke_subgraph` for the TransformerBlock, so we end up in a nested HOP/subgraph region (a toy sketch of this nesting follows the list).
- `dynamo_graph_capture` passes; we currently fail in `aot_export_joint`. Looks like a DTensor x Dynamo soft spot.
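To make the second observation concrete, here is a toy sketch of the nested HOP structure, using plain `torch.utils.checkpoint` as a stand-in for torchtitan's AC wrapping (the real failure additionally involves DTensor, which this sketch omits):

```python
import torch
from torch.utils.checkpoint import checkpoint


@torch.compiler.nested_compile_region
def block(x: torch.Tensor) -> torch.Tensor:
    # Stand-in for the TransformerBlock body.
    return torch.sin(x) * 2


@torch.compile(backend="eager", fullgraph=True)
def fn(x: torch.Tensor) -> torch.Tensor:
    # Dynamo traces checkpoint(...) into hop.tag_activation_checkpoint;
    # its subgraph then contains an invoke_subgraph node for `block`,
    # i.e. one HOP nested inside another HOP's subgraph.
    return checkpoint(block, x, use_reentrant=False)


out = fn(torch.randn(4, requires_grad=True))
```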
cc @williamwen42