[Compiler Toolkit] Enable nested_compile_region on TransformerBlock
Needs to run with the fix in https://github.com/pytorch/pytorch/pull/166702:
```
NGPU=8 CONFIG_FILE=./torchtitan/models/llama3/train_configs/debug_model.toml ./run_train.sh --model.name compiler_toolkit.llama3 --parallelism.data_parallel_shard_degree=2 --parallelism.tensor_parallel_degree=4
```
Current output: P2016557983
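For reference, the change boils down to marking each TransformerBlock's `forward` with `torch.compiler.nested_compile_region`, so Dynamo captures each block call as an `invoke_subgraph` HOP instead of inlining it. A minimal sketch on a toy model (the `TransformerBlock`/`ToyModel` classes below are illustrative stand-ins, not the actual torchtitan code):

```python
import torch
import torch.nn as nn


class TransformerBlock(nn.Module):
    """Illustrative stand-in for torchtitan's llama3 TransformerBlock."""

    def __init__(self, dim: int):
        super().__init__()
        self.attn = nn.Linear(dim, dim)
        self.mlp = nn.Linear(dim, dim)

    # Marking forward as a nested compile region tells Dynamo to capture
    # each call as an invoke_subgraph HOP rather than inlining the block.
    @torch.compiler.nested_compile_region
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.mlp(torch.relu(self.attn(x)))


class ToyModel(nn.Module):
    def __init__(self, dim: int, n_layers: int):
        super().__init__()
        self.layers = nn.ModuleList(TransformerBlock(dim) for _ in range(n_layers))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            x = layer(x)
        return x


model = ToyModel(dim=16, n_layers=4)
compiled = torch.compile(model, backend="eager", fullgraph=True)
out = compiled(torch.randn(2, 16))
```

Since every block runs the same `forward` code, the intended capture is one shared subgraph with one `invoke_subgraph` call site per layer, each receiving that layer's weights as inputs.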
Observations
- Each TransformerBlock becomes its own subgraph (look for `subgraph_0`, `subgraph_2`, ...). This is not what we want: we should see a single instance of `subgraph_0`, with multiple `invoke_subgraph` nodes all calling that same `subgraph_0` with different layer weights.
- Due to AC, we also get `hop.tag_activation_checkpoint(subgraph_1)`, where `subgraph_1` internally calls `invoke_subgraph` for the TransformerBlock, so we end up in a nested HOP/subgraph region (a toy sketch of this nesting follows the list).
- `dynamo_graph_capture` passes; we currently fail in `aot_export_joint`. Looks like a DTensor x Dynamo soft spot.
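To make the second observation concrete, here is a toy sketch of the nested HOP structure, using plain `torch.utils.checkpoint` as a stand-in for torchtitan's AC wrapping (the real failure additionally involves DTensor, which this sketch omits):

```python
import torch
from torch.utils.checkpoint import checkpoint


@torch.compiler.nested_compile_region
def block(x: torch.Tensor) -> torch.Tensor:
    # Stand-in for the TransformerBlock body.
    return torch.sin(x) * 2


@torch.compile(backend="eager", fullgraph=True)
def fn(x: torch.Tensor) -> torch.Tensor:
    # Dynamo traces checkpoint(...) into hop.tag_activation_checkpoint;
    # its subgraph then contains an invoke_subgraph node for `block`,
    # i.e. one HOP nested inside another HOP's subgraph.
    return checkpoint(block, x, use_reentrant=False)


out = fn(torch.randn(4, requires_grad=True))
```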
cc @williamwen42