Modular Code Generator: Complete Design Document
This PR provides a comprehensive design document for refactoring DaCe's code generation system from a monolithic structure into a modular, pass-based pipeline architecture using DaCe's existing Pass and Pipeline infrastructure.
Overview
The current code generation system is a complex monolithic subpackage that handles everything from analysis to code emission in a single traversal. This design document proposes breaking it down into discrete, composable passes that can be tested, verified, and extended independently.
Key Deliverables
1. Main Design Document (doc/codegen/modular_codegen_design.md)
- Current System Analysis: Comprehensive survey of 48+ files in the codegen subpackage
- 17 Candidate Passes: Complete decomposition of monolithic behaviors into discrete passes:
- Phase 1 (Analysis): TypeInference, LibraryExpansion, MetadataCollection, AllocationAnalysis, ControlFlowAnalysis, TargetAnalysis
- Phase 2 (Transformation): CopyToMap, StreamAssignment, TaskletLanguageLowering
- Phase 3 (CodeGeneration): StateStructCreation, AllocationCode, MemletLowering, FrameCodeGeneration, TargetCodeGeneration, HeaderGeneration
- Phase 4 (FileGeneration): SDFGSplitting, CodeObjectCreation
- Information Flow Schema: Structured `pipeline_results` dictionary for maximal information reuse (a sketch follows this list)
- Target Refactoring Strategy: Split CPU → (C++ base + OpenMP extension), generalize CUDA → (GPU base + CUDA specifics)
- New Organization: Separate `codegen/compiler` (build tools) from `codegen/passes` (generation passes)
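To make the schema concrete, here is a minimal sketch of how a pass could publish into and read from `pipeline_results`, assuming DaCe's existing `Pass` API from `dace.transformation.pass_pipeline` (the pass names and the per-array metadata below are illustrative only):

```python
from typing import Any, Dict, Optional

from dace import SDFG
from dace.transformation import pass_pipeline as ppl


class AllocationAnalysisPass(ppl.Pass):
    """Sketch: decide allocation lifetimes, reusing earlier analysis results."""

    def modifies(self) -> ppl.Modifies:
        return ppl.Modifies.Nothing  # pure analysis, does not change the SDFG

    def should_reapply(self, modified: ppl.Modifies) -> bool:
        # Re-run if data descriptors changed since the last application
        return bool(modified & ppl.Modifies.Descriptors)

    def apply_pass(self, sdfg: SDFG, pipeline_results: Dict[str, Any]) -> Optional[Dict[str, str]]:
        # Results of earlier passes are keyed by pass name in pipeline_results
        metadata = pipeline_results.get('MetadataCollectionPass', {})
        lifetimes: Dict[str, str] = {}
        for name, desc in sdfg.arrays.items():
            if desc.transient:
                # Illustrative decision based on the collected metadata
                lifetimes[name] = metadata.get(name, 'scope')
        # The return value becomes pipeline_results['AllocationAnalysisPass']
        return lifetimes
```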
2. Implementation Examples (doc/codegen/pass_implementation_examples.md)
- Concrete Pass Implementations: Python code for key passes like `MetadataCollectionPass`, `AllocationAnalysisPass`, `FrameCodeGenerationPass`
- Pipeline Configurations: Complete pipeline setups with conditional target-specific passes
- Backward Compatibility: Wrappers preserving the existing `generate_code()` API
- Performance Strategies: Caching, incremental updates, lazy evaluation
- Testing Framework: Unit test examples for individual passes and full pipelines (a sketch follows this list)
Benefits
- Modularity: Each pass has a single responsibility and clear interfaces
- Extensibility: Easy to add new passes or modify existing ones
- Testability: Individual passes can be unit tested in isolation
- Verifiability: Smaller, focused components are easier to verify
- Performance: Information reuse between passes, incremental compilation
- Maintainability: Clear separation of concerns and dependencies
Proposed Architecture
```python
from dace.transformation.pass_pipeline import Pipeline  # existing DaCe infrastructure


class CodeGenerationPipeline(Pipeline):
    def __init__(self):
        super().__init__([
            # Phase 1: Analysis
            TypeInferencePass(),
            MetadataCollectionPass(),
            AllocationAnalysisPass(),
            TargetAnalysisPass(),
            # Phase 2: Transformations
            CopyToMapPass(),
            # 'pass' is a reserved keyword in Python, so the wrapped pass is
            # given as 'wrapped' (see the ConditionalPass sketch below)
            ConditionalPass(condition=is_gpu, wrapped=StreamAssignmentPass()),
            # Phase 3: Code Generation
            FrameCodeGenerationPass(),
            TargetCodeGenerationPass(),
            # Phase 4: File Generation
            CodeObjectCreationPass(),
        ])
```
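`ConditionalPass` does not exist in DaCe today; a minimal sketch of such a wrapper, assuming the existing `Pass` API:

```python
from typing import Any, Callable, Dict, Optional

from dace import SDFG
from dace.transformation import pass_pipeline as ppl


class ConditionalPass(ppl.Pass):
    """Hypothetical wrapper that applies another pass only if a predicate holds."""

    def __init__(self, condition: Callable[[SDFG], bool], wrapped: ppl.Pass):
        super().__init__()
        self.condition = condition
        self.wrapped = wrapped

    def modifies(self) -> ppl.Modifies:
        return self.wrapped.modifies()

    def should_reapply(self, modified: ppl.Modifies) -> bool:
        return self.wrapped.should_reapply(modified)

    def apply_pass(self, sdfg: SDFG, pipeline_results: Dict[str, Any]) -> Optional[Any]:
        if self.condition(sdfg):
            return self.wrapped.apply_pass(sdfg, pipeline_results)
        return None  # Skipped; nothing to report
```

With this, `is_gpu` can be any predicate over the SDFG (e.g., checking whether any map is scheduled on a GPU device).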
Target Refactoring
Current issues addressed:
- "CPU" backend actually does OpenMP → Split into C++ base + OpenMP extension
- "CUDA" backend is GPU-specific → Generalize to GPU base + CUDA/HIP specializations
- Poor factoring between generic and specialized code
Proposed hierarchy:
```
TargetCodeGenerator
├── CppCodeGen → OpenMPCodeGen, MPICodeGen
├── GPUCodeGen → CUDACodeGen, HIPCodeGen, OpenCLCodeGen
├── FPGACodeGen → XilinxCodeGen, IntelFPGACodeGen
└── SpecializedCodeGen → SVECodeGen, MLIRCodeGen
```
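To illustrate the intended factoring, a specialization would override only what differs from its base. A rough sketch follows; the hook and method names are assumptions for illustration, not DaCe's current target API:

```python
from dace.codegen.targets.target import TargetCodeGenerator


class GPUCodeGen(TargetCodeGenerator):
    """Sketch: behavior shared by all GPU backends lives in the base class."""

    def kernel_qualifier(self) -> str:  # hypothetical backend-specific hook
        raise NotImplementedError

    def generate_kernel_signature(self, name: str, args: str) -> str:
        # Generic GPU logic, parameterized by the backend-specific hook
        return f'{self.kernel_qualifier()} void {name}({args})'


class CUDACodeGen(GPUCodeGen):
    def kernel_qualifier(self) -> str:
        return '__global__'


class HIPCodeGen(GPUCodeGen):
    def kernel_qualifier(self) -> str:
        return '__global__'  # HIP mirrors the CUDA kernel launch model
```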
Implementation Roadmap
14-week timeline with incremental phases:
- Weeks 1-2: Foundation and directory restructuring
- Weeks 3-4: Analysis passes extraction
- Weeks 5-6: Transformation passes implementation
- Weeks 7-8: Code generation passes development
- Weeks 9-10: Target refactoring and specialization
- Weeks 11-12: Integration testing and performance validation
- Weeks 13-14: Build system modernization and documentation
Backward Compatibility
The design maintains full backward compatibility:
- Existing `generate_code(sdfg)` API unchanged (a wrapper sketch follows this list)
- Legacy implementation preserved as fallback
- Incremental migration path for custom backends
- All current functionality preserved
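A minimal sketch of such a wrapper, assuming the `CodeGenerationPipeline` proposed above (the result key and fallback logic are illustrative):

```python
from typing import Any, Dict, List

import dace
from dace.codegen.codeobject import CodeObject


def generate_code(sdfg: dace.SDFG) -> List[CodeObject]:
    """Drop-in replacement for the existing entry point (sketch)."""
    try:
        results: Dict[str, Any] = {}
        CodeGenerationPipeline().apply_pass(sdfg, results)
        return results['CodeObjectCreationPass']
    except NotImplementedError:
        # Fall back to the legacy monolithic implementation
        from dace.codegen.codegen import generate_code as legacy_generate_code
        return legacy_generate_code(sdfg)
```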
This design document provides the complete foundation for transforming DaCe's code generation into a modern, modular, and extensible system while preserving all existing capabilities.
Fixes #2034.
Please do not merge until everyone has had a chance to comment, modify, and approve.
@copilot additionally, how would I create a new backend implementation (e.g., a new platform, or a different language like LLVM IR)? What would change in the design? Could you please add another document that explains how the modular code generation can be extended w.r.t. those two cases?
I've added a comprehensive extension guide (doc/codegen/extending_modular_codegen.md) that explains how to create new backend implementations for both new platforms and different languages. The document includes:
For new platforms (e.g., neuromorphic processors):
- Complete example showing target code generator implementation
- Platform-specific transformation passes
- Target registration and discovery
- Integration with the pipeline
For new languages (e.g., LLVM IR):
- Language-specific code generation
- Code object handling for new languages
- Analysis passes for language-specific optimizations
- Multi-target orchestration
The guide demonstrates how the modular design makes extensions straightforward through clear interfaces and automatic registration, with comprehensive testing examples and best practices.
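For reference, a skeleton of the registration pattern might look like this (a sketch assuming DaCe's decorator-based target registry; a real backend would still have to implement the full `TargetCodeGenerator` interface for allocation, copies, and scope generation):

```python
from dace import registry
from dace.codegen.targets.target import TargetCodeGenerator


@registry.autoregister_params(name='opencl')  # discovered by name at pipeline setup
class OpenCLCodeGen(TargetCodeGenerator):
    """Skeleton of a new backend; the TargetCodeGenerator methods for
    allocation, copies, and scope generation still need to be implemented."""

    target_name = 'opencl'
    title = 'OpenCL'
    language = 'cl'
```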
@copilot please address my latest review
Addressed all the latest review comments including fixing pass headers, updating target hierarchy, replacing the neuromorphic example with a comprehensive OpenCL example, and removing the LLVM IR example. See commit 4f14e86.
Made my pass on the proposal (and removed the example, which does not make sense in the new construction of the codegen. We will have to discuss that separately). I'll wait for others to comment before manually editing the doc again.
@mcopik please comment on the design document before next Monday if you can (based on your recent experiences), so that we can ensure the design is sound and enables you to work in a simpler and more verified manner. Thanks!
I think Phase 2 should involve a pass that decides on allocation lifetimes of the objects directly on the SDFG level, if I design the sketch on time I want to discuss some ideas about it on the DaCe meeting on 12.06.
Completely agreed, that's proposed as `AllocationPass` right now. Any amendments you wish to add there, @ThrudPrimrose?
Not right now, but I want to discuss some ideas in the DaCe meeting.
@tbennun A few simple comments (might be naive):
- This sentence makes no sense to me: "CUDA backend is GPU-specific, not general GPU"
- I'd add to the issue the incredible entanglement of the CPU and CPP backend. It is very difficult to say where one ends and the other begins.
- I initially wrote: "If we do not want to be restricted to a CUDA model, and we want to support SYCL (Intel GPUs), then maybe it would make sense to have a GPU backend + a CUDA backend that inherits from it? Alternatively, let's simply refer to the GPU backend as "CUDA" (since HIP ultimately implements almost the same API as CUDA)." This was based on the proposed refactoring, and I only learned that we want to provide such a split after reading "Target Hierarchy". It feels like the proposed refactoring includes only a subset of the changes we want to introduce.
- The timeline includes the step "2. Rename CUDA backend to GPU" - shouldn't it be a full refactoring?
Overall, it makes sense to me. I would add as an explicit goal to have a template/blueprint for adding a new target.
@mcopik I think it makes sense and I also like the SYCL example. Thanks!
I consider GPU stream assignment to be a scheduling transformation, just like the GPUOffloading / ToGPU transformation. I think these transformations should be performed before the codegen-preprocessing passes are executed. Is the document only about code generation, or do we combine it with the transformations that should have been applied before GPU offloading?
Like the scheduling passes, I think the allocation pass should also be performed before codegen (and the codegen-preprocessing passes), as these transformations are not codegen-specific. Many accelerator environments support offloading and stream assignment (SYCL for Intel GPUs, for example, also supports streams).
I think the following transformations can be the new Phase 1:
- AllocationPass
- CopyToMapPass
- StreamAssignmentPass (should not necessarily be hardcoded for GPUs; many programming environments support streams)
Phase 0 would involve transformations that make the SDFG more fit for the target accelerator. Consider the pattern Map[Range(1..N)] -> Tasklet -> Map[Range(1..N)]: if possible, duplicating the tasklet across the maps and fusing them would make sense for GPUs but not for CPUs, yet the transformation itself is not GPU-specific; one could apply it to an SDFG that has not been specialized for GPUs yet.
Then we would have Phase 2 (old Phase 1) for the analysis and set-up for codegen,
and the new Phase 3 (old Phase 2) for the accelerator-specific lowering passes that depend on the language (SYCL, CUDA, etc.).
Btw., I have a design document for the offloading pass (not restricted to GPUs) that sketches what the order of preparation transformations could look like before offloading. Would it make sense to integrate it into this document? Link to the Google doc: https://docs.google.com/document/d/13Pl4A8u5YJgTkZvhKCoJp8t693tvHeEYXfUvDtyonDI/edit?usp=sharing
Addressed your feedback including: clarifying CUDA/GPU backend language and scope of refactoring, adding mention of CPU/C++ backend entanglement issue, changing timeline to indicate full refactoring rather than just renaming, and adding explicit goal for template/blueprint for new targets. See commit 66be6e8.
Excellent points about the scope of transformations vs. codegen-specific passes. You're right that AllocationPass, CopyToMapPass, and StreamAssignmentPass could be viewed as pre-codegen transformations rather than codegen passes. This touches on a fundamental architectural question about where the boundary lies between SDFG transformation and code generation. Your proposed phase structure makes sense and aligns with having clear separation of concerns. I'd suggest this becomes part of the broader discussion about the design scope in the upcoming DaCe meeting.
All comments addressed in #2071. The work on this PR is superseded by and cherry-picked into that PR.