celerity-runtime
                        Profile and optimize frontend (from CGF submission to serialized graph)
#107 has frontend benchmarks that do not instantiate the entire runtime.
Low-hanging fruit:
- Currently, 13% of the time is spent in iostreams creating the CDAG debug labels. Maybe this should happen during graph printing rather than generation (there is already a PoC implementation for that).
- malloc/free pairs have a significant impact deep in dependency checking. We can potentially eliminate many convenience allocations (`get_accessed_buffers` et al.).
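A minimal sketch of moving label formatting out of graph generation. It assumes a simplified command record; `command`, `make_command`, `debug_label` and `print_graph` are hypothetical names, not Celerity's actual types. Generation stores only the structured fields, and the `std::ostringstream` work happens only when a graph is printed.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical, simplified command record: generation stores only cheap
// structured fields, no preformatted debug label.
struct command {
    std::uint64_t cid;
    std::uint64_t tid;
    int node;
};

// Generation path: no iostreams involved.
command make_command(std::uint64_t cid, std::uint64_t tid, int node) {
    return command{cid, tid, node};
}

// Label formatting is deferred until a graph is actually printed.
std::string debug_label(const command& cmd) {
    std::ostringstream oss;
    oss << "C" << cmd.cid << " on N" << cmd.node << " (T" << cmd.tid << ")";
    return oss.str();
}

// Emits a minimal Graphviz-style listing, building labels on demand.
std::string print_graph(const std::vector<command>& cmds) {
    std::ostringstream oss;
    oss << "digraph G{";
    for (const auto& cmd : cmds) {
        oss << cmd.cid << "[label=\"" << debug_label(cmd) << "\"];";
    }
    oss << "}";
    return oss.str();
}
```

With this split, benchmarks that never print a graph pay nothing for labels.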
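A sketch of replacing a set-returning convenience accessor with an allocation-free visitor. `access_map`, `buffer_id` and `for_each_accessed_buffer` are hypothetical stand-ins (only the `get_accessed_buffers` name comes from the issue), and the linear-scan de-duplication assumes a task has only a few accessors.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <vector>

using buffer_id = std::size_t;

// Hypothetical, simplified access map: one entry per accessor, so a buffer
// accessed through several accessors appears several times.
struct access_map {
    std::vector<buffer_id> accessor_bids;

    // Convenience API: heap-allocates a fresh set (and its nodes) on every call.
    std::unordered_set<buffer_id> get_accessed_buffers() const {
        return std::unordered_set<buffer_id>(accessor_bids.begin(), accessor_bids.end());
    }

    // Allocation-free alternative: invokes f once per distinct buffer.
    // The quadratic de-duplication assumes only a handful of accessors per task.
    template <typename F>
    void for_each_accessed_buffer(F&& f) const {
        for (std::size_t i = 0; i < accessor_bids.size(); ++i) {
            bool seen_before = false;
            for (std::size_t j = 0; j < i; ++j) {
                if (accessor_bids[j] == accessor_bids[i]) {
                    seen_before = true;
                    break;
                }
            }
            if (!seen_before) f(accessor_bids[i]);
        }
    }
};
```

Callers in the dependency-checking hot path would pass a lambda instead of iterating over a temporary set.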
Requires further investigation:
- Use a bump allocator for intrusive graphs to improve locality, e.g. a heuristically sized pool per horizon plus a fallback as needed (maybe also for the dependency vectors?).
- Command / task map rehashing is also a noticeable factor, but memory is abundant, so `reserve()` them to avoid stalls.
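A minimal, hand-rolled sketch of the per-horizon pool idea. `bump_arena` is a hypothetical class, and the chunk size stands in for whatever heuristic would be chosen. Allocation only advances a cursor inside the current chunk, so consecutively created nodes end up adjacent in memory; when the estimate is exceeded, a new chunk is opened as the fallback. The C++17 `std::pmr::monotonic_buffer_resource` offers similar semantics (initial buffer plus upstream fallback) out of the box.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <new>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical per-horizon arena: bump-allocates from a heuristically sized
// chunk and falls back to extra chunks once the estimate is exceeded.
// All memory is released at once when the arena is destroyed.
class bump_arena {
  public:
    explicit bump_arena(std::size_t chunk_bytes) : m_chunk_bytes(chunk_bytes) { add_chunk(chunk_bytes); }

    void* allocate(std::size_t size, std::size_t align) {
        void* p = m_cursor;
        std::size_t space = m_remaining;
        if (std::align(align, size, p, space) == nullptr) {
            // Fallback: the size heuristic was too small, open another chunk.
            add_chunk(std::max(m_chunk_bytes, size + align));
            p = m_cursor;
            space = m_remaining;
            std::align(align, size, p, space); // cannot fail: chunk holds size + align bytes
        }
        m_cursor = static_cast<char*>(p) + size;
        m_remaining = space - size;
        return p;
    }

    // Constructs a T in the arena. Destructors are never run, so only
    // trivially destructible types are accepted.
    template <typename T, typename... Args>
    T* create(Args&&... args) {
        static_assert(std::is_trivially_destructible<T>::value, "arena never runs destructors");
        return new (allocate(sizeof(T), alignof(T))) T(std::forward<Args>(args)...);
    }

    std::size_t chunk_count() const { return m_chunks.size(); }

  private:
    void add_chunk(std::size_t bytes) {
        m_chunks.push_back(std::unique_ptr<char[]>(new char[bytes]));
        m_cursor = m_chunks.back().get();
        m_remaining = bytes;
    }

    std::size_t m_chunk_bytes;
    std::vector<std::unique_ptr<char[]>> m_chunks;
    char* m_cursor = nullptr;
    std::size_t m_remaining = 0;
};
```

Because destructors never run, nodes that own heap memory (for instance a `std::vector` of dependencies) would need either explicit destruction or an allocator that draws from the same arena.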
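A sketch of the `reserve()` suggestion. It assumes the task map is a `std::unordered_map` owned by a hypothetical `task_registry`, and that some per-horizon estimate of the task count is available (`expected_tasks` is an assumed heuristic, not an existing Celerity parameter). `std::unordered_map::reserve(n)` sets the bucket count so that `n` elements fit without exceeding the maximum load factor, so inserting up to that many entries never triggers a rehash.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <unordered_map>
#include <utility>

using task_id = std::uint64_t;
struct task { task_id tid; };

// Hypothetical owner of the task map; expected_tasks is an assumed heuristic.
class task_registry {
  public:
    explicit task_registry(std::size_t expected_tasks) {
        // Allocate enough buckets up front so that inserting expected_tasks
        // entries does not rehash (and stall) in the middle of graph generation.
        m_tasks.reserve(expected_tasks);
    }

    task& add(task_id tid) {
        auto t = std::make_unique<task>(task{tid});
        task& ref = *t;
        m_tasks.emplace(tid, std::move(t));
        return ref;
    }

    std::size_t size() const { return m_tasks.size(); }
    std::size_t bucket_count() const { return m_tasks.bucket_count(); }

  private:
    std::unordered_map<task_id, std::unique_ptr<task>> m_tasks;
};
```

Overestimating only costs some unused buckets, which fits the observation that memory is abundant.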