[Profiling] Tracker Issue for Profiling first steps
This issue lists out steps for profiling! (Mostly so I can organize my TODOs.) Will update as I move along.
Inspections & QoL improvements to profiler
- Running the profiler on more (big) programs
- [x] Make a test suite for profiling
- [ ] Use Cider2 benchmarks for "real programs"
- [ ] Brainstorm ways to give actionable feedback for bigger programs
- [ ] Fix profiler tests CI
- Have runt test print out more things to properly debug
- Inspections
- [x] Look through the waveforms for the "weird behavior"/mystery cycles
- [x] Minimization to see different behaviors
- [x] For any FSM-managed group, collect two pieces of info and report both to the users. The diff would display the mystery redundant cycles that Calyx is consuming.
- ground truth (`go`, `done` ports)
- FSM (what Calyx says is allowed to run)
- QoL improvements:
- [ ] Connect invokes & pars (cond groups for whiles?) with user identifiable info (line numbers?)
- Visualizations
- [x] Check out README: https://github.com/Auterion/embedded-debug-tools/tree/main/ext/orbetto
- [x] Check out: https://ui.perfetto.dev/
- [x] Check out: https://profiler.firefox.com/
- [x] Make a first-pass visualization for cycle counts
- [x] Find tools that display flame graphs (rather than a timeline view)
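A rough sketch of the ground-truth vs. FSM comparison from the Inspections list above. It assumes we already have, for one group, the cycle indices where the group is really running (from its `go`/`done` ports) and the cycle indices the FSM allots to it. The function name and the example numbers are made up for illustration, and treating the "mystery" cycles as FSM cycles minus ground-truth cycles is my assumption about which direction the diff goes.

```python
def mystery_cycles(ground_truth_cycles, fsm_cycles):
    """Cycles the FSM allots to a group in which the group is not actually running."""
    return sorted(set(fsm_cycles) - set(ground_truth_cycles))


# Hypothetical data for one group: the FSM holds the group's state for
# cycles 3..8, but the go/done ports show it active only in cycles 3..6.
fsm_cycles = list(range(3, 9))
ground_truth_cycles = [3, 4, 5, 6]

extra = mystery_cycles(ground_truth_cycles, fsm_cycles)
report = {
    "ground_truth": len(ground_truth_cycles),
    "fsm": len(fsm_cycles),
    "mystery": extra,
}
print(report)
```

Reporting both counts next to the diff keeps the two sources visible, so a user can see at a glance how many cycles the FSM spends on a group beyond what the group itself uses.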
First Pass: Cycle-level performance info at the Calyx level
- Metadata generation
- [x] Print JSON from TDCC (add another pass option to print JSON instead of the dump)
- [x] Write JSON to file
- [x] Instead of hacking through the enable assignment, we directly keep track of group to FSM state mappings
- [x] Refactor this by directly building a `FSMStateInfo` when processing enables.
- [x] Fix JSON emission to output a single JSON file at the end (when there are multiple TDCC groups, like in `language-tutorial-iterate`, the individual TDCC FSMs overwrite each other)
- [x] ~Right now (for optimization purposes?) the first group is morphed with the setup. Want to differentiate for more accurate counts of the first group.~
- [ ] Merge `dump-fsm` and `dump-fsm-json` for TDCC
- [x] Add FSM name information to JSON
- [x] If the par arm/component does not yield a FSM, need to output corresponding information (check `go` and `done` instead!)
- [x] We want information about parentage (if a FSM is managing a par arm, we want to know what the par itself is)?
- Loading in the trace
- [x] Figure out what tool to use?
- Kevin's Wellen library for Surfer
- Some Python libraries for a first pass:
- pyDigitalWaveTools
- pyvcd
- vcdvcd :heavy_check_mark:
- [x] Make first pass script for reading vcd and outputting group lengths based on FSM values
- [x] Remove assumption that there is only one FSM
- [x] Remove assumption that each cycle takes 10ms (have a counter mechanism of how many cycles passed between X ms and Y ms)
- [x] Sample signals on rising/falling clock edge (comment)
- [x] Check out example programs with parallelism
- [x] Produce summary: compute the total cycles that a given group was active, the number of times it was active (the number of segments), and the average running time (which is just the quotient of the previous two values).
- [x] Multi-component programs:
- [x] Update TDCC to write one JSON file reflecting all components
- [x] Output cell names info using a backend instead of TDCC?
- [x] Fix hardcoding of `"TOP.TOP.main.go"`
- [ ] Find edge cases where timing info is not actionable
- [x] Don't start counting clock cycles until `main.go` is 1
- Make flame graphs
- [ ] There is probably a library out there to generate a flame graph.
- Flame graphs resource: https://www.brendangregg.com/flamegraphs.html
- JavaScript library: https://github.com/spiermar/d3-flame-graph
- https://profiler.firefox.com/
- https://ui.perfetto.dev/
- [x] Write wrapper script around the pipeline
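A minimal sketch of the summary step from the "Loading in the trace" list above: total active cycles, number of segments, and average segment length. It assumes the trace has already been reduced to one boolean per clock cycle for a given group. The `summarize` name and the dictionary keys are hypothetical, not what the actual script uses.

```python
def summarize(active):
    """Summarize a per-cycle activity list (one bool per clock cycle) for a group."""
    total_cycles = sum(1 for a in active if a)
    segments = 0
    prev = False
    for a in active:
        # A new segment starts whenever the group goes from inactive to active.
        if a and not prev:
            segments += 1
        prev = a
    average = total_cycles / segments if segments else 0.0
    return {"total_cycles": total_cycles, "segments": segments, "average": average}


# Hypothetical group that runs for 2 cycles, idles for 1, then runs for 4.
activity = [False, True, True, False, True, True, True, True, False]
summary = summarize(activity)
print(summary)
```

Guarding the division keeps groups that never ran from raising a `ZeroDivisionError`.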
Thanks for opening this @ayakayorihiro! Could you add the "Tracker" label to this issue?
Thanks @rachitnigam ! Just added the tracker label, will keep in mind for next time :)
> Remove assumption that each cycle takes 10ms (have a counter mechanism of how many cycles passed between X ms and Y ms)
For synchronous designs like the ones Calyx produces, I generally recommend sampling signals on a rising or falling clock edge (depending on how the testbench works). That way you stay independent of the actual timing. Here is how I find the sample point in a Rust implementation: https://github.com/ekiwi/rtl-repair/blob/71e1afc0b9a2327d008b46acd415cf3f0343a938/scripts/osdd/src/main.rs#L113
Similar thing, but with the vcdvcd library in Python:
https://github.com/ekiwi/rtl-repair/blob/861e244c599e682efe5dbd8e3295c3b8e3590a34/scripts/calc_osdd.py#L215
https://github.com/ekiwi/rtl-repair/blob/861e244c599e682efe5dbd8e3295c3b8e3590a34/scripts/calc_osdd.py#L195
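For readers who do not want to click through, here is a small standalone sketch of the edge-sampling idea. It is not the code from the links above. It assumes each signal is available as a sorted list of `(time, value)` change records with string values like `"0"` and `"1"`, and all function names and example data are made up for illustration.

```python
from bisect import bisect_right


def rising_edges(clk_changes):
    """Return the timestamps at which the clock goes from '0' to '1'."""
    edges = []
    prev = None
    for time, value in clk_changes:
        if prev == "0" and value == "1":
            edges.append(time)
        prev = value
    return edges


def value_at(changes, time):
    """Return the latest value recorded at or before `time` ('x' if none yet)."""
    times = [t for t, _ in changes]
    idx = bisect_right(times, time) - 1
    return changes[idx][1] if idx >= 0 else "x"


def sample_on_rising_edges(clk_changes, signal_changes):
    """Sample a signal once per clock cycle, at each rising clock edge."""
    return [value_at(signal_changes, t) for t in rising_edges(clk_changes)]


# Hypothetical trace: clock with period 10, `go` high from time 12 to time 22.
clk = [(0, "0"), (5, "1"), (10, "0"), (15, "1"), (20, "0"), (25, "1")]
go = [(0, "0"), (12, "1"), (22, "0")]

samples = sample_on_rising_edges(clk, go)
active_cycles = samples.count("1")
print(samples, active_cycles)
```

Counting samples instead of dividing timestamps by a fixed period means the cycle count no longer depends on the timescale. Note that `value_at` includes changes recorded at the same timestamp as the edge, so whether the rising or falling edge is the better sample point depends on how the testbench drives the signals.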
Thanks @ekiwi ! I'll take a stab following your work with the vcdvcd library :)