gh-119692: Add Total UOp 'cost' to PyStats Output
This PR adds an additional table to the output from summarize_stats.py: for each UOp, the product of (# of times the UOp was executed) * (length of that UOp in machine code), sorted by this value. This makes it clear how much time* is being spent in each UOp, as opposed to just which ones are most frequently executed.
*Machine instruction count is only a rough proxy for time, but a very easy one to calculate.
The new table looks like this:
Total Machine Instruction Counts per UOp
| Name | Product | Self | Cumulative | Count | Length (Machine Instructions) |
|---|---|---|---|---|---|
| _COLD_EXIT | 741,336 | 14.4% | 14.4% | 1,173 | 632 |
| _TIER2_RESUME_CHECK | 436,914 | 8.5% | 22.8% | 2,511 | 174 |
| _STORE_FAST_0 | 389,712 | 7.6% | 30.4% | 2,118 | 184 |
| _BINARY_OP_ADD_INT | 327,339 | 6.3% | 36.7% | 983 | 333 |
| _START_EXECUTOR | 231,442 | 4.5% | 41.2% | 1,193 | 194 |
| _LOAD_FAST_0 | 225,144 | 4.4% | 45.6% | 1,416 | 159 |
| ... | ... | ... | ... | ... | ... |
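For illustration, the computation behind this table can be sketched roughly as follows. This is not the actual summarize_stats.py code; `cost_table`, `exec_counts`, and `code_lengths` are hypothetical names:

```python
# Sketch: total machine-instruction "cost" per UOp, sorted descending,
# with Self and Cumulative percentages as in the table above.
# `exec_counts` and `code_lengths` are hypothetical dicts keyed by UOp name.

def cost_table(exec_counts, code_lengths):
    rows = []
    for name, count in exec_counts.items():
        length = code_lengths.get(name, 0)
        rows.append((name, count * length, count, length))
    # Sort by Product = Count * Length, largest first
    rows.sort(key=lambda r: r[1], reverse=True)

    total = sum(r[1] for r in rows) or 1
    cumulative = 0.0
    table = []
    for name, product, count, length in rows:
        self_pct = 100.0 * product / total
        cumulative += self_pct
        table.append((name, product, self_pct, cumulative, count, length))
    return table
```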
Closes #119692. Tagging @brandtbucher as the requester of this feature, and @mdboom for pystats visibility.
- Issue: gh-119692
Thanks for the feedback @brandtbucher! I've reworked things so that the code size (and data size) of each stencil is dumped as part of the stats, and the summarize script picks them up from there. I've also renamed and re-ordered the table fields to clarify what is being shown.
I've left the Self and Cumulative columns for now: they show the percentage of all JIT-executed bytes attributable to the current UOp (and the running total of the same), so they're not just repeating values from an earlier table. But I'm happy to remove or re-label them if that doesn't actually seem useful.
The new table with some sample data looks like:
Total Bytes Executed per JIT'ed UOp
| Name | Count | Stencil Size (Bytes) | Total Size | Self (Total Size) | Cumulative (Total Size) |
|---|---|---|---|---|---|
| _COLD_EXIT | 23,808 | 447 | 10,642,176 | 30.9% | 30.9% |
| _STORE_NAME | 23,800 | 259 | 6,164,200 | 17.9% | 48.7% |
| _START_EXECUTOR | 23,808 | 170 | 4,047,360 | 11.7% | 60.5% |
| _EXIT_TRACE | 23,808 | 151 | 3,595,008 | 10.4% | 70.9% |
| _ITER_NEXT_RANGE | 23,800 | 86 | 2,046,800 | 5.9% | 76.8% |
| _ITER_CHECK_RANGE | 23,808 | 82 | 1,952,256 | 5.7% | 82.5% |
| _CHECK_VALIDITY | 23,800 | 76 | 1,808,800 | 5.2% | 87.7% |
| _GUARD_NOT_EXHAUSTED_RANGE | 23,808 | 72 | 1,714,176 | 5.0% | 92.7% |
| _TIER2_RESUME_CHECK | 23,808 | 66 | 1,571,328 | 4.6% | 97.2% |
| _SET_IP | 23,800 | 40 | 952,000 | 2.8% | 100.0% |
** Updated - see below **
~Right now, there's a bit of a kludge in load_raw_data so that the stencil lengths don't get summed. The lines containing the info about the code stencils look like uops[_MATCH_KEYS].code_size : 74, and this snippet makes sure they're just recorded, not summed.~
# Data about JIT stencils isn't cumulative
if "code_size" in key or "data_size" in key:
stats[key.strip()] = int(value)
else:
stats[key.strip()] += int(value)
~I can see breaking this data out under a new prefix (uops[_MATCH_KEYS].data.XXX maybe, and looking for data in the key?) and reworking how it's loaded, if that seems cleaner?~
I made a small format change - keys that have metadata in them should simply be set across the input files (instead of summed). So the dumped stencil-length data looks like:
uops[_CONVERT_VALUE].metadata.code_size : 227
uops[_CONVERT_VALUE].metadata.data_size : 280
uops[_COPY].metadata.code_size : 137
uops[_COPY].metadata.data_size : 216
uops[_COPY_FREE_VARS].metadata.code_size : 396
uops[_COPY_FREE_VARS].metadata.data_size : 480
...
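The loading rule described above could be sketched like this (a minimal illustration, not the actual load_raw_data implementation; `merge_line` is a hypothetical name):

```python
from collections import Counter

# Sketch: keys containing ".metadata." describe the build (stencil sizes),
# so they are set rather than accumulated; everything else is a counter
# that should be summed across the input stats files.

def merge_line(stats: Counter, key: str, value: str) -> None:
    key = key.strip()
    if ".metadata." in key:
        stats[key] = int(value)   # set: identical in every stats file
    else:
        stats[key] += int(value)  # sum: cumulative execution counters
```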
Ideally, I think there would be some checking that these values are consistent across all the stats files; currently the last value found just wins. That's still a bit of a kludge, but since I'd guess it's rare to have stats files hanging around from multiple builds with different JIT stencils, perhaps this is fine for now?
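If that check ever seems worth adding, it could look something like this (a sketch only; `check_metadata_consistent` and the per-file-dicts shape are assumptions, not the current code):

```python
# Sketch: flag metadata keys whose values disagree across stats files,
# which would indicate the files came from different JIT builds.

def check_metadata_consistent(per_file_stats):
    """per_file_stats: one dict of {key: value} per loaded stats file."""
    seen = {}
    conflicts = []
    for stats in per_file_stats:
        for key, value in stats.items():
            if ".metadata." not in key:
                continue  # only build metadata needs to agree
            if key in seen and seen[key] != value and key not in conflicts:
                conflicts.append(key)
            seen[key] = value
    return conflicts
```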