gh-119692: Add Total UOp 'cost' to PyStats Output
This PR adds an additional table to the output from summarize_stats.py: for each UOp, the product of (# of times the UOp was executed) * (length of that UOp in machine code), sorted by this value. This makes it clear how much time* is being spent in each UOp, as opposed to just which ones are most frequently executed.
*Machine instruction count is only a rough proxy for time, but a very easy one to calculate.
The new table looks like this:
Total Machine Instruction Counts per UOp
| Name | Product | Self | Cumulative | Count | Length (Machine Instructions) |
|---|---|---|---|---|---|
| _COLD_EXIT | 741,336 | 14.4% | 14.4% | 1,173 | 632 |
| _TIER2_RESUME_CHECK | 436,914 | 8.5% | 22.8% | 2,511 | 174 |
| _STORE_FAST_0 | 389,712 | 7.6% | 30.4% | 2,118 | 184 |
| _BINARY_OP_ADD_INT | 327,339 | 6.3% | 36.7% | 983 | 333 |
| _START_EXECUTOR | 231,442 | 4.5% | 41.2% | 1,193 | 194 |
| _LOAD_FAST_0 | 225,144 | 4.4% | 45.6% | 1,416 | 159 |
| ... | ... | ... | ... | ... | ... |
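For illustration, the computation behind this table can be sketched roughly as follows. This is not the actual summarize_stats.py code; `cost_table`, `exec_counts`, and `code_lengths` are hypothetical names:

```python
# Sketch: total machine-instruction "cost" per UOp, sorted descending,
# with Self and Cumulative percentages as in the table above.
# `exec_counts` and `code_lengths` are hypothetical dicts keyed by UOp name.

def cost_table(exec_counts, code_lengths):
    rows = []
    for name, count in exec_counts.items():
        length = code_lengths.get(name, 0)
        rows.append((name, count * length, count, length))
    # Sort by Product = Count * Length, largest first
    rows.sort(key=lambda r: r[1], reverse=True)

    total = sum(r[1] for r in rows) or 1
    cumulative = 0.0
    table = []
    for name, product, count, length in rows:
        self_pct = 100.0 * product / total
        cumulative += self_pct
        table.append((name, product, self_pct, cumulative, count, length))
    return table
```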
Closes #119692. Tagging @brandtbucher as the requester of this feature, and @mdboom for pystats visibility.
- Issue: gh-119692
Thanks for the feedback @brandtbucher! I've reworked things so that the code size (and data size) of each stencil is dumped as part of the stats, and the summarize script picks them up from there. I've also renamed and re-ordered the table fields to clarify what is being shown.
I've left the Self and Cumulative columns for now: they show the percentage of all JIT-executed bytes attributable to the current UOp (and the running total of the same), so they're not just repeating values from an earlier table. But I'm happy to remove or re-label them if that doesn't actually seem useful.
The new table with some sample data looks like:
Total Bytes Executed per JIT'ed UOp
| Name | Count | Stencil Size (Bytes) | Total Size | Self (Total Size) | Cumulative (Total Size) |
|---|---|---|---|---|---|
| _COLD_EXIT | 23,808 | 447 | 10,642,176 | 30.9% | 30.9% |
| _STORE_NAME | 23,800 | 259 | 6,164,200 | 17.9% | 48.7% |
| _START_EXECUTOR | 23,808 | 170 | 4,047,360 | 11.7% | 60.5% |
| _EXIT_TRACE | 23,808 | 151 | 3,595,008 | 10.4% | 70.9% |
| _ITER_NEXT_RANGE | 23,800 | 86 | 2,046,800 | 5.9% | 76.8% |
| _ITER_CHECK_RANGE | 23,808 | 82 | 1,952,256 | 5.7% | 82.5% |
| _CHECK_VALIDITY | 23,800 | 76 | 1,808,800 | 5.2% | 87.7% |
| _GUARD_NOT_EXHAUSTED_RANGE | 23,808 | 72 | 1,714,176 | 5.0% | 92.7% |
| _TIER2_RESUME_CHECK | 23,808 | 66 | 1,571,328 | 4.6% | 97.2% |
| _SET_IP | 23,800 | 40 | 952,000 | 2.8% | 100.0% |
** Updated - see below **
~Right now, there's a bit of a kludge in load_raw_data so that the stencil lengths don't get summed. The lines containing the info about the code stencils look like uops[_MATCH_KEYS].code_size : 74, and this snippet makes sure they're just recorded, not summed.~
# Data about JIT stencils isn't cumulative
if "code_size" in key or "data_size" in key:
stats[key.strip()] = int(value)
else:
stats[key.strip()] += int(value)
~I can see breaking this data out under a new prefix (uops[_MATCH_KEYS].data.XXX maybe, and looking for data in the key?) and reworking how it's loaded, if that seems cleaner?~
I made a small format change - keys that have metadata in them should simply be set across the input files (instead of summed). So the dumped stencil-length data looks like:
uops[_CONVERT_VALUE].metadata.code_size : 227
uops[_CONVERT_VALUE].metadata.data_size : 280
uops[_COPY].metadata.code_size : 137
uops[_COPY].metadata.data_size : 216
uops[_COPY_FREE_VARS].metadata.code_size : 396
uops[_COPY_FREE_VARS].metadata.data_size : 480
...
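The loading rule described above could be sketched like this (a minimal illustration, not the actual load_raw_data implementation; `merge_line` is a hypothetical name):

```python
from collections import Counter

# Sketch: keys containing ".metadata." describe the build (stencil sizes),
# so they are set rather than accumulated; everything else is a counter
# that should be summed across the input stats files.

def merge_line(stats: Counter, key: str, value: str) -> None:
    key = key.strip()
    if ".metadata." in key:
        stats[key] = int(value)   # set: identical in every stats file
    else:
        stats[key] += int(value)  # sum: cumulative execution counters
```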
Ideally, I think there would be some checking that these values are consistent across all the stats files; currently the last value found just wins. That's still a bit of a kludge, but since I'd guess it's rare to have stats files hanging around from multiple builds with different JIT stencils, perhaps this is fine for now?
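If that check ever seems worth adding, it could look something like this (a sketch only; `check_metadata_consistent` and the per-file-dicts shape are assumptions, not the current code):

```python
# Sketch: flag metadata keys whose values disagree across stats files,
# which would indicate the files came from different JIT builds.

def check_metadata_consistent(per_file_stats):
    """per_file_stats: one dict of {key: value} per loaded stats file."""
    seen = {}
    conflicts = []
    for stats in per_file_stats:
        for key, value in stats.items():
            if ".metadata." not in key:
                continue  # only build metadata needs to agree
            if key in seen and seen[key] != value and key not in conflicts:
                conflicts.append(key)
            seen[key] = value
    return conflicts
```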