Alexander Monakov
Refcounting pbuffers to detect the problem is too conservative since we can't decrement the counter if a thread simply exits. Using windows instead of pbuffers is problematic since GL is...
It seems some lectures use material from "The Missing Semester of Your CS Education" MIT course, compare e.g. Git lecture notes: https://missing.csail.mit.edu/2020/version-control/ vs. [current version in this repo](https://github.com/danlark1/hse_missing_cs_education/tree/ee3ba421107678c156f75695f843278b05fcbd23/version_control). If so,...
This is a report regarding the uops.info table, specifically latency figures for in-place zero extension. There are separate experiments for `mov r32, ` (latency 0) and `mov r32, ` (latency...
Let's catch up with uops.info additions.
As far as I see, it is neither funny nor helpful, and it gives no actionable feedback to the reporter. I am referring to instances where a response to a...
Integer `pcmpeq*` with source=dest sets destination to all-ones without dependency on source (but still occupies an execution unit). For example, the following loop runs at one cycle per iteration on...
At the moment uiCA removes p0 as a possible execution port on hsw/skl for a branch early on (https://github.com/andreas-abel/uiCA/blob/9cbbe931247f45f756738cf35800b5e8dff7bbb0/convertXML.py#L95-L96). I'd like to suggest that this should be done only for...
On Zen 4, the summary of `vpternlogd` latency experiments is given as: Latency operand 1 → 1: 1; Latency operand 2 → 1: 2; Latency operand 3 → 1: 1 (https://uops.info/html-lat/ZEN4/VPTERNLOGD_ZMM_ZMM_ZMM_I8-Measurements.html)...
You note the effect for Skylake on the wiki ("Minimum store-forwarding latency is 3 on new(ish) chips, but the load has to arrive at exactly the right time to achieve...