dynamorio icon indicating copy to clipboard operation
dynamorio copied to clipboard

DrPoints: a DynamoRIO client for Basic Block Vector (BBV) collection

Open edeiana opened this issue 2 months ago • 2 comments

This issue tracks the implementation of DrPoints, a new DynamoRIO client for collecting Basic Block Vectors (BBVs) of a program execution. A Basic Block Vector (BBV) is a histogram of the execution frequencies of basic blocks (BBs) within a specific interval of instructions (typically 100 million instructions). BBVs are used for phase detection and clustering algorithms to generate SimPoints, which are representative samples of a program's execution.

Since we plan to use the SimPoint Toolkit provided by the SimPoint authors, we want DrPoints to output a BBV file that follows the required format, which is the following:

T:BB_X:TimesExecuted :BB_Y:TimesExecuted ... :BB_Z:TimesExecuted
T:BB_X:TimesExecuted :BB_Y:TimesExecuted ... :BB_Z:TimesExecuted
...
T:BB_X:TimesExecuted :BB_Y:TimesExecuted ... :BB_Z:TimesExecuted

Each line, starting with a "T", represents an instruction interval and contains pairs of a basic block identifier and the number of times it was executed within that interval (times the BB size).

The proposed implementation is a new DynamoRIO client that we call DrPoints. The client will have dr_client_main() register two main events:

  • drmgr_register_bb_instrumentation_event() to instrument basic blocks.
  • drmgr_register_exit_event() to save the collected BBV data to a file upon program termination.

The basic block instrumentation event inserts a clean call to a custom function for each basic block executed (a clean call is expensive, so we plan to inline the counter updates described below and jump to the clean call only when we reach the end of an instruction interval). A global hashtable_t stores the execution frequency of each basic block, with a unique BB ID (sequential, starting from 1, as required by the SimPoint Toolkit) derived from the Program Counter (PC) of one the BB instructions as the key. A global instruction counter tracks the number of executed instructions. When the counter reaches the user-defined interval (e.g., 100 million instructions), the content of the hashtable_t will be stored in a drvector_t, representing the BBV for that interval as a sequence of pairs <BB_ID, Times_Executed>. The counter and hashtable will then be reset for the next interval.

Note that DynamoRIO's basic blocks differ from the classic, compiler definition of basic blocks: static, one-entry one-exit sequence of instructions. Because DynamoRIO doesn't know all branch targets ahead of time, which is required for the compiler definition of basic block, it will tail-duplicate a basic block with a target in the middle if that target is not seen until later on in the execution of the program. We don't expect this to drastically change the resulting SimPoints, but this the main difference of DrPoints compared to the original definition of BBV, so we note it.

edeiana avatar Oct 17 '25 00:10 edeiana

A global hashtable_t stores the execution frequency of each basic block, with a unique BB ID (sequential, starting from 1, as required by the SimPoint Toolkit) derived from the Program Counter (PC) of one the BB instructions as the key.

Generally DR tools use module offsets instead of absolute PC values to uniquely identify library code, to avoid complications from varying load addresses across (and potentially within) runs: the drmodtrack library can be used for that.

There are complexities in supporting general applications. One PC does not always point to the same instruction: unloaded and reloaded libraries (drmodtrack helps here), or dynamically generated code (not always limited to first-class JITs: gcc nested functions; COM trampolines; etc. can end up with small generated snippets in the absence of any real JIT).

derekbruening avatar Oct 17 '25 17:10 derekbruening

Thanks for the note. I put a TODO (among others) in the PR for the initial implementation.

edeiana avatar Oct 18 '25 05:10 edeiana