
Profiling into a buffer

AlexandreEichenberger opened this issue 4 months ago

The goal of this PR is to eliminate the uncertainty around profiling ONNX-MLIR models.

This PR slightly changes the behavior of -profile-ir. That option now provides only timing information. To also get the signatures of the operations (i.e., the tensor sizes at runtime), use -profile-ir-with-sig. The same option values apply to both (namely ONNX, ZHigh, or possibly ZLow).

The buffering is done via a record that keeps track of 2 strings (operation and node name), 2 tags (for before/after), and 2 timestamps. The record is named TimeRecord in OMInstrument.inc. Normally, the buffer is printed by a call to the externally visible omInstrumentPrint() function. That interface is now also exposed in Python as the print_instrumentation method on our execution session objects. This call was added to RunONNXModel.py and can be seen there.
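A minimal Python sketch of what such a record holds, per the description above (the field names here are illustrative; the actual definitions live in OMInstrument.inc):

```python
import time
from dataclasses import dataclass

@dataclass
class TimeRecord:
    # Two strings identifying the instrumented operation.
    op_name: str
    node_name: str
    # Two tags marking the "before" and "after" events.
    is_before: bool
    is_after: bool
    # Two timestamps (seconds); their difference is the op's elapsed time.
    before_time: float = 0.0
    after_time: float = 0.0

    def elapsed(self) -> float:
        return self.after_time - self.before_time

# Fill one record around a (stand-in) operation.
rec = TimeRecord("onnx.MatMul", "matmul_0", True, True)
rec.before_time = time.perf_counter()
sum(range(1000))  # stand-in for the timed operation
rec.after_time = time.perf_counter()
assert rec.elapsed() >= 0.0
```

Storing raw timestamps and differencing them only at print time keeps the per-event work during inference to a minimum, which is the point of buffering.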

So ideally, the buffer is filled during execution and is printed after execution completes, if desired. (RunONNXModel.py always attempts to print the buffer, but when no buffer entries were recorded, i.e., the -profile-ir option was not selected, nothing happens.)

The buffer might overfill, in which case printing will happen during inference time. The buffer size is set to 2x what is needed for Granite 2B, so at this time it should be sufficient. The variable is named MAX_TIME_RECORD_BUFFER and can be manually increased if it proves too small in future tests.
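The overflow behavior can be sketched as a fixed-capacity buffer that flushes mid-inference only when full. MAX_TIME_RECORD_BUFFER is the real variable name from the PR; everything else below is an illustrative Python stand-in, not the C/C++ implementation:

```python
MAX_TIME_RECORD_BUFFER = 4  # tiny capacity, just for illustration

buffer = []
flushed_mid_inference = []  # stands in for printing during inference

def emit_record(record):
    """Append a record; if the buffer is full, flush (print) it immediately.
    That flush is the mid-inference printing described above."""
    if len(buffer) >= MAX_TIME_RECORD_BUFFER:
        flushed_mid_inference.extend(buffer)
        buffer.clear()
    buffer.append(record)

for i in range(10):
    emit_record(f"op_{i}")

# 10 records with capacity 4: two mid-inference flushes of 4 records each,
# leaving 2 records for the final post-inference print.
assert len(flushed_mid_inference) == 8
assert len(buffer) == 2
```

With a capacity sized at 2x the model's needs, the flush branch is never taken in practice, so timing stays unperturbed by IO.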

A Granite run shows the following stats: the sum of the timed operations amounts to 467.8 ms, and the time measured for the inference (using the OM_DRIVER_TIMING support in OMInstrumentHelper.h) is 469.7 ms, so there is about a 1.9 ms discrepancy (possibly including the time to call the timers, plus calling overheads). Printing the instrumentation took 2.7 ms. When IO is involved, there can always be some "unexplained" time; when using buffers, the time to fill the time entries into the buffer should be fairly constant. This filling also occurs during warmup runs, so the buffer should already be mapped into the address space.
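The bookkeeping in that paragraph amounts to comparing the sum of per-op times against the whole-inference measurement; with the numbers reported above:

```python
sum_of_ops_ms = 467.8       # sum of individually timed operations
total_inference_ms = 469.7  # whole-inference (OM_DRIVER_TIMING) measurement

# Whatever the per-op timers did not capture: timer calls, call overheads.
discrepancy_ms = total_inference_ms - sum_of_ops_ms
assert round(discrepancy_ms, 1) == 1.9
```

Note that the 2.7 ms spent printing the instrumentation falls outside the inference measurement entirely, since printing happens after the run when the buffer does not overflow.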

Note that when the signature is also requested, the printing of the signatures happens during inference, between the timing of operations; there will then be a larger discrepancy between the sum of the operation times and the total time spent in the inference.

Note also that the time reported by RunONNXModel.py includes additional overheads linked to transitioning from the Python data to the internal C/C++ data structures.

AlexandreEichenberger (Dec 03 '25 19:12)