opencl-intercept-layer
Buffer Hashes / Compressed buffer dumps
Observed Behavior
If one enables buffer dumps for multiple (possibly all) buffers, the dumps take up a large amount of disk space, running into several gigabytes. The need for that much disk space can rule out performing such a run on many systems. Compressing the dumped files after the entire run is also impractical.
Desired Behavior
- If it is possible to output buffer hashes (e.g. sha512) as an independently selectable knob, it can provide a way to collect a low-footprint signature of all the buffers.
- Buffer Compression as an independently selectable knob will provide another equally useful way to collect a medium-footprint signature of all the buffers.
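To make the footprint difference concrete, here is a minimal sketch in Python rather than in CLIntercept's own code. The function names and the sample buffer are hypothetical, and bzip2 is used only because it ships with Python's standard library; it is not a claim about what CLIntercept would use.

```python
import bz2
import hashlib


def buffer_signature(data: bytes) -> str:
    """Low-footprint signature: hex SHA-512 digest of the buffer contents."""
    return hashlib.sha512(data).hexdigest()


def compress_buffer(data: bytes) -> bytes:
    """Medium-footprint dump: the buffer contents, bzip2-compressed."""
    return bz2.compress(data)


# Hypothetical 1 MiB buffer holding a repetitive pattern.
buffer_contents = bytes(range(256)) * 4096

signature = buffer_signature(buffer_contents)
compressed = compress_buffer(buffer_contents)

print("raw bytes:       ", len(buffer_contents))
print("signature chars: ", len(signature))
print("compressed bytes:", len(compressed))
```

The signature has a fixed size regardless of the buffer size, but the contents cannot be recovered from it. The compressed dump can be restored with `bz2.decompress`, and its size depends on how repetitive the data is.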
Steps to Reproduce
Execute an OpenCL application under CLIntercept control, enabling all buffer dumps
Buffer hashes (option 1) should be fairly straightforward. I'll look into adding this, unless somebody else gets to it first.
Buffer compression (option 2) sounds interesting but I have far less experience in this area. Does anyone have suggestions where to begin if we decide to pursue this option? My main selection criteria are to minimize dependencies on external libraries (it would probably stay optional and be included only when requested), to be cross-platform (at least Windows, Linux, and OSX), and to have a reasonably permissive license.
Sounds good. I have a pilot for hashes if that helps. I also have one with compression, using bzip2 I think.
I had similar goals. I wanted the binary values of all kernel arguments before each enqueue. This would include scalars and cl_mem object contents. We'd intercept in clSetKernelArg, and might need to read buffer contents.
My position on the buffer issue: use binary value equality/identity. For any binary we dump, we come up with a unique mapping by value to a file based on the hash, and then reference that binary symbolically. The effect is that if a kernel enqueues buffers with the exact same values over and over (hint: as in benchmarks), we intern the results. E.g. we could dump these in C:\Intel\CLIntercept\my_process_exe_5121\values*, whether they are scalars or cl_mem contents. The API trace references those files.
- This works either with probabilistic ("soft") correctness, where we trust the hash, or with total correctness, where we also compare contents to rule out hash collisions.
- If we wanted to, we could map each buffer's hash identity (e.g. sha512) to a shorter, friendlier name.
- I don't think one needs sha512. You could probably get away with far less, but I don't feel that strongly. I just don't want to deal with very long identifiers or names in logs and filenames (a sha512 digest is 512 bits, i.e. 128 hex characters).
- Compression probably won't be needed if we use value equality. Benchmarks are going to be extremely repetitive, and using value identity on buffers will trump any compression algorithm that blurts out the same bits over and over again. A benchmark that makes 1000 calls might only have 20 buffers saved. That's far better than any {b,g}zip. Still, I wouldn't object if someone put compression on top of this too.
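The interning idea can be sketched roughly as follows, in Python for brevity. This is only an illustration: `BufferInterner`, the `value_NNNN` naming scheme, and the synthetic benchmark loop are all hypothetical, and the sketch takes the "soft" route of trusting the hash without comparing contents.

```python
import hashlib


class BufferInterner:
    """Interns binary values: identical contents map to one short name."""

    def __init__(self):
        self._name_by_digest = {}  # full hex digest -> short name
        self.values = {}           # short name -> contents (stored once)

    def intern(self, data: bytes) -> str:
        digest = hashlib.sha512(data).hexdigest()
        name = self._name_by_digest.get(digest)
        if name is None:
            # First time this value is seen: assign a short, friendly name
            # so logs never have to carry the full digest.
            name = f"value_{len(self._name_by_digest):04d}"
            self._name_by_digest[digest] = name
            self.values[name] = data
        return name


interner = BufferInterner()
trace = []

# Hypothetical benchmark: 1000 enqueues cycling through 20 distinct inputs.
for call in range(1000):
    contents = bytes([call % 20]) * 1024
    trace.append(f"enqueue #{call}: arg0 = {interner.intern(contents)}")

print(len(trace), "trace entries,", len(interner.values), "stored values")
```

With this synthetic workload the trace has 1000 entries but only 20 values are stored, and each trace line refers to a value by its short name.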
One potential problem that might lead to an explosion of binary files: what happens when a buffer passed to clSetKernelArg is undefined (e.g. an output buffer that hasn't been initialized might contain whatever junk the OS gave us from the VM allocator)? Even so, drivers may be careful enough to give us zeros or some other consistent pattern. The hook in clSetKernelArg would need to call clEnqueueReadBuffer or clEnqueueMapBuffer. If you're intercepting this elsewhere you might have even more information and be able to avoid this problem.
Yes, dumping identical buffers only once and referencing them symbolically does reduce the space significantly, so it may address many cases. In my pilot, I did observe some applications generating hundreds of distinct buffers. In those cases compression may be useful as the next option.
Thinking about this a bit more: hashing the memory objects is a good first step regardless, since this is effectively what we'll need to do for binary equality. Three interesting options could be:
1. Dump the entire contents of the memory object before and/or after each enqueue. This is the implemented behavior today.
2. Hash the contents of each memory object and emit a file with the hash before and/or after each enqueue. This minimizes required storage and is sufficient if you only want to know whether the buffer contents are the same or different from run to run, but do not care about the actual buffer contents.
3. Hash the contents of each memory object, emit a file with the hash before and/or after each enqueue, and then also dump the contents of the buffer to a file if and only if a file for the hash doesn't already exist (or if the contents of the buffer don't match the files with the specified hash, if collisions are possible). This provides information about the buffer contents that is not available with (2), but without requiring as much storage as (1).
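As a rough sketch of how the three options differ on disk, written in Python rather than in CLIntercept's C++: the function names, the `mode` parameter, the file naming scheme, and the labels are all assumptions made for illustration, not existing CLIntercept controls. The numeric suffix is one possible way to handle the collision case.

```python
import hashlib
import tempfile
from pathlib import Path


def store_value_once(data: bytes, digest: str, out_dir: Path) -> str:
    """Write `data` unless an identical value file exists; return its name."""
    suffix = 0
    while True:
        value_file = out_dir / f"{digest}_{suffix}.bin"
        if not value_file.exists():
            value_file.write_bytes(data)
            return value_file.name
        if value_file.read_bytes() == data:
            return value_file.name  # same contents already stored
        suffix += 1  # same digest, different contents: a collision


def dump_memory_object(data: bytes, label: str, out_dir: Path, mode: int) -> None:
    """Dump one memory object following option 1, 2, or 3.

    `label` is a hypothetical per-enqueue name such as "enqueue_0007_arg0_pre".
    """
    if mode == 1:
        # Option 1: always write the full contents.
        (out_dir / f"{label}.bin").write_bytes(data)
        return
    digest = hashlib.sha512(data).hexdigest()
    reference = digest  # Option 2: record only the hash.
    if mode == 3:
        # Option 3: also keep one copy of each distinct value.
        reference = store_value_once(data, digest, out_dir)
    (out_dir / f"{label}.hash").write_text(reference + "\n")


with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp)
    # Six enqueues that alternate between two distinct buffer contents.
    for i in range(6):
        dump_memory_object(bytes([i % 2]) * 4096, f"enqueue_{i:04d}_arg0", out_dir, mode=3)
    hash_file_count = len(list(out_dir.glob("*.hash")))
    value_file_count = len(list(out_dir.glob("*.bin")))

print(hash_file_count, "hash files,", value_file_count, "value files")
```

In this run, option 3 writes one small hash file per enqueue but only one value file per distinct buffer contents, which is the storage trade-off between (1) and (2).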
Note that there is already a control to initialize the contents of buffer memory objects:
https://github.com/intel/opencl-intercept-layer/blob/master/docs/controls.md#initializebuffers-bool
Similar controls could be added for other memory object types, such as images, and for SVM allocations, to further reduce the likelihood of uninitialized memory objects leading to non-deterministic hashes / dumping.