XRT OpenCL kernels keep reference count of their arguments until they are released

When a buffer is created using the OpenCL API, its reference count is set to 1. When clSetKernelArg(kernel, index, sizeof(cl_mem), &buffer) is invoked, the buffer's reference count is incremented. However, when the kernel execution task completes, the reference count is not decremented. A call to clReleaseMemObject(buffer) therefore won't actually release the buffer, since its reference count will not reach zero.

Also, Khronos states the following under the 'Notes' section of the clSetKernelArg documentation:

"Implementations shall not allow cl_kernel objects to hold reference counts to cl_kernel arguments, because no mechanism is provided for the user to tell the kernel to release that ownership right. If the kernel holds ownership rights on kernel args, that would make it impossible for the user to tell with certainty when he may safely release user allocated resources associated with OpenCL objects such as the cl_mem backing store used with CL_MEM_USE_HOST_PTR."

Finally, an example case that produces a CL_OUT_OF_RESOURCES error:

1. PLRAM of 2MB available
2. Allocate buffer A (2MB) in PLRAM
3. Invoke and await execution of OpenCL kernel using the buffer allocated in (2)
4. Release buffer A
5. Allocate buffer B of size <= 2MB in PLRAM -- (CL_OUT_OF_RESOURCES)
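The steps above can be replayed in a toy model. This is only a sketch: Bank, Buffer, Kernel, allocate and reproduce_out_of_resources are hypothetical names, none of it is OpenCL or XRT code, and std::shared_ptr simply stands in for the cl_mem reference count.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>

// Hypothetical model of a device memory bank (e.g. PLRAM).
struct Bank {
    std::size_t capacity;
    std::size_t used;
    explicit Bank(std::size_t c) : capacity(c), used(0) {}
};

// A buffer occupies bank space until its last reference goes away.
struct Buffer {
    Bank& bank;
    std::size_t size;
    Buffer(Bank& b, std::size_t n) : bank(b), size(n) { bank.used += size; }
    ~Buffer() { bank.used -= size; }
};
using BufferRef = std::shared_ptr<Buffer>;

// Returns nullptr when the bank is full (stands in for CL_OUT_OF_RESOURCES).
BufferRef allocate(Bank& bank, std::size_t n) {
    if (n > bank.capacity - bank.used) return nullptr;
    return std::make_shared<Buffer>(bank, n);
}

// Models a kernel that keeps a reference on its argument after execution.
struct Kernel {
    BufferRef arg;
    void set_arg(BufferRef b) { arg = std::move(b); }
};

// Replays steps 1 to 5; returns true when the final allocation fails.
bool reproduce_out_of_resources() {
    const std::size_t MB = 1024 * 1024;
    Bank plram(2 * MB);                      // step 1
    Kernel kernel;
    BufferRef a = allocate(plram, 2 * MB);   // step 2, refcount 1
    assert(a != nullptr);
    kernel.set_arg(a);                       // step 3, refcount 2
    a.reset();                               // step 4, refcount back to 1, not 0
    BufferRef b = allocate(plram, 1 * MB);   // step 5, bank is still full
    return b == nullptr;
}
```

In this model, reproduce_out_of_resources() returns true because the kernel's own reference keeps buffer A alive after the user's release.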

[ Tested with many XRT versions including 2.9.0 ]

Feb 01 '21 12:02 jstamel

The deviation from spec is true. We are aware of this. We will however not change this.

XRT's implementation caters to use cases that allow kernels to be re-executed without changing all arguments; for example, just the scalar args can be changed for a subsequent execution. The internal ref count on the cl_mem is released when the cl_kernel is released or when the kernel argument holding the cl_mem object is re-set.

Feb 03 '21 21:02 stsoe

Sorry @jstamel, I did not mean to close. I wanted to ask: what makes it difficult or impossible to control the lifetime of the cl_mem objects by managing the lifetime of the kernel objects?

Feb 03 '21 21:02 stsoe

Thanks @stsoe for the explanation.

The example I included in my first comment is how we actually came across this issue: we were relying on clReleaseMemObject() and were getting a CL_OUT_OF_RESOURCES error. We did try the alternative of "releasing" and "re-creating" the kernel object as a whole before allocating a new buffer, but this introduces up to 1ms of overhead.

Wouldn't it make sense, at least, to provide a way to "reset" specific kernel arguments? Negative (-1) arg_size or NULL arg_value on the clSetKernelArg API call could be it.

Not being able to control the state of the system (including the memory utilization) negates the benefits of an event-driven software pipeline.

More insight on what we are doing:

We have built an abstraction layer for HW resources, including FPGAs, that is responsible for scheduling and orchestrating users' requests for acceleration in the most efficient way, to maximize system throughput. In the case of a single Alveo board and two different applications sending, for example, multiple requests for inference on the same accelerator, the orchestration layer would serialize those requests and ensure that both applications share the underlying HW resources while performing SW optimizations (e.g. pipelining). There are mainly 3 stages for a HW-accelerated task:

a. allocate and send data to the device
b. execute the kernel and wait until it is done
c. transfer data from the device to the host and release buffers

In an ideal SW pipeline, (a) would be performed on the current task right after the previous task had finished (a) and moved on to (b). Now, in case a task buffer is big enough to fill a memory bank, the system must wait and try to evict memory that is allocated but not in use at the moment. So the only alternative here (since each task may reference the same or different cl_mem objects) is to create the kernel on every task submission and release it after the execution has finished, to ensure that a call to clReleaseMemObject() actually removes the buffer from device memory. This leads to the ~1ms overhead per task execution.

I hope this clarifies why we need to find a way of at least resetting a kernel's arguments.

Feb 04 '21 11:02 jstamel

@jstamel Thank you for the detailed explanation.

If the "task buffer" fills the memory bank, then another buffer cannot be allocated until the kernel finishes execution and is done using the task buffer, right? So (a) cannot overlap with (b)? But I understand the predicament: you have to release the kernel in order to make room for the next task buffer. I would be surprised if releasing and recreating the kernel were the cause of the 1ms overhead. I wonder if the overhead is related to the actual true release of the buffer itself?

The current behavior is designed to avoid recreating the kernel and resetting kernel arguments. It is optimized for reuse of already allocated buffers that are assigned to kernels. That said, providing a way to reset the kernel argument should be easy enough; I imagine we can set a nullptr cl_mem object as the argument, but I don't think cl_mem mem = nullptr; clSetKernelArg(..., &mem); works today. I will check and make it work if necessary. But if the mem bank has at least 4k of extra room, then an easy workaround would be to allocate a sentinel cl_mem and set that as the argument to clear the original one.
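A toy model of the sentinel idea, as a sketch only. Bank, Buffer, Kernel, allocate and sentinel_frees_bank are hypothetical names, not OpenCL or XRT code. The model assumes, as stated earlier in this thread, that re-setting an argument drops the kernel's reference on the previous cl_mem.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>

// Hypothetical model of a device memory bank.
struct Bank {
    std::size_t capacity;
    std::size_t used;
    explicit Bank(std::size_t c) : capacity(c), used(0) {}
};

// A buffer occupies bank space until its last reference goes away.
struct Buffer {
    Bank& bank;
    std::size_t size;
    Buffer(Bank& b, std::size_t n) : bank(b), size(n) { bank.used += size; }
    ~Buffer() { bank.used -= size; }
};
using BufferRef = std::shared_ptr<Buffer>;

// Returns nullptr when the bank cannot hold n more bytes.
BufferRef allocate(Bank& bank, std::size_t n) {
    if (n > bank.capacity - bank.used) return nullptr;
    return std::make_shared<Buffer>(bank, n);
}

// Assigning a new argument drops the reference held on the previous one.
struct Kernel {
    BufferRef arg;
    void set_arg(BufferRef b) { arg = std::move(b); }
};

// Returns true when the next task buffer fits after the sentinel swap.
bool sentinel_frees_bank() {
    const std::size_t KB = 1024;
    const std::size_t MB = 1024 * KB;
    Bank bank(2 * MB);
    Kernel kernel;
    BufferRef sentinel = allocate(bank, 4 * KB);       // small placeholder buffer
    BufferRef task = allocate(bank, 2 * MB - 4 * KB);  // fills the rest of the bank
    assert(sentinel != nullptr && task != nullptr);
    kernel.set_arg(task);
    // ... kernel executes and finishes here ...
    kernel.set_arg(sentinel);  // kernel lets go of the task buffer
    task.reset();              // the user's release now really frees it
    assert(bank.used == 4 * KB);
    BufferRef next = allocate(bank, 2 * MB - 4 * KB);
    return next != nullptr;
}
```

The model also shows the limitation: the sentinel itself permanently takes 4 KB, so a task buffer that needs the entire bank would still not fit.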

Feb 04 '21 16:02 stsoe

@stsoe Exactly, (a) cannot overlap with (b) in such a case. The overhead does not include the release of a buffer, just the clReleaseKernel() and clCreateKernel() calls, right after the kernel has finished its execution. In other words, no clReleaseMemObject() has been called at that point. Also, please find below more detail on the measured time for 2 tasks:

Release Kernel 0.004000 ms, create kernel 0.896000 ms, total: 0.900000 ms
Release Kernel 0.007000 ms, create kernel 1.055000 ms, total: 1.062000 ms
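The thread does not show how these numbers were taken. One possible way to time a single call is sketched below; elapsed_ms is a hypothetical helper, not part of OpenCL or XRT.

```cpp
#include <cassert>
#include <chrono>
#include <ratio>

// Hypothetical helper: wall-clock duration of one call to work(), in milliseconds.
template <typename F>
double elapsed_ms(F&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    assert(stop >= start);  // steady_clock never goes backwards
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

With an OpenCL kernel handle in scope, a measurement could look like elapsed_ms([&] { clReleaseKernel(kernel); }), with a second call wrapping clCreateKernel().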

I totally understand your approach, and it is well aligned with the fact that we also want to avoid releasing and re-creating a kernel. We also thought about the workaround with a small buffer, but we cannot tell for sure whether it is going to work, since it depends on the user's input data size. Indeed, cl_mem mem = nullptr; clSetKernelArg(..., &mem); doesn't work at the moment, but if you could make it work it would be a solid workaround.

Feb 05 '21 11:02 jstamel

Hello @jstamel, is there any reason you cannot reuse the buffer instead? Allocating a new buffer and setting it as a kernel argument is only possible once the current kernel execution has finished. So in that case, you can just reuse the same buffer: write to the buffer itself, or to the mapped pointer, for the next execution. Even if the next execution needs a different amount of data, that can be adjusted with offsets. I guess this should be more efficient than creating new buffers, no?

Feb 05 '21 15:02 uday610

Hi @uday610,

Your approach makes sense and is reasonable but I think that it doesn't apply in the general case. For example, let's say you have a PLRAM of 2MB available and that you have allocated a cl_mem buffer of 1.5MB. On the next execution you need a 2MB buffer on the PLRAM. Is it possible to extend the cl_mem buffer size? On the other hand, you can't just allocate all the memory in case you come across this issue.

As a workaround to enforce the Khronos specification on our side we implemented the following functionality:

- On kernel creation, we get the kernel's number of arguments and create an empty list of that size.
- Before setting a kernel argument, we look into the list to see whether a buffer was already associated with this kernel argument. If so, we retain that (previous) buffer. We then set the kernel argument to the current buffer, invoke clReleaseMemObject() on it to decrement its reference count, and store it in the list.
- Prior to invoking clReleaseKernel(), we retain all the buffers in our arguments list to increment their reference counts.

Example:

create buffer A (ref.A = 1)
create buffer B (ref.B = 1)

1. Set kernel argument (no previous argument for the kernel argument index exists)

   retain is not invoked
   clSetKernelArg(kernel, index, sizeof(cl_mem), &A) (ref.A = 2)
   clReleaseMemObject(A) (ref.A = 1)
   args_list[index] = A

2. Set kernel argument (previous buffer for this index exists)

   clRetainMemObject(args_list[index]) (ref.A = 2)
   clSetKernelArg(kernel, index, sizeof(cl_mem), &B) (ref.A = 1, ref.B = 2)
   clReleaseMemObject(B) (ref.B = 1)
   args_list[index] = B

3. Release kernel (retain all buffers in args_list)

   for (cl_mem arg: args_list) clRetainMemObject(arg) (ref.B = 2)
   clReleaseKernel(kernel) (ref.B = 1)

4. Release buffers (their refcount equals 1)

   clReleaseMemObject(A)
   clReleaseMemObject(B)

Now, in case A was released before (2), clSetKernelArg() in (2) will try to decrement the refcount of that nonexistent buffer. From our tests, it seems that the buffer is checked prior to decrementing its reference count, so no errors are reported. However, if we had the option to reset a kernel's argument, we would do so before releasing buffer A.
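The bookkeeping described above can be sketched against a small reference-counting model. Everything here is hypothetical: Mem, Kern, retain, release, set_arg and release_kernel only imitate the counting behaviour that this thread attributes to clRetainMemObject, clReleaseMemObject, clSetKernelArg and clReleaseKernel under XRT, and ArgTracker is an illustrative name rather than our actual code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for cl_mem and cl_kernel; they only count references.
struct Mem { int refs = 1; };                  // a new buffer starts at 1
struct Kern { std::vector<Mem*> args; };

void retain(Mem* m)  { ++m->refs; }            // models clRetainMemObject
void release(Mem* m) { --m->refs; }            // models clReleaseMemObject (never frees here)

// Models clSetKernelArg under XRT: drop the old argument's reference, take one on the new.
void set_arg(Kern& k, std::size_t i, Mem* m) {
    if (k.args[i]) release(k.args[i]);
    retain(m);
    k.args[i] = m;
}

// Models clReleaseKernel under XRT: drop every reference the kernel still holds.
void release_kernel(Kern& k) {
    for (Mem*& m : k.args) {
        if (m) { release(m); m = nullptr; }
    }
}

// Cancels out the kernel's hidden references so that each buffer's count
// reflects only what the caller holds.
class ArgTracker {
public:
    explicit ArgTracker(std::size_t num_args) : args_list_(num_args, nullptr) {
        kernel_.args.assign(num_args, nullptr);
    }

    void set(std::size_t i, Mem* m) {
        assert(i < args_list_.size());
        if (args_list_[i]) retain(args_list_[i]);  // offsets the decrement set_arg performs
        set_arg(kernel_, i, m);
        release(m);                                // offsets the kernel's increment
        args_list_[i] = m;
    }

    void release_kernel_now() {
        for (Mem* m : args_list_) {
            if (m) retain(m);                      // offsets the decrements done on release
        }
        release_kernel(kernel_);
        args_list_.assign(args_list_.size(), nullptr);
    }

private:
    Kern kernel_;
    std::vector<Mem*> args_list_;
};
```

Running the example sequence (set A, then set B at the same index, then release the kernel) leaves both counts at 1 in this model. The model never frees memory, so it does not capture the hazard of retaining a buffer that has already been released.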

Feb 08 '21 10:02 jstamel
