kokkos-tools icon indicating copy to clipboard operation
kokkos-tools copied to clipboard

Using tools induces undesirable fences in certain cases

Open vlkale opened this issue 3 years ago • 2 comments

Description of Bug

When using profiling tools of Kokkos to profile a Kokkos application, Kokkos invokes memory fences that are unneeded. This is particularly the case when a memory fence in a Kokkos application code is itself being profiled. These unneeded memory fences used by the tool are not desirable because they slow down execution of the application when the Kokkos tools are being used and do not reflect the actual performance of the application code. If this problem is not fixed, then users of profiling tools will not obtain an accurate performance profile of the application's execution. Changing the Kokkos tools libraries by removing, or filtering out, the memory fences under particular context where they are not needed will not impact the correctness of the application code. A solution is to filter out the unneeded fences under particular contexts.

Details of the Problem

This problem is applicable to Kokkos v. 3.7.0 on any platform. More information and example code for reproducing this problem will be added soon by Vivek.

From @crtrott and expanded on by @vlkale, there are actual two orthogonal [sub]issues within this GitHub Issue:

  • Sub-Issue 1: Some tools need to turn off fencing of tool callbacks, depending on how the tool is used. These tools (like the space-time-stack and simple kernel timer) need to be able to read an environment variable which sets the ON/OFF behavior for the tool setting callback. Specifically, spaceTimeStack and the vendor tools need that. [IOW, we need to eliminate memory fences so that Kokkos tool callbacks can capture the state of environment variables which control behavior of Kokkos-associated tools, e.g., a simple kernel timer, and other vendor tools.]

  • Sub-Issue 2: If tool callback fencing is on and the tool implements the callback for fencing, we want to filter out fences which are solely due to callback fencing [as they will slow down execution of the application]. Note these are different things: a callback for "beginParallelFor" invokes a Kokkos global fence if callback fencing is on i.e., calls Kokkos::fence(), but a global fence also invokes the tools callback for "beginFence" and "endFence". [IOW, we need to remove the extraneous memory fences of runtime functions of the Kokkos-expanded code, which invoked by Kokkos's tool callback fencing mode being on, since the tool is implementing the callback for fencing itself].

To illustrate: with tools callback fencing on, the following Kokkos program:

Kokkos::fence();

Kokkos::parallel_for(...);

expands conceptually into:

// Kokkos::fence(); 
beginFence();  // <--- Note: could have a preFence() to set up data structures and bookkeeping for fence
fence_impl();   //  <--- We need this fence and this should stay
endFence();

// Kokkos::parallel_for();

beginFence();
fence_impl();  //  <--- This fence is extraneous from tools fencing being on and may slow down code execution 
endFence();
beginParallelFor();
parallel_for_impl();
beginFence();   
fence_impl();  //   <-- This fence is extraneous from tools fencing being on and may slow down code execution 
endFence();
endParallelFor();

With tools [callback] fencing off, the above Kokkos code expands into:

// Kokkos::fence();
beginFence();
fence_impl();
endFence();

// Kokkos::parallel_for();
beginParallelFor();
parallel_for_impl();
endParallelFor();

The fences induced by beginParallelFor and endParallelFor can have specific string labels attached, so we should be able to filter [i.e., automatically remove] these fences out inside the tool based on the attached string labels. Note that there may be a preFence() function before the beginFence() function. This also may need to be removed.

Any tool [of Kokkos] which currently implements the fence callback probably should do this filtering.

Additional Info

Discussed with @crtrott on Oct. 11th, and discussed in end of Kokkos team meeting on October 12 by Damien and @crtrott. More details on problem and reproducing the error being added by Vivek. Thanks to @crtrott for initial reproducer and details and specifics on the problem, splitting the problem into two.

vlkale avatar Oct 18 '22 04:10 vlkale

The fences induced by beginParallelFor and endParallelFor have specific string labels attached, so we should be able to filter [i.e., automatically remove] these fences out inside the tool based on the attached string labels.

Based on my experience dealing with these sort of things in tools, basing the filtering on strings will result in at least 1-2 weird bugs at some point down the road. For example, what if the user labels their parallel reduce with "parallelReduce-after-parallelFor"? (i.e. they inadvertently use the keyword you are depending on for the logic workflow).

Might want to consider something like a "ToolsState" struct instance/singleton which is passed/accessible by the tools callbacks and the callbacks simply update, say, a depth field when starting a callback and decrement the depth field when the callback is completed. Something like that would allow the fence callback to detect that it is inside another callback and therefore shouldn't invoke it's callbacks.

jrmadsen avatar Oct 19 '22 03:10 jrmadsen

@jrmadsen Agreed that it not a good long-term solution to filter out desired Kokkos runtime function calls using a particular string labels with the assumption that the user (i.e., Kokkos programmer) provides the labels to associated Kokkos language constructs, e.g., Kokkos::fence(). Filtering out based on a string is an initial/rough solution, but I would say we do need to consider

Yes, a ToolsState singleton sounds good, having talked with @dalg24 about this earlier this past Wednesday. A good test of this ToolsState would be the Kokkos spaceTimeStack. I will consider this as an alternative with @crtrott soon (in a week from now, he is on vacation though may be checking intermittently).

Please don't hesitate to provide any other comments as we develop this.

vlkale avatar Oct 22 '22 01:10 vlkale

Note we need to remove extraneous fence in the case that a programmer of Kokkos executes a loop containing dependences on heterogeneous processors. For example, a programmer can decidesto split a Kokkos::parallel_for() working the same view across a CPU and GPU and assuming dependences between the GPU's loop and CPU's loop. I don't know how many examples we have of this in Kokkos kernels. This is a particular though special use case of the above.

For example, if a programmer does the following for a parallel for loop with starting index 0 and ending index n to split a loop across CPU and GPU, like this:

Kokkos::parallel_for(ViewA, CUDABackend, ...., 0, k-1);
Kokkos::fence(); // <-- Need this fence for correctness 
Kokkos::parallel_for(ViewA, OpenMPCPUBackend, ...., k, n);  

would expand conceptually to:


beginFence();
fence_impl(); 
endFence();  
beginParallelFor();
parallel_for_impl( ViewA, OpenMPCPUBackend, ...., 0, k-1);
endParallelFor();
beginFence();
fence_impl();  //  <--- This fence is extraneous from tools fencing being on and may slow down code execution 
endFence(); 


// Kokkos::fence(); 
beginFence();  // <--- Note: could have a preFence() to set up data structures and bookkeeping for fence
fence_impl();   //  <--- We need this fence and this should stay
endFence();


beginFence();
fence_impl();  //  <--- This fence is extraneous from tools fencing being on and may slow down code execution 
endFence();
// Kokkos::parallel_for();
beginParallelFor();
parallel_for_impl( ViewA, OpenMPCPUBackend, ...., k, n);
endParallelFor();
beginFence(); 
fence_impl();  
endFence();

The above code should be changed to:


beginFence(); 
fence_impl();  
endFence();

beginParallelFor();
parallel_for_impl( ViewA, OpenMPCPUBackend, ...., 0, k-1);
endParallelFor();

// Kokkos::fence(); 
beginFence();  // <--- Note: could have a preFence() to set up data structures and bookkeeping for fence
fence_impl();   //  <--- We need this fence and this should stay
endFence();

// Kokkos::parallel_for();
beginParallelFor();
parallel_for_impl( ViewA, OpenMPCPUBackend, ...., k, n);
endParallelFor();

beginFence(); 
fence_impl();  
endFence();


vlkale avatar Nov 02 '22 17:11 vlkale

The fix is discussed and tested, and this is part of kokkos-tools develop branch now.

vlkale avatar Dec 23 '22 17:12 vlkale