RFC: RTC kernel cache file behaviour

Open · malcolmroberts opened this issue on Oct 11 '22

This is a request for comment (RFC) regarding how rocFFT organizes the kernel cache file.

rocFFT is starting to make more use of runtime compilation (RTC) to generate more targeted kernels and improve performance on a wider variety of transforms. rocFFT will runtime-compile kernels when some FFT plans are created, and it caches the results at the user level so that creating the same or similar plans can avoid the compilation cost (0.5-3 seconds, depending on the FFT plan and the system doing the compiling). In ROCm 5.4 (rocFFT 1.0.19), this cache will be in-memory by default; this can be changed by setting the environment variable ROCFFT_RTC_CACHE_PATH to a writable filename. To set the cache back to in-memory only, set ROCFFT_RTC_CACHE_PATH to :memory:.
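
For illustration, a minimal C++ sketch of this configuration. The cache path is an example, and this assumes the variable must be visible before any rocFFT call; exporting the variable in the shell before launching the program is equivalent:

```cpp
#include <cstdlib>          // setenv (POSIX)
#include <rocfft/rocfft.h>  // may be <rocfft.h> on older installs

int main()
{
    // Persist RTC kernels to a user-writable file (example path)...
    setenv("ROCFFT_RTC_CACHE_PATH", "/home/user/.cache/rocfft_kernels.db", 1);
    // ...or force the in-memory-only default explicitly:
    // setenv("ROCFFT_RTC_CACHE_PATH", ":memory:", 1);

    rocfft_setup();
    // ... create plans and execute transforms; RTC kernels compiled here
    //     are written to the cache file named above ...
    rocfft_cleanup();
    return 0;
}
```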

Using the default in-memory cache doesn't allow for re-use of RTC kernels between processes. Setting ROCFFT_RTC_CACHE_PATH to a user-writable file enables inter-process cache use: every time a plan is created, the cache file is queried to check whether the required kernel is already cached. If the kernel is not in the cache, it is compiled and the cache file is updated. This is a problem on clusters, where many processes access the same cache file at the same time; the resulting contention causes some cache writes to fail. We expect perhaps tens of thousands of processes to read and write the cache simultaneously.
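
In rough pseudocode, the access pattern looks something like the sketch below. The helper functions and the map standing in for the cache file are hypothetical; the point is only that every plan creation in every process touches the same shared file:

```cpp
#include <map>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-ins: a map playing the role of the on-disk cache
// file, and a compiler stub (0.5-3 s of work in reality).
std::map<std::string, std::vector<char>> cache_file;

std::optional<std::vector<char>> cache_lookup(const std::string& key)
{
    auto it = cache_file.find(key);
    if(it == cache_file.end())
        return std::nullopt;
    return it->second;
}

std::vector<char> compile_kernel(const std::string&) { return {}; }

void cache_store(const std::string& key, const std::vector<char>& code)
{
    cache_file[key] = code; // on a cluster, this write can fail under contention
}

// The per-plan-creation pattern: query the shared cache, compile on a
// miss, write the result back. N processes repeat this against one file.
std::vector<char> get_rtc_kernel(const std::string& kernel_key)
{
    if(auto code = cache_lookup(kernel_key)) // contended read
        return *code;
    auto code = compile_kernel(kernel_key);
    cache_store(kernel_key, code);           // contended write
    return code;
}
```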

We would prefer a solution where RTC kernels are persisted while avoiding file I/O contention.

Context

Filesystem I/O is an important aspect of high-performance computing. PRACE offers a best-practices guide at https://prace-ri.eu/wp-content/uploads/Best-Practice-Guide_Parallel-IO.pdf. The usage pattern for writing a cache file is analogous to writing a log file or a configuration file: many processes access a single logical file. From the ORNL best practices (https://www.olcf.ornl.gov/wp-content/uploads/2015/02/OLCF-IO-Best-Practices.pdf), the best practice for small, shared files is to read from a single task.

System cache

rocFFT also ships with a system cache – this is a read-only database of kernels located next to librocfft.so, in the same format as the user cache. Kernels that are compiled when the library is built and distributed with the library go here. Because it is read-only, rocFFT assumes that queries into this database are unobjectionable. Processes that use rocFFT would need to read librocfft.so anyway, so if those reads are problematic then the library files and the system cache can be copied/mirrored to a more suitable location. Either way, rocFFT will also look for a ROCFFT_RTC_SYS_CACHE_PATH environment variable to override the location of the system cache. This RFC does not apply to the system cache.

Proposed solutions

Solution 1: Maintain cache access pattern, let user specify location.

rocFFT keeps the same cache access pattern, checking the cache file at each RTC kernel creation. The user can set the cache location on a per-process level.
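
A minimal sketch of what per-process placement could look like on a cluster, assuming a node-local /tmp and using SLURM_PROCID as one possible per-process identifier (an MPI rank works equally well):

```cpp
#include <cstdlib>
#include <string>
#include <rocfft/rocfft.h>

int main()
{
    // SLURM_PROCID is one way to get a unique per-process ID; the /tmp
    // prefix assumes a node-local filesystem.
    const char* id = std::getenv("SLURM_PROCID");
    std::string path = "/tmp/rocfft_cache_" + std::string(id ? id : "0") + ".db";
    setenv("ROCFFT_RTC_CACHE_PATH", path.c_str(), 1);

    rocfft_setup();
    // ... plan creation now reads/writes only this process's cache file,
    //     so there is no cross-process contention ...
    rocfft_cleanup();
    return 0;
}
```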

Pros:

  • Users can persist the cache by setting ROCFFT_RTC_CACHE_PATH to a persistent file location.

  • Clusters may be able to avoid file I/O contention by setting ROCFFT_RTC_CACHE_PATH to a location on a high-performance parallel filesystem or in a node-local temp directory.

  • Any node that recreates FFT plans in separate processes without clearing the temporary directories will be able to use the cache.

Cons:

  • Cluster users will have to move the new cache files after a run and collate the results (probably using a separate binary) in order to persist the cache between runs.

Solution 2: Add explicit API to manage the cache

rocFFT changes its access pattern on the cache, so that reads and writes happen at user-controlled times.

Solution 2a: Add API to load/save entire user cache

A new API function is added to load the entire cache into memory. When users have finished creating the plans they want, they can call another function to save all of the newly-compiled kernels back to disk.
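
A sketch of how that flow might look. The three cache functions declared here are hypothetical names for the proposed API, not an existing rocFFT interface, and error handling is omitted:

```cpp
#include <cstddef>
#include <cstdio>

// Hypothetical API for illustration only: load a serialized cache blob,
// dump the whole in-memory cache, and free the dumped buffer.
extern "C" void rocfft_cache_deserialize(const void* buffer, std::size_t len);
extern "C" void rocfft_cache_serialize(void** buffer, std::size_t* len);
extern "C" void rocfft_cache_buffer_free(void* buffer);

void run(const void* saved, std::size_t saved_len)
{
    // One read, at a time the application chooses.
    rocfft_cache_deserialize(saved, saved_len);

    // ... create every plan the application needs; cache misses are
    //     compiled into the in-memory cache ...

    // One write, again at a time the application chooses.
    void*       buf = nullptr;
    std::size_t len = 0;
    rocfft_cache_serialize(&buf, &len);
    if(std::FILE* f = std::fopen("rocfft_cache.db", "wb"))
    {
        std::fwrite(buf, 1, len, f);
        std::fclose(f);
    }
    rocfft_cache_buffer_free(buf);
}
```

On a cluster, a single rank could perform the file read/write and broadcast the buffer to the others, matching the ORNL guidance above of handling small, shared files from a single task.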

Pros:

  • Cache I/O is explicit – contention can be avoided by managing API calls.

  • At most one pair of API calls per process to manage the cache.

  • Similar to existing FFTW wisdom-file behaviour.

Cons:

  • Cache I/O is explicit – requires user intervention to persist the RTC kernel cache.

  • Reading the cache might fetch more kernels than are strictly required, if the cache contains kernels for plans that won’t be required in the current process. These extra kernels would also use extra host memory.

Solution 2b: Add API to load/save kernels for a single plan.

A new API function is added to load kernels relevant to a single plan. Once the plan is compiled, users can call another function to save any kernels in that plan back to disk.
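
A sketch of the per-plan variant, again with hypothetical function names; exactly when kernel compilation happens relative to these calls is left open by the proposal:

```cpp
#include <cstddef>
#include <rocfft/rocfft.h>

// Hypothetical per-plan API for illustration only.
extern "C" rocfft_status rocfft_plan_load_kernels(rocfft_plan plan,
                                                  const void* buf, std::size_t len);
extern "C" rocfft_status rocfft_plan_save_kernels(rocfft_plan plan,
                                                  void** buf, std::size_t* len);

void build_plan(const void* saved, std::size_t saved_len)
{
    rocfft_plan  plan   = nullptr;
    const size_t length = 1024;
    rocfft_plan_create(&plan, rocfft_placement_inplace,
                       rocfft_transform_type_complex_forward,
                       rocfft_precision_single,
                       1, &length, 1, nullptr);

    // Fetch only the kernels this plan needs from a previously saved blob...
    rocfft_plan_load_kernels(plan, saved, saved_len);

    // ...and once the plan's kernels are compiled, persist just those.
    void*       buf = nullptr;
    std::size_t len = 0;
    rocfft_plan_save_kernels(plan, &buf, &len);
    // ... write buf to disk at a user-controlled time, then free it ...

    rocfft_plan_destroy(plan);
}
```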

Pros:

  • Cache I/O is explicit – contention can be avoided by managing API calls.

  • Only the kernels that a plan needs are read from the cache.

Cons:

  • Cache I/O is explicit – requires user intervention to get optimal behaviour, or indeed any caching of compiled kernels at all.

  • More I/O operations if multiple distinct plans are being compiled.

malcolmroberts · Oct 11 '22

From a toolkit user's perspective: as long as the changes don't adversely affect my existing code/workflow or require a lot of updating or new steps, I don't have a strong opinion. Faster caching is beneficial, but only if the default setting is "runs anywhere".

Ideally, this means the solution would:

  • Have robust defaults, even if not the most performant
  • Offer a straightforward way to tune for more performance (environment variables are good, as long as they are documented well)
  • Maintain compatibility with existing rocFFT implementations, if possible

From that perspective, a new/separate function to set the runtime policy shouldn't affect existing routines. If you're considering an environment variable as well, how about allowing the runtime policy to be set either with an environment variable or through a new function?
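
A sketch of that resolution order (explicit function call first, then environment variable, then a safe default); the enum, the function, and the ROCFFT_CACHE_POLICY variable are all hypothetical illustrations:

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical policy type; nothing like this exists in rocFFT today.
enum cache_policy { cache_in_memory, cache_file_backed };

cache_policy resolve_policy(const cache_policy* api_choice)
{
    if(api_choice)                 // 1. explicit API call wins
        return *api_choice;
    if(const char* env = std::getenv("ROCFFT_CACHE_POLICY")) // 2. environment
        return std::strcmp(env, "file") == 0 ? cache_file_backed
                                             : cache_in_memory;
    return cache_in_memory;        // 3. robust "runs anywhere" default
}
```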

frobnitzem · Oct 17 '22

We have received feedback and will make an internal decision on caching.

doctorcolinsmith · Nov 21 '22