Solver/Solution framework
This is a copy of a presentation for the MIOpen team that I gave a couple of years ago, when we introduced and implemented the Solver/Solution architecture. It does not cover recent additions such as GetWti() and Invokers. I would like to make it available to all MIOpen developers, including collaborators.
Though a bit outdated, this should provide a good overview of how device code is abstracted away from the rest of the library.
1. Intent
Problem: Variety of convolution kernels
- Different formats (OCL, asm, binary)
- Capable of handling different (but often intersecting) subsets of convolution configs
- Some may benefit from auto-tuning, some not.
- Each kernel may require its own set of #defines for build
- There are cases when two (or more) kernels need to be run to do the job
- Kernels need different parameters during invocation
Experience shows that straightforward attempts to support such a set of kernels result in host code which is large, fragile, and difficult to develop and maintain. You may see leftovers of this in convolutionocl.cpp.
Goal: Provide abstractions which are able to represent, in a single place, all the information required to
- Select kernel(s) which can do the required job
- Build these kernel(s)
- Run kernels
- Auto-tune kernel(s)
Such abstractions allow working with all convolutions in a unified manner. Currently, there are:
- ProblemDescriptions
- Solvers
- Non-searchable
- Searchable, which implement auto-tune internally (legacy OpenCL)
- Searchable, which employ Generic Search
- Solutions
- PerformanceConfigs (paired with searchable Solvers)
2. Problem Description and Context
2.1. Problem Description for an operation, e.g. conv::ProblemDescription
This is an input for the Solver.
- This is an object which represents a set of parameters that shall unambiguously describe the specific library primitive. Example for convolutions:
struct ProblemDescription
{
int n_inputs = 0;
int in_height = 0;
int in_width = 0;
int kernel_size1 = 0;
int kernel_size0 = 0;
int n_outputs = 0;
...
struct Direction direction; // has members like IsBackwardData() etc
...
}
- If perf-db is used, the object must also provide "void Serialize(std::ostream& stream) const;". This is used to generate perf-db keys (a minimal sketch follows after the helper-members example below).
- Each unique problem shall have unique db-key.
- May contain other helper members in order to ease implementation of Solvers, for example:
struct ProblemDescription
{...
int GetBackwardPad0() const { return kernel_size0 - pad0 - 1; }
int GetBackwardPad1() const { return kernel_size1 - pad1 - 1; }
...}
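Below is a minimal sketch of how Serialize() could generate a perf-db key from fields like the ones above. It is illustrative only: the field order, the 'x' separator and the GetDbKey() helper are assumptions, not the actual MIOpen key format.

// Illustrative sketch only; not the actual MIOpen implementation or key format.
#include <ostream>
#include <sstream>
#include <string>

struct ProblemDescriptionSketch // hypothetical, mirrors a few fields from above
{
    int n_inputs     = 0;
    int in_height    = 0;
    int in_width     = 0;
    int kernel_size1 = 0;
    int kernel_size0 = 0;
    int n_outputs    = 0;

    void Serialize(std::ostream& stream) const
    {
        // Every field that distinguishes problems must participate,
        // so that each unique problem gets a unique db-key.
        stream << n_inputs << 'x' << in_height << 'x' << in_width << 'x'
               << kernel_size1 << 'x' << kernel_size0 << 'x' << n_outputs;
    }

    std::string GetDbKey() const // hypothetical helper
    {
        std::ostringstream ss;
        Serialize(ss);
        return ss.str();
    }
};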
2.2. ExecutionContext
TBD
2.3. Operation Context, e.g. ConvolutionContext
Inherits from ProblemDescription and ExecutionContext, so for example an instance of ConvolutionContext can be used as an instance of conv::ProblemDescription. More info TBD.
3. What is Solver
A Solver is an object which encapsulates the implementation of a specific primitive.
- It "knows" all the information required to properly use the kernel(s) it hides.
- Searchable (Tunable) Solvers work together with paired PerformanceConfig objects. In such a case the information about kernel(s) is distributed between Solver and its PerformanceConfig.
Member functions (see here for current prototypes):
bool IsApplicable(ConvolutionContext&);
- Takes the problem description (e.g. convolution parameters) and answers whether this Solver is able to provide a Solution.
- This must be fast, i.e. much faster than kernel execution time.
- [Informative] The typical execution time should be ~10 us.
If a Solver needs workspace:
size_t GetWorkspaceSize(const ConvolutionContext&) const
bool MayNeedWorkspace() const
- The latter is an optional method which is required for optimization of GWSS (GetWorkspaceSize).
Each Solver instance s can be used as a parameter to the GetSolverDbId(s) template function, which retrieves the string id of the Solver. There is a default implementation of GetSolverDbId() which returns the class name and can be overridden if necessary.
If a Solver is Dynamic:
bool IsDynamic() const { return true; }
If a Solver is NOT searchable (NOT tunable):
Solution GetSolution(ConvolutionContext&);
- Returns a Solution object.
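For orientation, here is a hedged sketch of what a non-searchable Solver might look like, using the ConvSolution and KernelInfo structures described in section 4 below. The class name, kernel file, build options and work sizes are invented for illustration and do not correspond to a real MIOpen solver.

// Hypothetical sketch; names and parameter choices are illustrative only.
#include <string> // for std::to_string

struct ConvExampleFwd
{
    bool IsApplicable(const ConvolutionContext& ctx) const
    {
        // Cheap checks only: this must run much faster than any kernel.
        return ctx.kernel_size0 == 3 && ctx.kernel_size1 == 3; // e.g. only 3x3 filters
    }

    bool MayNeedWorkspace() const { return false; } // this solver needs no scratch buffer

    ConvSolution GetSolution(const ConvolutionContext& ctx) const
    {
        ConvSolution solution;

        KernelInfo kernel;
        kernel.kernel_file  = "ExampleConvFwd3x3.cl"; // hypothetical source file
        kernel.kernel_name  = "ExampleConvFwd3x3";    // hypothetical entry point
        kernel.comp_options = " -DIN_H=" + std::to_string(ctx.in_height)
                            + " -DIN_W=" + std::to_string(ctx.in_width);
        kernel.l_wk = {64, 1, 1};                     // local work size (illustrative)
        kernel.g_wk = {static_cast<size_t>(ctx.n_outputs) * 64, 1, 1};

        solution.construction_params.push_back(kernel); // one kernel is enough here
        return solution;
    }
};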
If a Solver is searchable (tunable), then the accompanying PerformanceConfig type shall also be defined, plus some member functions:
PerformanceConfig GetPerformanceConfig(const ConvolutionContext&) const;
- Provides default performance parameters.
PerformanceConfig Search(const ConvolutionContext&) const;
- Performs auto-tune and returns optimal parameters.
- This function could be quite complicated.
Solution GetSolution(const ConvolutionContext&, const PerformanceConfig&);
- Returns a Solution object generated from the PerformanceConfig instance.
bool IsValidPerformanceConfig(const ConvolutionContext&, const PerformanceConfig&);
- Checks if a PerformanceConfig instance (e.g. read from the perf-db) is valid.
- "Valid" means that the Solution returned by GetSolution():
  - Should build without errors, and the resulting binary would be runnable.
  - The result of execution would be numerically correct.
- Checks performed by this function may be non-trivial due to the limited amount of LDS and, in case of asm kernels, SGPR/VGPR limitations etc.
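To show how these member functions are meant to be combined, here is a hedged sketch of a caller-side tuning flow. FindOrTune() and LoadFromPerfDb() are invented names; the real find/tune path in MIOpen is more involved.

// Hypothetical driver-side sketch; FindOrTune and LoadFromPerfDb are invented.
template <class TunableSolver>
ConvSolution FindOrTune(const TunableSolver& solver, const ConvolutionContext& ctx, bool do_search)
{
    using PerformanceConfig = decltype(solver.GetPerformanceConfig(ctx));

    PerformanceConfig config;
    if(LoadFromPerfDb(ctx, config) && solver.IsValidPerformanceConfig(ctx, config))
    {
        // A previously tuned config was found in the perf-db and is still valid; use it.
    }
    else if(do_search)
    {
        config = solver.Search(ctx); // expensive: runs auto-tuning
    }
    else
    {
        config = solver.GetPerformanceConfig(ctx); // fall back to the defaults
    }

    return solver.GetSolution(ctx, config);
}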
Generic Search
Modern Solvers employ Generic Search.
- Only legacy OpenCL kernels do not use it. These have a complicated Search().
- Generic Search allows implementation of Search() in Solvers as simple calls to the GenericSearch() template function.
- However, this requires the paired PerformanceConfig type to provide some member functions. These functions are used by GenericSearch() in order to build the ComputedContainer object and iterate over it.
- ~Right now the Solver must also define RunAndMeasureSolution(). This is to be removed as soon as the Invoker object concept is implemented.~
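With Generic Search in place, Search() in a modern Solver reduces to roughly a one-liner. The exact GenericSearch() signature and the ConvExampleTuned/PerformanceConfigExample names below are assumptions for illustration.

// Sketch only: the precise GenericSearch() signature may differ.
PerformanceConfigExample ConvExampleTuned::Search(const ConvolutionContext& ctx) const
{
    return GenericSearch(*this, ctx); // iterates the ComputedContainer of valid configs
}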
The PerformanceConfig of a modern searchable Solver type shall provide some functions. These are necessary to build the ComputedContainer instances. The following member functions are required for that:
(ctor)()
- Constructs an instance with an invalid value.
(ctor)(bool)
- Constructs an instance with the minimal, valid value.
SetNextValue(ConvolutionContext& c)
- Note: It was SetNextValue() before #1033.
- Advances the performance-config to the next available valid value and returns true. If the max value is reached (no more valid performance-configs left), returns false.
- IMPORTANT: a PerformanceConfig instance pc is valid if and only if IsValidPerformanceConfig(..., pc) returns true. This ensures that all the perf-configs which reside in the ComputedContainer are:
  - Buildable without errors.
  - Runnable on GPU and yield a correct result.
IsValid(ConvolutionContext& c) const
- Checks if the instance is valid for the given c.
operator==(const PerformanceConfig&)
- Ordinary semantics.
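Below is a hedged sketch of such a PerformanceConfig with a single tuning parameter (a tile size). The parameter, its valid values and the validity check are invented for illustration.

// Hypothetical PerformanceConfig sketch with one tuning parameter.
struct PerformanceConfigExample
{
    int tile_size = -1; // -1 marks the invalid value

    PerformanceConfigExample() = default;                     // (ctor)(): invalid value
    explicit PerformanceConfigExample(bool) : tile_size(8) {} // (ctor)(bool): minimal valid value

    // Advances to the next valid value; returns false when the space is exhausted.
    bool SetNextValue(ConvolutionContext&)
    {
        if(tile_size < 8)   { tile_size = 8;  return true; }
        if(tile_size == 8)  { tile_size = 16; return true; }
        if(tile_size == 16) { tile_size = 32; return true; }
        return false; // no more valid performance-configs
    }

    bool IsValid(ConvolutionContext& c) const
    {
        // Must guarantee that the resulting kernel builds, runs and is correct;
        // only a trivial illustrative check is shown here.
        return tile_size > 0 && tile_size <= c.in_width;
    }

    bool operator==(const PerformanceConfigExample& other) const
    {
        return tile_size == other.tile_size;
    }
};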
⚠️ IMPORTANT:
- A tunable Solver is allowed to have 0 performance configs available (empty primary and spare ComputedContainers), but in this case it must be able to provide a valid default performance config (the one returned by GetPerformanceConfig(const ConvolutionContext&)).
- However, it is highly recommended that a tunable Solver provides 2 or more performance configs.
  - Why: If there is only 1 performance config, then there is no reason to spend resources on tuning and saving/loading its results in the perf-db, because db accesses are not free.
  - Therefore, if the applicability scope of some Solver contains many problems that have only 1 performance config, then it is highly recommended to split the Solver into two: one tunable and one non-tunable.
Serialization/de-serialization of PerformanceConfig instances
All PerformanceConfig types shall implement the following member functions:
void Serialize(std::ostream&) const;
- Converts the internal state of an instance ("value") to text.
bool Deserialize(const std::string& str)
- Reads text and converts it to the internal state ("value") of an instance.
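Continuing the hypothetical single-parameter config sketched above (and assuming Serialize()/Deserialize() are declared in it), the round-trip could be as simple as:

// Illustrative sketch; a real config would serialize several fields, e.g. separated by ','.
#include <sstream>
#include <string>

void PerformanceConfigExample::Serialize(std::ostream& stream) const
{
    stream << tile_size;
}

bool PerformanceConfigExample::Deserialize(const std::string& str)
{
    std::istringstream ss(str);
    int value = 0;
    if(!(ss >> value))
        return false; // malformed perf-db record, keep the current state
    tile_size = value;
    return true;
}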
4. What is Solution
Information required to build and run a kernel (or a set of kernels), which is expected to perform computations as per the ProblemDescription.
- Currently it best suits a subset of existing solvers, namely some OpenCL-written forward direct convolutions. It shall be refactored (possibly into a class hierarchy).
struct ConvSolution
{
std::vector<KernelInfo> construction_params; // impl may consist of multiple kernels.
miopenStatus_t status;
std::string solver_id;
...
}
As you can see, it contains a vector of KernelInfo objects.
- Each object describes a kernel source and whatever information is required in order to build and run it (the build-related part is unused for binary kernels). A sketch of how a Solution could be consumed follows after the struct below.
struct KernelInfo
{
std::string comp_options;
std::vector<size_t> l_wk;
std::vector<size_t> g_wk;
std::string kernel_file;
std::string kernel_name;
friend std::ostream& operator<<(std::ostream& os, const KernelInfo& k);
};
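As a hedged illustration of how the host side could consume a ConvSolution: each KernelInfo carries everything needed to compile and enqueue one kernel. BuildProgram() and EnqueueKernel() below are invented placeholders, not MIOpen APIs.

// Hypothetical consumption sketch; BuildProgram/EnqueueKernel are invented placeholders.
void RunSolution(const ConvSolution& solution /*, buffers, queue, ... */)
{
    for(const KernelInfo& k : solution.construction_params)
    {
        // Compile the kernel source with its per-solution #defines ...
        auto program = BuildProgram(k.kernel_file, k.comp_options);
        // ... and launch it with the global/local work sizes chosen by the Solver.
        EnqueueKernel(program, k.kernel_name, k.g_wk, k.l_wk /*, kernel arguments */);
    }
}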
5. Perf-db support
- SolverDbId(solver)
- Serialization and de-serialization of PerformanceConfigs
6. Future directions
- Extend usage of Solvers to primitives other than plain Direct and Winograd convolutions
- GEMM, FFT convolution algorithms (done)
- Normalization, Pooling, Activation
- Fused convolutions
- Invoker objects (done)
- Allows for simplification of convolutionocl.cpp.
- A second effect is the removal of RunAndMeasureSolution() from Solvers.
- More info at #216
7. Support for convolutions with non-packed tensors
Currently we are not going to include strides of non-packed tensors in the database keys. Only an optional flag (saying that at least one tensor is non-packed) should be included there. This means that non-packed convolutions that differ only in strides will share the same find-db records, the same Invoker instances and the same perf-db information.
The above design should work correctly provided that:
- If an Invoker instance is able to compute some non-packed convolution, then the same instance must be able to compute any similar non-packed convolution that differs only in strides.
  - [Consequence 1] An Invoker that is used for computation of non-packed convolutions must read stride information from InvokeParams and pass it to the kernels as arguments (see the sketch after this list).
  - [Consequence 2] Stride information should not be used for building any compile-time parameters.
- If a kernel that is used to compute a non-packed convolution requires tuning, then the same tuning parameters must provide similar performance for non-packed convolutions that differ only in strides.
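As a hedged illustration of the two consequences above, strides travel through InvokeParams to the kernel arguments instead of becoming compile-time #defines. The ExampleInvoker/ExampleInvokeParams/RunKernel names and fields are invented.

// Hypothetical sketch of stride handling for non-packed tensors.
//
// Wrong (violates Consequence 2): baking strides into build options ties the
// compiled kernel to one specific stride combination:
//   kernel.comp_options += " -DIN_STRIDE_H=" + std::to_string(in_stride_h);
//
// Right (Consequence 1): read strides from the invoke-time parameters and pass
// them as plain kernel arguments, so one Invoker serves all stride variants.
void ExampleInvoker(const ExampleInvokeParams& params) // invented types/names
{
    const auto in_stride_h = params.in_stride_h; // taken from InvokeParams at run time
    const auto in_stride_w = params.in_stride_w;
    RunKernel(/* ..., */ in_stride_h, in_stride_w); // invented placeholder call
}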
Originated from https://github.com/ROCmSoftwarePlatform/MIOpen/pull/2334#discussion_r1348090910