Solver/Solution framework
This is a copy of a presentation for the MIOpen team that I gave a couple of years ago, when we introduced and implemented the Solver/Solution architecture. It does not cover recent additions such as GetWti() and Invokers. I would like to make it available to all MIOpen developers, including collaborators.
Though a bit outdated, this should provide a good overview of how device code is abstracted away from the rest of the library.
1. Intent
Problem: Variety of convolution kernels
- Different formats (OCL, asm, binary)
- Capable of handling different (but often intersecting) subsets of convolution configs
- Some may benefit from auto-tuning, some not.
- Each kernel may require its own set of #defines for build
- There are cases when two (or more) kernels need to be run to do the job
- Kernels need different parameters during invocation
Experience shows that straightforward attempts to support such a set of kernels result in host code which is large, fragile, and difficult to develop and maintain. You may see leftovers of this in convolutionocl.cpp.
Goal: Provide abstractions which are able to represent, in a single place, all the information required to
- Select kernel(s) which can do the required job
- Build these kernel(s)
- Run kernels
- Auto-tune kernel(s)
Such abstractions allow working with all convolutions in a unified manner. Currently, there are:
- ProblemDescriptions
- Solvers
- Non-searchable
- Searchable, which implement auto-tune internally (legacy OpenCL)
- Searchable, which employ Generic Search
- Solutions
- PerformanceConfigs (paired with searchable Solvers)
2. Problem Description and Context
2.1. Problem Description for an operation, e.g. conv::ProblemDescription
This is an input for the Solver.
- This is an object which represents a set of parameters that shall unambiguously describe the specific library primitive. Example for convolutions:
struct ProblemDescription
{
int n_inputs = 0;
int in_height = 0;
int in_width = 0;
int kernel_size1 = 0;
int kernel_size0 = 0;
int n_outputs = 0;
...
struct Direction direction; // has members like IsBackwardData() etc
...
}
- If perf-db is used, the object must also provide "void Serialize(std::ostream& stream) const;". This is used to generate perf-db keys (a minimal sketch follows after the helper-members example below).
- Each unique problem shall have unique db-key.
- May contain other helper members in order to ease implementation of Solvers, for example:
struct ProblemDescription
{...
int GetBackwardPad0() const { return kernel_size0 - pad0 - 1; }
int GetBackwardPad1() const { return kernel_size1 - pad1 - 1; }
...}
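Below is a minimal sketch of how Serialize() could generate a perf-db key from fields like the ones above. It is illustrative only: the field order, the 'x' separator and the GetDbKey() helper are assumptions, not the actual MIOpen key format.

// Illustrative sketch only; not the actual MIOpen implementation or key format.
#include <ostream>
#include <sstream>
#include <string>

struct ProblemDescriptionSketch // hypothetical, mirrors a few fields from above
{
    int n_inputs     = 0;
    int in_height    = 0;
    int in_width     = 0;
    int kernel_size1 = 0;
    int kernel_size0 = 0;
    int n_outputs    = 0;

    void Serialize(std::ostream& stream) const
    {
        // Every field that distinguishes problems must participate,
        // so that each unique problem gets a unique db-key.
        stream << n_inputs << 'x' << in_height << 'x' << in_width << 'x'
               << kernel_size1 << 'x' << kernel_size0 << 'x' << n_outputs;
    }

    std::string GetDbKey() const // hypothetical helper
    {
        std::ostringstream ss;
        Serialize(ss);
        return ss.str();
    }
};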
2.2. ExecutionContext
TBD
2.3. Operation Context, e.g. ConvolutionContext
Inherits from ProblemDescription and ExecutionContext, so for example an instance of ConvolutionContext can be used as an instance of conv::ProblemDescription. More info TBD.
3. What is Solver
A Solver is an object which encapsulates the implementation of a specific primitive.
- It "knows" all the information required to properly use the kernel(s) it hides.
- Searchable (Tunable) Solvers work together with paired PerformanceConfig objects. In such a case the information about kernel(s) is distributed between Solver and its PerformanceConfig.
Member functions (see here for current prototypes):
bool IsApplicable(ConvolutionContext&);
- Takes the problem description (e.g. convolution parameters) and answers whether this Solver is able to provide a Solution.
- This must be fast, i.e. much faster than kernel execution time.
- [Informative] The typical execution time should be ~10 us.
If a Solver needs workspace:
size_t GetWorkspaceSize(const ConvolutionContext&) const
bool MayNeedWorkspace() const
- The latter is an optional method which is required for optimization of GWSS (GetWorkspaceSize).
Each Solver instance s can be used as a parameter to the GetSolverDbId(s) template function, which retrieves the string id of the Solver. There is a default implementation of GetSolverDbId() which returns the class name and can be overridden if necessary.
If a Solver is Dynamic:
bool IsDynamic() const { return true; }
If a Solver is NOT searchable (NOT tunable):
Solution GetSolution(ConvolutionContext&);
- Returns a Solution object.
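For orientation, here is a hedged sketch of what a non-searchable Solver might look like, using the ConvSolution and KernelInfo structures described in section 4 below. The class name, kernel file, build options and work sizes are invented for illustration and do not correspond to a real MIOpen solver.

// Hypothetical sketch; names and parameter choices are illustrative only.
#include <string> // for std::to_string

struct ConvExampleFwd
{
    bool IsApplicable(const ConvolutionContext& ctx) const
    {
        // Cheap checks only: this must run much faster than any kernel.
        return ctx.kernel_size0 == 3 && ctx.kernel_size1 == 3; // e.g. only 3x3 filters
    }

    bool MayNeedWorkspace() const { return false; } // this solver needs no scratch buffer

    ConvSolution GetSolution(const ConvolutionContext& ctx) const
    {
        ConvSolution solution;

        KernelInfo kernel;
        kernel.kernel_file  = "ExampleConvFwd3x3.cl"; // hypothetical source file
        kernel.kernel_name  = "ExampleConvFwd3x3";    // hypothetical entry point
        kernel.comp_options = " -DIN_H=" + std::to_string(ctx.in_height)
                            + " -DIN_W=" + std::to_string(ctx.in_width);
        kernel.l_wk = {64, 1, 1};                     // local work size (illustrative)
        kernel.g_wk = {static_cast<size_t>(ctx.n_outputs) * 64, 1, 1};

        solution.construction_params.push_back(kernel); // one kernel is enough here
        return solution;
    }
};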
If a Solver is searchable (tunable), then the accompanying PerformanceConfig type shall also be defined, plus some member functions:
PerformanceConfig GetPerformanceConfig(const ConvolutionContext&) const;
- Provides default performance parameters.
PerformanceConfig Search(const ConvolutionContext&) const;
- Performs auto-tune and returns optimal parameters.
- This function could be quite complicated.
Solution GetSolution(const ConvolutionContext&, const PerformanceConfig&);
- Returns a Solution object generated from the PerformanceConfig instance.
bool IsValidPerformanceConfig(const ConvolutionContext&, const PerformanceConfig&);
- Checks if a PerformanceConfig instance (e.g. read from the perf-db) is valid.
- "Valid" means that the Solution returned by GetSolution():
  - Should build without errors, and the resulting binary would be runnable.
  - The result of execution would be numerically correct.
- Checks performed by this function may be non-trivial due to the limited amount of LDS and, in case of asm kernels, SGPR/VGPR limitations etc.
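To show how these member functions are meant to be combined, here is a hedged sketch of a caller-side tuning flow. FindOrTune() and LoadFromPerfDb() are invented names; the real find/tune path in MIOpen is more involved.

// Hypothetical driver-side sketch; FindOrTune and LoadFromPerfDb are invented.
template <class TunableSolver>
ConvSolution FindOrTune(const TunableSolver& solver, const ConvolutionContext& ctx, bool do_search)
{
    using PerformanceConfig = decltype(solver.GetPerformanceConfig(ctx));

    PerformanceConfig config;
    if(LoadFromPerfDb(ctx, config) && solver.IsValidPerformanceConfig(ctx, config))
    {
        // A previously tuned config was found in the perf-db and is still valid; use it.
    }
    else if(do_search)
    {
        config = solver.Search(ctx); // expensive: runs auto-tuning
    }
    else
    {
        config = solver.GetPerformanceConfig(ctx); // fall back to the defaults
    }

    return solver.GetSolution(ctx, config);
}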
Generic Search
Modern Solvers employ Generic Search.
- Only legacy OpenCL kernels do not use it. These have a complicated Search().
- Generic Search allows implementation of Search() in Solvers as simple calls to the GenericSearch() template function.
- However, this requires the paired PerformanceConfig type to provide some member functions. These functions are used by GenericSearch() in order to build the ComputedContainer object and iterate over it.
- ~Right now the Solver must also define RunAndMeasureSolution(). This is to be removed as soon as the Invoker object concept is implemented.~
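With Generic Search in place, Search() in a modern Solver reduces to roughly a one-liner. The exact GenericSearch() signature and the ConvExampleTuned/PerformanceConfigExample names below are assumptions for illustration.

// Sketch only: the precise GenericSearch() signature may differ.
PerformanceConfigExample ConvExampleTuned::Search(const ConvolutionContext& ctx) const
{
    return GenericSearch(*this, ctx); // iterates the ComputedContainer of valid configs
}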
The PerformanceConfig of a modern searchable Solver type shall provide some functions. These are necessary to build the ComputedContainer instances. The following member functions are required for that:
(ctor)()
- Constructs an instance with an invalid value.
(ctor)(bool)
- Constructs an instance with the minimal, valid value.
SetNextValue(ConvolutionContext& c)
- Note: It was SetNextValue() before #1033.
- Advances the performance-config to the next available valid value and returns true. If the max value is reached (no more valid performance-configs left), returns false.
- IMPORTANT: a PerformanceConfig instance pc is valid if and only if IsValidPerformanceConfig(..., pc) returns true. This ensures that all the perf-configs which reside in the ComputedContainer are:
  - Buildable without errors.
  - Runnable on GPU and yield a correct result.
IsValid(ConvolutionContext& c) const
- Checks if the instance is valid for the given c.
operator==(const PerformanceConfig&)
- Ordinary semantics.
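Below is a hedged sketch of such a PerformanceConfig with a single tuning parameter (a tile size). The parameter, its valid values and the validity check are invented for illustration.

// Hypothetical PerformanceConfig sketch with one tuning parameter.
struct PerformanceConfigExample
{
    int tile_size = -1; // -1 marks the invalid value

    PerformanceConfigExample() = default;                     // (ctor)(): invalid value
    explicit PerformanceConfigExample(bool) : tile_size(8) {} // (ctor)(bool): minimal valid value

    // Advances to the next valid value; returns false when the space is exhausted.
    bool SetNextValue(ConvolutionContext&)
    {
        if(tile_size < 8)   { tile_size = 8;  return true; }
        if(tile_size == 8)  { tile_size = 16; return true; }
        if(tile_size == 16) { tile_size = 32; return true; }
        return false; // no more valid performance-configs
    }

    bool IsValid(ConvolutionContext& c) const
    {
        // Must guarantee that the resulting kernel builds, runs and is correct;
        // only a trivial illustrative check is shown here.
        return tile_size > 0 && tile_size <= c.in_width;
    }

    bool operator==(const PerformanceConfigExample& other) const
    {
        return tile_size == other.tile_size;
    }
};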
⚠️ IMPORTANT:
- A tunable Solver is allowed to have 0 performance configs available (empty primary and spare ComputedContainers), but in this case it must be able to provide a valid default performance config (the one returned by GetPerformanceConfig(const ConvolutionContext&)).
- However, it is highly recommended that a tunable Solver provides 2 or more performance configs.
  - Why: If there is only 1 performance config, then there is no reason to spend resources on tuning and saving/loading its results in the perf-db, because db accesses are not free.
  - Therefore, if the applicability scope of some Solver contains many problems that have only 1 performance config, then it is highly recommended to split the Solver into two: one tunable and one non-tunable.
Serialization/de-serialization of PerformanceConfig instances
All PerformanceConfig types shall implement the following member functions:
void Serialize(std::ostream&) const;
- Converts the internal state of an instance ("value") to text.
bool Deserialize(const std::string& str)
- Reads text and converts it to the internal state ("value") of an instance.
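Continuing the hypothetical single-parameter config sketched above (and assuming Serialize()/Deserialize() are declared in it), the round-trip could be as simple as:

// Illustrative sketch; a real config would serialize several fields, e.g. separated by ','.
#include <sstream>
#include <string>

void PerformanceConfigExample::Serialize(std::ostream& stream) const
{
    stream << tile_size;
}

bool PerformanceConfigExample::Deserialize(const std::string& str)
{
    std::istringstream ss(str);
    int value = 0;
    if(!(ss >> value))
        return false; // malformed perf-db record, keep the current state
    tile_size = value;
    return true;
}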
4. What is Solution
Information required to build and run a kernel (or a set of kernels), which is expected to perform computations as per the ProblemDescription.
- Currently it best suits a subset of existing solvers, namely some OpenCL-written forward direct convolutions. It shall be refactored (possibly into a class hierarchy).
struct ConvSolution
{
std::vector<KernelInfo> construction_params; // impl may consist of multiple kernels.
miopenStatus_t status;
std::string solver_id;
...
}
As you can see, it contains a vector of KernelInfo objects.
- Each object describes a kernel source and whatever information is required in order to build and run it (the build-related part is unused for binary kernels). A sketch of how a Solution could be consumed follows after the struct below.
struct KernelInfo
{
std::string comp_options;
std::vector<size_t> l_wk;
std::vector<size_t> g_wk;
std::string kernel_file;
std::string kernel_name;
friend std::ostream& operator<<(std::ostream& os, const KernelInfo& k);
};
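As a hedged illustration of how the host side could consume a ConvSolution: each KernelInfo carries everything needed to compile and enqueue one kernel. BuildProgram() and EnqueueKernel() below are invented placeholders, not MIOpen APIs.

// Hypothetical consumption sketch; BuildProgram/EnqueueKernel are invented placeholders.
void RunSolution(const ConvSolution& solution /*, buffers, queue, ... */)
{
    for(const KernelInfo& k : solution.construction_params)
    {
        // Compile the kernel source with its per-solution #defines ...
        auto program = BuildProgram(k.kernel_file, k.comp_options);
        // ... and launch it with the global/local work sizes chosen by the Solver.
        EnqueueKernel(program, k.kernel_name, k.g_wk, k.l_wk /*, kernel arguments */);
    }
}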
5. Perf-db support
- SolverDbId(solver)
- Serialization and de-serialization of PerformanceConfigs
6. Future directions
- Extend usage of Solvers to primitives other than plain Direct and Winograd convolutions
- GEMM, FFT convolution algorithms (done)
- Normalization, Pooling, Activation
- Fused convolutions
- Invoker objects (done)
- Allows for simplification of convolutionocl.cpp.
- A second effect is the removal of RunAndMeasureSolution() from Solvers.
- More info at #216
7. Support for convolutions with non-packed tensors
Currently we are not going to include strides of non-packed tensors in the database keys. Only an optional flag (saying that at least one tensor is non-packed) should be included there. This means that non-packed convolutions that differ only in strides will share the same find-db records, the same Invoker instances and the same perf-db information.
The above design should work correctly provided that:
- If an Invoker instance is able to compute some non-packed convolution, then the same instance must be able to compute any similar non-packed convolution that differs only in strides.
  - [Consequence 1] An Invoker that is used for computation of non-packed convolutions must read stride information from InvokeParams and pass it to the kernels as arguments (see the sketch after this list).
  - [Consequence 2] Stride information should not be used for building any compile-time parameters.
- If a kernel that is used to compute a non-packed convolution requires tuning, then the same tuning parameters must provide similar performance for non-packed convolutions that differ only in strides.
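As a hedged illustration of the two consequences above, strides travel through InvokeParams to the kernel arguments instead of becoming compile-time #defines. The ExampleInvoker/ExampleInvokeParams/RunKernel names and fields are invented.

// Hypothetical sketch of stride handling for non-packed tensors.
//
// Wrong (violates Consequence 2): baking strides into build options ties the
// compiled kernel to one specific stride combination:
//   kernel.comp_options += " -DIN_STRIDE_H=" + std::to_string(in_stride_h);
//
// Right (Consequence 1): read strides from the invoke-time parameters and pass
// them as plain kernel arguments, so one Invoker serves all stride variants.
void ExampleInvoker(const ExampleInvokeParams& params) // invented types/names
{
    const auto in_stride_h = params.in_stride_h; // taken from InvokeParams at run time
    const auto in_stride_w = params.in_stride_w;
    RunKernel(/* ..., */ in_stride_h, in_stride_w); // invented placeholder call
}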
Originated from https://github.com/ROCmSoftwarePlatform/MIOpen/pull/2334#discussion_r1348090910