chapel
chapel copied to clipboard
BLAS: address banded matrix storage (gbmv, tbmv, etc.)
This PR attempts to address complexities with the BLAS banded matrix routine wrappers (general banded matrix-vector gbmv
, triangular banded tbmv
).
Context
gbmv
(resp. tbmv
) computes the matrix-vector product Ax = y
, where A
is a banded matrix (resp. triangular) of shape m,n
, x
is a vector of length n
, y
is vector of length m
.
General banded storage
A banded matrix A
of shape m,n
, with ku
upper bands and kl
lower bands can be represented in general as (adapted from OneAPI spec docs):
\begin{split}A = \left[\begin{smallmatrix}
A_{11} & A_{12} & A_{13} & \ldots & A_{1,ku+1} & * & \ldots & \ldots & \ldots & \ldots & \ldots & * \\
A_{21} & A_{22} & A_{23} & A_{24} & \ldots & A_{2,ku+2} & * & \ldots & \ldots & \ldots & \ldots & * \\
A_{31} & A_{32} & A_{33} & A_{34} & A_{35} & \ldots & A_{3,ku+3} & * & \ldots & \ldots & \ldots & * \\
\vdots & A_{42} & A_{43} & \ddots & \ddots & \ddots & \ddots & \ddots & * & \ldots & \ldots & \vdots \\
A_{kl+1,1} & \vdots & A_{53} & \ddots & \ddots & \ddots & \ddots & \ddots & \ddots & * & \ldots & \vdots \\
* & A_{kl+2,2} & \vdots & \ddots & \ddots & \ddots & \ddots & \ddots & \ddots & \ddots & \ddots & \vdots \\
\vdots & * & A_{kl+3,3} & \ddots & \ddots & \ddots & \ddots & \ddots & \ddots & \ddots & \ddots & * \\
\vdots & \vdots & * & \ddots & \ddots & \ddots & \ddots & \ddots & \ddots & \ddots & \ddots & A_{n-ku,n}\\
\vdots & \vdots & \vdots & * & \ddots & \ddots & \ddots & \ddots & \ddots & \ddots & \ddots & \vdots \\
\vdots & \vdots & \vdots & \vdots & * & \ddots & \ddots & \ddots & \ddots & \ddots & \ddots & A_{m-2,n} \\
\vdots & \vdots & \vdots & \vdots & \vdots & \ddots & \ddots & \ddots & \ddots & \ddots & \ddots & A_{m-1,n} \\
* & * & * & \ldots & \ldots & \ldots & * & A_{m,m-kl} & \ldots & A_{m,n-2} & A_{m,n-1} & A_{m,n}
\end{smallmatrix}\right]\end{split}
In banded storage, the matrix A
is stored in a contiguous array as (see the documentation for ?gbmv
routines):
- In the case
order == CblasRowMajor
(adapted from OneAPI spec docs):
\scriptscriptstyle a =
[\underbrace{
\underbrace{\underbrace{*,...,*}_\text{kl},A_{11}, A_{12},...,A_{1,min(ku+1,n)},*,...,*}_\text{lda},
\underbrace{\underbrace{*,...,*}_\text{kl-1},A_{2,max(1,2-kl)},...,A_{2,min(ku+2,n)},*,...*}_\text{lda},
...,
\underbrace{\underbrace{*,...,*}_\text{max(0,kl-m+1)},A_{m,max(1,m-kl)},...,A_{m,min(ku+m,n)},*,...*}_\text{lda}
}_\text{lda $\times$ m}]
- In the case
order == CblasColMajor
(adapted from OneAPI spec docs):
\scriptscriptstyle a =
[\underbrace{
\underbrace{\underbrace{*,...,*}_\text{ku},A_{11}, A_{12},...,A_{min(kl+1,m),1},*,...,*}_\text{lda},
\underbrace{\underbrace{*,...,*}_\text{ku-1},A_{max(1,2-ku),2},...,A_{min(kl+2,m),2},*,...*}_\text{lda},
...,
\underbrace{\underbrace{*,...,*}_\text{max(0,ku-n+1)},A_{max(1,n-ku),n},...,A_{min(kl+n,m),n},*,...*}_\text{lda}
}_\text{lda $\times$ n}]
Elements *
are not read by the ?gbmv
routines, and may be whatever (e.g. set to zero).
Example
For example, consider the matrix A
with m=9
, n=8
, ku=3
and kl=2
(adapted from: IBM docs):
| 11 12 13 14 0 0 0 0 |
| 21 22 23 24 25 0 0 0 |
| 31 32 33 34 35 36 0 0 |
| 0 42 43 44 45 46 47 0 |
A = | 0 0 53 54 55 56 57 58 |
| 0 0 0 64 65 66 67 68 |
| 0 0 0 0 75 76 77 78 |
| 0 0 0 0 0 86 87 88 |
| 0 0 0 0 0 0 97 98 |
Column-major layout
In the column-major layout, the matrix A
is stored as:
* * * 11 21 31 * * 12 22 32 42 * 13 .. 97 58 68 78 88 98 *
| AB(.., 1) | AB(.., 2) | | AB(.., n) |
<- LDA elements ->
<------------------------------- n x LDA elements -------------------------------->
The column-major layout can be interpreted as a matrix AB_col
of shape LDA,n
(with LDA=kl+ku+1
) in a column-major language (e.g. Fortran), where the bands are arranged row-wise, from the upper bands to the lower bands.
| * * * 14 25 36 47 58 |
| * * 13 24 35 46 57 68 |
AB_col = | * 12 23 34 45 56 67 78 |
| 11 22 33 44 55 66 77 88 | <- main diagonal
| 21 32 43 54 65 76 87 98 |
| 31 42 53 64 75 86 97 * |
The column-major layout interpreted in a row-major language (e.g. C or Chapel) is interpreted as the transpose, where the matrix AB_col^t
has shape n,LDA
:
| * * * 11 21 31 |
| * * 12 22 32 42 |
| * 13 23 33 43 53 |
AB_col^t = | 14 24 34 44 54 64 |
| 25 35 45 55 65 75 |
| 36 46 56 66 76 86 |
| 47 57 67 77 87 97 |
| 58 68 78 88 98 * |
Row-major layout
In row-major layout, the matrix A
is stored as:
* * 11 12 13 14 * 21 22 .. * * 97 98 * * * *
| AB(1, ..) | | AB(m, ..) |
<- LDA elements ->
<--------------------------- m x LDA elements -------------------->
The row-major layout can be interpreted as a matrix AB_row
of shape m,LDA
(with LDA=kl+ku+1
) in a row-major language, where the bands are arranged column-wise, from the lower bands to the upper bands:
main diagonal
v
| * * 11 12 13 14 |
| * 21 22 23 24 25 |
| 31 32 33 34 35 36 |
| 42 43 44 45 46 47 |
AB_row = | 53 54 55 56 57 58 |
| 64 65 66 67 68 * |
| 75 76 77 78 * * |
| 86 87 88 * * * |
| 97 98 * * * * |
The row-major layout is interpreted in a column-major language as the tranpose AB_row^t
of shape LDA,m
.
BLAS routine wrappers
In modules/packages/BLAS.chpl
, the procedure gbmv
wraps the usage of cblas_?gbmv
by providing automatic detection of the array shapes.
Detection of LDA
is done with:
var _ldA = getLeadingDim(A, order);
// ...
inline proc getLeadingDim(A: [?Adom], order : Order) : c_int {
require header;
if order==Order.Row then
return Adom.dim(1).size : c_int;
else
return Adom.dim(0).size : c_int;
}
Due to the signature of the gbmv
wrapper:
proc gbmv(A : [?Adom] ?eltType,
X : [?Xdom] eltType, Y : [?Ydom] eltType,
alpha: eltType, beta: eltType,
kl : int = 0, ku : int = 0,
trans : Op = Op.N,
order : Order = Order.Row, incx : c_int = 1, incy : c_int = 1)
where (Adom.rank == 2) && (Xdom.rank==1) && (Ydom.rank == 1)
a user is forced to pass a matrix (Adom.rank == 2
) as A
, which means they pass:
-
AB_col^t
of shapen,LDA
iforder == Order.Col
-
AB_row
of shapem,LDA
iforder == Order.Row
Proposed solution
This means that LDA
is always on the last dimension:
var _ldA = Adom.dim(1).size : c_int;
The counterpoint to this solution is that semantically LDA
is no longer the "leading dimension" (number of rows), but here the number of columns. Although one can argue that the "leading dimension" in a column-major language is the number of rows, and in a row-major language the number of columns.
Alternative solutions
-
In order to be consistent with the "math" point of view (col-layout interpreted by col-language, row-layout interpreted by row-language), the user passes
AB_col
withorder == Order.Col
andAB_row
withorder == Order.Row
. In this case_ldA = getLeadingDim(A, order)
is correct, but the matrixAB_col
needs to be transposed before thecblas_?gbmv
call so that it has the correct layout in memory, which might be expensive. -
Force the user to pass a one-dimensional array to
gbmv
. The user is responsible for doing the index computations. In this case,LDA
should also be provided. Though this defies the purpose of having a wrapper in the first place.
Code demo
With the proposed changes:
// file: bandedformat.chpl
use Random only fillRandom;
import LinearAlgebra, BLAS;
config const n = 6, m = 7, kl = 2, ku = 3;
config const seed = 42;
config const epsilon = 1e-13;
config const verbose = true;
var A: [1..m, 1..n] real;
fillRandom(A, seed);
A = floor(9*A + 1);
makeBanded(A, kl, ku);
if verbose then writeln("A=\n", A);
var x: [1..n] real;
fillRandom(x, seed+1);
x = floor(9*x + 1);
if verbose then writeln("x=\n", x);
var y: [1..m] real = LinearAlgebra.dot(A, x);
if verbose then writeln("y=\n", y);
for param order in (BLAS.Order.Row..BLAS.Order.Col) {
writeln("-- order = ", order);
var AB = bandedStorage(A, kl, ku, order);
if verbose then writeln("AB_", order, if order == BLAS.Order.Col then "^t" else "", "=\n", AB);
var y_: [1..m] real;
BLAS.gbmv(AB, x, y_, 1.0, 0.0, kl, ku, BLAS.Op.N, order);
if verbose then writeln("y_=\n", y_);
writeln("correct? ", && reduce [d in abs(y - y_)] d < epsilon);
}
proc makeBanded(ref A: [?D] ?T, kl: int, ku: int) {
const m = D.dim(0).size;
for (i,j) in D do A(i,j) = if (max(1,j-ku) <= i && i <= min(m, j+kl)) then A(i,j) else 0.0: T;
}
proc bandedStorage(A: [?D] ?T, kl: int, ku: int, param order: BLAS.Order)
where order == BLAS.Order.Row {
const m = D.dim(0).size, n = D.dim(1).size, lda = kl+ku+1;
var AB: [1..m, 1..lda] T;
// adapted from: https://www.intel.com/content/www/us/en/docs/onemkl/developer-reference-c/2024-1/cblas-gbmv.html
for i in 1..m {
var k = kl + 1 - i;
for j in max(1, i-kl)..min(n, i+ku) do AB(i, k+j) = A(i, j);
}
return AB;
}
proc bandedStorage(A: [?D] ?T, kl: int, ku: int, param order: BLAS.Order)
where order == BLAS.Order.Col {
const m = D.dim(0).size, n = D.dim(1).size, lda = kl+ku+1;
var AB: [1..n, 1..lda] T;
// adapted from: https://www.intel.com/content/www/us/en/docs/onemkl/developer-reference-c/2024-1/cblas-gbmv.html
for j in 1..n {
var k = ku + 1 - j;
for i in max(1, j-ku)..min(m, j+kl) do AB(j, k+i) = A(i, j);
}
return AB;
}
Output (compiled with -lopenblas
):
./bandedformat --n 6 --m 7 --kl 2 --ku 3
A=
3.0 9.0 4.0 9.0 0.0 0.0
5.0 5.0 1.0 3.0 7.0 0.0
1.0 3.0 2.0 7.0 7.0 7.0
0.0 9.0 9.0 1.0 3.0 2.0
0.0 0.0 6.0 7.0 6.0 9.0
0.0 0.0 0.0 2.0 2.0 9.0
0.0 0.0 0.0 0.0 2.0 5.0
x=
9.0 2.0 6.0 4.0 2.0 5.0
y=
105.0 87.0 104.0 92.0 121.0 57.0 29.0
-- order = Row
AB_Row=
0.0 0.0 3.0 9.0 4.0 9.0
0.0 5.0 5.0 1.0 3.0 7.0
1.0 3.0 2.0 7.0 7.0 7.0
9.0 9.0 1.0 3.0 2.0 0.0
6.0 7.0 6.0 9.0 0.0 0.0
2.0 2.0 9.0 0.0 0.0 0.0
2.0 5.0 0.0 0.0 0.0 0.0
m=7 n=6 kl=2 ku=3 ldA=6 incx=1 incy=1
y_=
105.0 87.0 104.0 92.0 121.0 57.0 29.0
correct? true
-- order = Col
AB_Col^t=
0.0 0.0 0.0 3.0 5.0 1.0
0.0 0.0 9.0 5.0 3.0 9.0
0.0 4.0 1.0 2.0 9.0 6.0
9.0 3.0 7.0 1.0 7.0 2.0
7.0 7.0 3.0 6.0 2.0 2.0
7.0 2.0 9.0 9.0 5.0 0.0
m=7 n=6 kl=2 ku=3 ldA=6 incx=1 incy=1
y_=
105.0 87.0 104.0 92.0 121.0 57.0 29.0
correct? true
Next steps
- discuss which solution is more appropriate
- implement the same changes to triangular banded storage
- implements tests for banded routines in
test/library/BLAS/test_blas2.chpl
- update documentation