Loss of precision in all single precision workspace queries, implicit and explicit
Hello everybody,
recently I stumbled across an error when using ssyevd via the lapacke interface. It points to a problem in the lapack interface in general. It goes like this:
According to the standard lapack interface, many routines like ssyevd have to be called twice: once to ask the routine how much scratch memory it needs for a given matrix size, and only then in earnest, with the required scratch memory passed in as parameters.
If you look closely at this first call, which should return the required memory size, e.g. for a routine like ssyevd, you see that even according to the lapack documentation, the memory requirement is passed back via a pointer to a float value. So when calculating the memory, the value goes through a series of representations:
calculate memory (int64) -> store value in reference (float) -> retrieve the value for use/allocation (int64)
The integer is int64 in the case of an ilp64 interface, otherwise it would be int32. So in essence we have an intermediary shortening of the memory value from 63 bits to 24 bits (or, more precisely, to 24 bits between the outermost set bits in the IEEE 754 float representation). Even in the case of 32-bit integers, you have a shortening from 31 bits to 24 bits.
So, if you call the memory requirement calculation 'by hand', as in the past, you might have a chance to see where it goes wrong, but you still cannot prevent the truncation to a float value. If you use the automatic memory allocation via the modern lapacke interface, you don't even have an idea what could be wrong, as the routine advertises that it takes care of all memory management by itself!
This happens in all single precision routines (s/c) in lapack that calculate the memory requirement as an intermediary step. Switching to double precision uses a double precision reference instead, increasing the intermediary value to 53 bits, which is still not the 64 bits one would assume with a 64-bit interface.
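A minimal sketch of this round trip (the size used here is just an illustration):

```c
/* Illustrative only: the int64 -> float -> int64 round trip a workspace
 * query goes through. */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    int64_t lwork   = 8999999900LL;   /* a size needing more than 24 bits    */
    float   work1   = (float)lwork;   /* what the routine stores in WORK(1)  */
    int64_t alloced = (int64_t)work1; /* what the caller ends up allocating  */

    /* With only 24 significand bits, the retrieved size can be smaller than
     * the computed one, i.e. the workspace ends up under-allocated. */
    printf("computed %lld, after float round trip %lld, difference %lld\n",
           (long long)lwork, (long long)alloced, (long long)(lwork - alloced));
    return 0;
}
```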
Workarounds, four possible ways:
- If you want to use single or complex lapack routines, do not use the automatic memory allocation via the C lapacke interface
- If you use the two-call lapack function method, use the double(!) precision routine for the memory calculation (see the sketch after this list)
- Have a look at the reference implementation of the lapack routine and calculate the required memory on your own
- Only use small matrix sizes when using float/complex matrices
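Here is a sketch of the second workaround, assuming the LAPACKE *_work interface, column-major storage and jobz='V'/uplo='U'; error handling is kept minimal:

```c
/* Sketch of the double-precision query workaround (illustrative only).
 * The DOUBLE PRECISION query keeps the size in a 53-bit significand, so the
 * returned value survives the round trip; the actual solve still uses floats. */
#include <stdlib.h>
#include <lapacke.h>

lapack_int ssyevd_with_double_query(lapack_int n, float *a, lapack_int lda, float *w)
{
    double     dwork_query;
    lapack_int iwork_query, info, lwork, liwork;

    /* Dummy double buffers for the query call; their contents are not used,
     * only the problem size matters for the workspace calculation. */
    double *da = calloc((size_t)lda * (size_t)n, sizeof *da);
    double *dw = calloc((size_t)n, sizeof *dw);
    if (!da || !dw) { free(da); free(dw); return -1; }

    /* lwork = liwork = -1 requests a workspace query only. */
    info = LAPACKE_dsyevd_work(LAPACK_COL_MAJOR, 'V', 'U', n, da, lda, dw,
                               &dwork_query, -1, &iwork_query, -1);
    free(da);
    free(dw);
    if (info != 0) return info;

    lwork  = (lapack_int)dwork_query;   /* exact for any realistic size */
    liwork = iwork_query;

    float      *work  = malloc((size_t)lwork  * sizeof *work);
    lapack_int *iwork = malloc((size_t)liwork * sizeof *iwork);
    if (!work || !iwork) { free(work); free(iwork); return -1; }

    /* The actual single precision solve, with a workspace size that was not
     * truncated by the float round trip. */
    info = LAPACKE_ssyevd_work(LAPACK_COL_MAJOR, 'V', 'U', n, a, lda, w,
                               work, lwork, iwork, liwork);
    free(work);
    free(iwork);
    return info;
}
```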
Other people have stumbled upon this but didn't follow it through to the real cause, see e.g. "openBLAS build with int64 support fails on valid input for ssyevd".
One has to stress that, at least with the two-call method, this is not a bug but a design flaw. In the case of the lapacke automatic memory allocation, it has to be considered a fairly severe bug.
Regards,
oxydicer
I guess it must have made sense at the time (to return the size via the work array pointer), but I wonder what keeps us from turning the size specifier into an in/out variable and passing back the exact value there as well? A "modern" caller would then check that first and resort to the work array member only if lwork was still -1, while an "old" caller would notice no change.
Also, back then an LWORK large enough to hit this problem was probably physically out of reach, and ilp64 made it obvious. I'm dreaming of the day when the NB value is derived at runtime instead of via the two-call scheme.
Here are two discussion threads related to this: https://icl.cs.utk.edu/lapack-forum/viewtopic.php?t=1418 http://icl.cs.utk.edu/lapack-forum/archives/lapack/msg00827.html
This is especially an issue for algorithms that require an O(n^2) workspace. For algorithms that require an O(n*nb) workspace, this is less of an issue.
Yes, this is a design flaw.
@martin-frbg: how would your proposed change work for the C _work interface? We have LWORK as INPUT only there. Changing LWORK to INPUT/OUTPUT there is a major change. Do you have an idea how to solve this issue? See: https://github.com/Reference-LAPACK/lapack/blob/aa631b4b4bd13f6ae2dbab9ae9da209e1e05b0fc/LAPACKE/src/lapacke_dgeqrf_work.c#L35
I was thinking that we could also create some workspace allocation subroutines such as LAPACK_dgeqrf__workspace_query( ) and this would return the workspace needed.
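One possible shape for such a helper, as a hypothetical sketch (the name is taken from the comment above and does not exist in LAPACK/LAPACKE today; the body simply wraps the existing two-call query for illustration):

```c
/* Hypothetical helper: a workspace query that hands the size back as an
 * integer, so it never passes through WORK(1) and its float representation. */
#include <lapacke.h>

lapack_int LAPACK_dgeqrf__workspace_query( lapack_int m, lapack_int n,
                                           lapack_int lda,
                                           lapack_int* lwork )
{
    /* a and tau are not referenced during a pure workspace query (lwork = -1),
     * so small dummies are enough here. */
    double a_dummy = 0.0, tau_dummy = 0.0, work_query = 0.0;

    lapack_int info = LAPACKE_dgeqrf_work( LAPACK_COL_MAJOR, m, n,
                                           &a_dummy, lda, &tau_dummy,
                                           &work_query, -1 );
    if( info == 0 )
        *lwork = (lapack_int) work_query;
    return info;
}
```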
Welp, my cunning plan does not really work when sober... But these are actually two problems, I think: one is the work size overflowing a lapack_int, and the other is "only" a misrepresentation due to limited precision. I wonder if it would be possible to round up the calculated size to anticipate the latter, at the expense of "some" unused memory?
Yes, this is a design flaw.
I can see 2 different flaws here:
1. On LAPACK: routines return the work size using a real variable.
2. On LAPACKE: routines propagate the flaw from LAPACK.
@martin-frbg's idea is a good solution to (1). New Fortran code could use the return value of LWORK instead of WORK(1). We can try modifying the code with some (semi-)automatic procedure for replacement. In ssyevd.f, for instance, we could replace
```fortran
      ELSE IF( LQUERY ) THEN
         RETURN
      END IF
```
by
```fortran
      ELSE IF( LQUERY ) THEN
         LWORK = LOPT
         RETURN
      END IF
```
Adding LAPACKE_dgeqrf__work_query(), as @langou suggests, solves (2), although there is a lot of work associated with this modification.
Changing lwork to be IN/OUT would otherwise be a nice solution, but it is not backwards compatible. The application would then have to know whether the LAPACK version was <= 3.10 (say) or > 3.10 to know where to get the lwork. Worse, there are instances where applications pass in a const value, expecting it to remain const, so LAPACK changing its behavior to overwrite that value would be very detrimental (undefined behavior). For instance, in MAGMA:
```c
const magma_int_t ineg_one = -1;
...
magma_int_t query_magma, query_lapack;
magma_zgesdd( *jobz, M, N,
              unused, lda, runused,
              unused, ldu,
              unused, ldv,
              dummy, ineg_one,   // overwriting ineg_one would break MAGMA
#ifdef COMPLEX
              runused,
#endif
              iunused, &info );
assert( info == 0 );
query_magma = (magma_int_t) MAGMA_Z_REAL( dummy[0] );
```
The solution I proposed some years ago and implemented in MAGMA is simply in sgesdd, etc., to round the lwork returned in work[1] up a little bit, so the returned value is always >= the intended value. See https://bitbucket.org/icl/magma/src/master/control/magma_zauxiliary.cpp and use in https://bitbucket.org/icl/magma/src/master/src/zgesdd.cpp. (See a release for the generated single-precision version.) Basically replace
```fortran
      WORK( 1 ) = MAXWRK
```
with
```fortran
      WORK( 1 ) = lapack_roundup_lwork( MAXWRK )
```
where the function lapack_roundup_lwork rounds it up slightly, as magma_*make_lwork does. In MAGMA, I rounded up by multiplying by (1 + eps), using single-precision eps but doing the calculation in double. Then existing applications will behave correctly without any need to change their workspace queries.
After more testing, I found for lwork > 2^54, it needs to use C/C++/Fortran definition of epsilon = 1.19e-07 (aka ulp), rather than the LAPACK definition slamch("eps") = 5.96e-08 (aka unit roundoff, u). If using ulp, it looks like the calculation can be done in single.
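A minimal C sketch of that round-up; the name follows the placeholder lapack_roundup_lwork above and is illustrative, and FLT_EPSILON is the C epsilon (ulp) just mentioned:

```c
/* Illustrative sketch: return a float that, once truncated back to an
 * integer by the caller, is never smaller than the requested lwork. */
#include <float.h>
#include <stdint.h>

static float lapack_roundup_lwork(int64_t lwork)
{
    float w = (float)lwork;

    /* Only bump the value when the float round trip actually lost something,
     * so exactly representable sizes are passed through unchanged. */
    if ((int64_t)w < lwork) {
        /* Multiply by (1 + ulp); the product is formed in double so the bump
         * itself cannot be lost to rounding before the final cast. */
        w = (float)((double)w * (1.0 + (double)FLT_EPSILON));
    }
    return w;
}
```

With this, WORK(1) truncated back to an integer is an upper bound on the true requirement, at the cost of a relative overestimation of about one ulp.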
I think this issue confuses 32-bit overflow with float-integer conversion. The conversion from 32-bit float to 32-bit integer can be at most off by 256=2^8 because 32-bit floating point can only be off by 8 bits (23-bit mantissa vs 31 bits in integer). In practice, I couldn't get the difference to be larger than 63. This can become an issue in ILP64 mode, where the 23-bit mantissa is not enough to store 63 bits of integer. Adding epsilon could potentially overestimate the allocation by over a terabyte. The real problem is when the workspace is for a square NxN matrix and N is large enough that N*N overflows a 32-bit integer. Adding ULP or EPS does not solve that problem at all. Just use N=46341 and N*N will overflow.
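A quick check of that last claim (illustrative only):

```c
/* N = 46341 is the smallest N whose square exceeds INT32_MAX. */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    int32_t n  = 46341;
    int64_t nn = (int64_t)n * n;   /* compute in 64 bits to avoid overflow */

    printf("n*n = %lld, INT32_MAX = %d, excess = %lld\n",
           (long long)nn, INT32_MAX, (long long)(nn - INT32_MAX));
    return 0;
}
```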
I think there is no confusion, but two separate (if related) issues. One was brought up in the original message, where the ILP64 value gets truncated to float (and rounding up the result, still as a float, should get around that truncation). I am not sure if I read the MAGMA code correctly, but to me it looked as if the roundup there was only performed for single precision float, with lwork returned unmodified otherwise, so the proposed PR would seem to go beyond the recommendation? The other is the int32 overflow dug up by langou from the mailing list archive, and I believe nothing except using a LAPACK built with ILP64 will get around that.
> I think this issue confuses 32-bit overflow with float-integer conversion. The conversion from 32-bit float to 32-bit integer can be at most off by 256=2^8 because 32-bit floating point can only be off by 8 bits (23-bit mantissa vs 31 bits in integer). In practice, I couldn't get the difference to be larger than 63. This can become an issue in ILP64 mode, where the 23-bit mantissa is not enough to store 63 bits of integer.
Yes, I agree there are multiple problems. #605 tries to solve the underestimation of LWORK in the conversion INTEGER -> FLOAT (DOUBLE PRECISION).
> Adding epsilon could potentially overestimate the allocation by over a terabyte.
Just to make it clearer for the discussion. If we use a float (23-bit mantissa) to store a big int64 N = 2^60-1, we obtain:
float( (2^60-1)*(1+2^-23) ) = 2^60*(1+2^-23) = (2^60-1) + (2^37+1),
where (2^37+1) is the overestimation. I don't see a big problem here because, although (2^60-1) floating-point positions already occupy a large amount of memory, the overestimation is nothing compared to the total, i.e., 2^60*(1+2^-23). The increment in the workspace is at most eps*N, which means a relative overestimation of eps/(1+eps).
We could try to avoid this overestimation in the case where N can be represented exactly. We would need a specific check for that.
I updated #605.
- Now we don't need to overestimate if the integer LWORK is smaller than 2**(t-1), where t is the number of digits in the mantissa of the floating-point number.
- Also, the code in MAGMA uses DLAMCH('E') as EPS to compute (1+EPS). I am not sure this is the correct option. On my machine, using LAPACK 3.10, both 1.0E0 + SLAMCH('E') .EQ. 1.0E0 and 1.0D0 + DLAMCH('E') .EQ. 1.0D0 evaluate to .TRUE. Mind that DLAMCH('E') can be either EPSILON(1.0D0) or EPSILON(1.0D0) * 0.5.
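The same distinction checked in C (illustrative; it mirrors the Fortran expressions above):

```c
/* FLT_EPSILON is the ulp of 1.0f; half of it (unit roundoff, which
 * SLAMCH('E') may return) is absorbed by round-to-nearest-even. */
#include <float.h>
#include <stdio.h>

int main(void)
{
    volatile float ulp_bump = 1.0f + FLT_EPSILON;          /* greater than 1.0f */
    volatile float u_bump   = 1.0f + FLT_EPSILON / 2.0f;   /* rounds back to 1.0f */

    printf("1 + ulp > 1 ? %d\n", ulp_bump > 1.0f);   /* prints 1 */
    printf("1 + u   > 1 ? %d\n", u_bump   > 1.0f);   /* prints 0 */
    return 0;
}
```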
We just merged #605 from Weslley. This fixes the rounding during the INTEGER to SINGLE/DOUBLE PRECISION cast. The workspace computation itself still needs more work: since it is intrinsically done in INTEGER arithmetic, it can overflow regardless of how the result is rounded during the INTEGER to SINGLE/DOUBLE PRECISION cast. So the next step is to compute the workspace directly in SINGLE/DOUBLE PRECISION. #605 is a start.