
Returning array from function

Open leonfoks opened this issue 4 years ago • 23 comments

I was going to ask this in #113 but didn't want to derail that thread. This is more of a personal question, since I've actually done very little with returning an allocatable array from a function. I may have flawed logic, so please correct me if I'm wrong. I've had compile issues in the past returning allocatables from functions (I can't remember all the details, but perhaps others have had these too), and I "intuitively" thought there might be a slowdown if those functions need to be called multiple times.

I'll use the snippet from #113:

function mean(mat, dim) result(res)
    real(sp), intent(in) :: mat(:,:)
    integer, intent(in), optional :: dim !dim = 1 or dim = 2 
    real(sp), allocatable :: res(:)
end function

If I have a large array, and I call this multiple times, does it reallocate memory for that variable every time? Or is there some cleverness that happens?

e.g. the 1D result is already allocated, the function allocates its own result, and on return that result is copied into the pre-allocated array.

real(sp), allocatable :: x(:), A(:, :)

! A = full of numbers
allocate(x(size(A, 2)))

x(:) = mean(A, dim=1)

Is it known what happens specifically, memory-wise? Is it compiler-dependent?

I'm asking because it will be slower if allocations keep happening; we might want to be careful and always make subroutines available where preallocated memory can be accessed without any implicit allocations and copies. Am I worrying about this too much? haha

leonfoks avatar Jan 19 '20 19:01 leonfoks

This is really important.

I don't think we should use allocatable at all in the lowest level API, unless the size of the output is truly not known ahead of time, like in loadtxt.

Also, in the past I have found that if a function returns an array (non-allocatable), it can be slower than an equivalent subroutine with an intent(out) array. Apparently there is an extra copy happening. This is something where I feel Fortran compilers must do better. Ideally there shouldn't be any overhead of using functions over subroutines, but unfortunately it seems there is.
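For concreteness, a minimal sketch of the two forms being compared (scale_f and scale_s are illustrative names, not stdlib procedures):

module scale_mod
    implicit none
contains
    ! Function form: the result is a fresh array on every call, and the
    ! compiler may introduce an extra copy into the caller's variable.
    function scale_f(x, c) result(y)
        real, intent(in) :: x(:)
        real, intent(in) :: c
        real :: y(size(x))
        y = c * x
    end function scale_f

    ! Subroutine form: the caller supplies pre-allocated storage, so no
    ! temporary result array is needed.
    subroutine scale_s(x, c, y)
        real, intent(in)  :: x(:)
        real, intent(in)  :: c
        real, intent(out) :: y(:)
        y(:) = c * x
    end subroutine scale_s
end module scale_mod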

certik avatar Jan 19 '20 19:01 certik

Phew, I was hoping I wasn't crazy haha

leonfoks avatar Jan 19 '20 19:01 leonfoks

I don't believe there is any performance difference between an automatically-sized, array-valued function, for example

function foo(array) result(res)
    real, intent(in) :: array(:,:)
    real :: res(size(array, dim=2))
end function foo

versus one whose result is allocatable. In either case the size of the result is determined at run time. So there is absolutely no reason to avoid allocatable results if one is determined to use a function.
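For comparison, a sketch of the allocatable-result counterpart (foo_alloc and its body are illustrative only); in both versions the size of the result is determined at run time:

function foo_alloc(array) result(res)
    real, intent(in)  :: array(:,:)
    real, allocatable :: res(:)
    allocate(res(size(array, dim=2)))
    res = sum(array, dim=1)   ! placeholder computation
end function foo_alloc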

Also, in the past I have found that if a function returns an array (non allocatable), it can be slower than an equivalent subroutine with intent(out) array. Apparently there is an extra copy happening.

This is the real problem with array-valued functions, allocatable or not, and has nothing to do with allocatable.

nncarlson avatar Jan 19 '20 23:01 nncarlson

So there is absolutely no reason to avoid allocatable results if one is determined to use a function.

What about allocation on the stack vs. the heap? Yes, this isn't defined in the language and is implementation-specific, but I have found that in practice there is a difference:

  • Non-allocatable function result may be allocated on the stack, which is faster;
  • Allocatable function result is always allocated on the heap, which is slower;

I never measured this, but I have anecdotal personal experience of WRF being noticeably slower when compiled with -heap-arrays (the ifort flag that makes all arrays be allocated on the heap).

This is also the caveat with non-allocatable function results (or subroutine output arguments): for large arrays, the stack can overflow. This problem doesn't occur with allocatable arrays.

milancurcic avatar Jan 20 '20 16:01 milancurcic

I can certainly believe there is a performance difference between stack- and heap-allocated arrays. But in the case of an automatic array result that must be allocated at run time, do you think the compiler inserts code to decide where to allocate (stack or heap) based on the size it determines for a particular invocation of the function? (Note I'm not talking about a size known at compile time.) I suppose that could be the case; do you know? But even if so, I think the far bigger cost is the one @certik pointed out with using a function at all.

nncarlson avatar Jan 20 '20 17:01 nncarlson

We should set up a project to carefully benchmark all these options. Note that gfortran and Intel can put an allocatable on the stack as well.

certik avatar Jan 20 '20 17:01 certik

We should set up a project to carefully benchmark all these options.

That would be very interesting to see. And there are many other specific performance questions like this. Conclusions from such a project could feed into a "Fortran best practices" guide.

Note that gfortran and Intel can put an allocatable on the stack as well.

Yep. My experience with Intel is that I routinely have to unlimit the stack size to avoid hitting a segfault. I have no such issues with NAG, for example.

nncarlson avatar Jan 20 '20 17:01 nncarlson

Also, I want to set up benchmarks for operations on arrays (y(:) = a(:) + b(:)) compared to an explicit loop. The loop sometimes seems to be faster, and I have seen several codes move away from array syntax to explicit loops because of that.

Those are things where I want Fortran compilers to do better. And a first step is to have reliable benchmarks, showing where the "performance bug" is.
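A minimal benchmark skeleton along these lines, using system_clock; the array size, repetition count, and program name are arbitrary choices for illustration:

program array_vs_loop
    use, intrinsic :: iso_fortran_env, only: int64
    implicit none
    integer, parameter :: n = 10000000
    real, allocatable :: a(:), b(:), y(:)
    integer :: i, rep
    integer(int64) :: t0, t1, rate

    allocate(a(n), b(n), y(n))
    call random_number(a)
    call random_number(b)

    call system_clock(t0, rate)
    do rep = 1, 10
        y(:) = a(:) + b(:)             ! whole-array expression
    end do
    call system_clock(t1)
    ! print a checksum so the compiler cannot discard the loop
    print *, 'array expression:', real(t1 - t0) / rate, 's  checksum =', sum(y)

    call system_clock(t0)
    do rep = 1, 10
        do i = 1, n                    ! explicit loop
            y(i) = a(i) + b(i)
        end do
    end do
    call system_clock(t1)
    print *, 'explicit loop:   ', real(t1 - t0) / rate, 's  checksum =', sum(y)
end program array_vs_loop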

certik avatar Jan 20 '20 18:01 certik

In that benchmark, a manually unrolled loop version would be cool to see. The compilers should already do a good job of this, but it would be nice to see in a benchmark whether it's worth it.
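A sketch of what a manually unrolled variant could look like (add_unrolled is an illustrative name; the unroll factor of 4 is chosen arbitrarily):

subroutine add_unrolled(a, b, y)
    real, intent(in)  :: a(:), b(:)
    real, intent(out) :: y(:)
    integer :: i, n, m
    n = size(a)
    m = n - mod(n, 4)
    do i = 1, m, 4                 ! main loop, unrolled by 4
        y(i)   = a(i)   + b(i)
        y(i+1) = a(i+1) + b(i+1)
        y(i+2) = a(i+2) + b(i+2)
        y(i+3) = a(i+3) + b(i+3)
    end do
    do i = m + 1, n                ! remainder loop
        y(i) = a(i) + b(i)
    end do
end subroutine add_unrolled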

leonfoks avatar Jan 20 '20 18:01 leonfoks

This looks like a good candidate for a fortran-lang/benchmarks repo. It could include both benchmarks of stdlib and of the more fundamental operations discussed in this thread. It would be nice to have a pipeline there that automatically generates a report for all procedures/programs in the suite. Later on, when we change the implementation of a stdlib procedure that is used by other procedures, it would be useful to verify that there's no significant performance regression from changing that building block.

Besides these, I'm also curious about the overhead of calling a generic procedure that overloads some 50 or 100 specific procedures.
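For reference, a sketch of such a generic with just two specifics (mean_sp and mean_dp are illustrative names, not the stdlib implementation); resolution to a specific procedure happens at compile time, so a benchmark would show whether any run-time overhead remains:

module mean_mod
    implicit none
    integer, parameter :: sp = kind(1.0), dp = kind(1.0d0)

    interface mean                      ! generic name dispatching on argument kind
        module procedure mean_sp, mean_dp
    end interface mean
contains
    function mean_sp(x) result(res)
        real(sp), intent(in) :: x(:)
        real(sp) :: res
        res = sum(x) / size(x)
    end function mean_sp

    function mean_dp(x) result(res)
        real(dp), intent(in) :: x(:)
        real(dp) :: res
        res = sum(x) / size(x)
    end function mean_dp
end module mean_mod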

milancurcic avatar Jan 20 '20 18:01 milancurcic

Additionally, one should be careful with assignments to allocatable arrays; the code below is perfectly reasonable:

real, allocatable :: a(:)

a = [1, 2, 3]
a = [1, 2, 3, 4]

However, the compiler has to add checks on the sizes and do a deallocation/allocation in case they don't match.
To prevent the compiler from adding these checks, one has to write this:

a(:) = [1, 2, 3]

in which case it will segfault if the dimensions don't match. If this library is low-level enough, it should not rely on automatic allocation AND it should disallow the array dimension checks (i.e. use a(:) =).

zerothi avatar Apr 06 '20 11:04 zerothi

@zerothi indeed, that seems to be what is needed now. It seems very unfortunate that the natural syntax a = [1, 2, 3] has a runtime overhead, and one must use a(:) = [1, 2, 3] instead.

What's worse, say the code looks like this:

real :: a(3)
...
a = [1, 2, 3]

and then later you decide to change the declaration to allocatable:

real, allocatable :: a(:)
allocate(a(3))
...
a = [1, 2, 3]

Then in the line a = [1, 2, 3] there is now a performance overhead.

I wonder if for this reason one should never use a = [1, 2, 3] and instead always use a(:) = [1, 2, 3], even in:

real :: a(3)
...
a(:) = [1, 2, 3]

So that when the declaration is later changed to allocatable, there is no unexpected overhead.

certik avatar Apr 06 '20 16:04 certik

The technique of using a(:) instead of a to avoid checks for reallocation, when one knows they aren't needed, is good. However, I'd be uncomfortable recommending a(:) in the case where a is not allocatable, without knowing what object code compilers will generate. Superficially they are the same thing, but they are technically very different.

nncarlson avatar Apr 06 '20 17:04 nncarlson

I don't think the compiler distinguishes between a and a(:) for non-allocatable arrays. I have never heard of performance degradation due to that. In any case, it is very easy to check...

Generally I use it, simply because it is an easy documentation reminder of the array dimensions ;)

zerothi avatar Apr 06 '20 18:04 zerothi

What's worse, say the code looks like this:

real :: a(3)
...
a = [1, 2, 3]

and then later you decide to change the declaration to allocatable:

real, allocatable :: a(:)
allocate(a(3))
...
a = [1, 2, 3]

Then in the line a = [1, 2, 3] there is now a performance overhead.

Is the overhead due to a run-time check that the RHS has the same shape as the LHS, or because reallocation on assignment always happens, even if the shape is the same?

If the latter, that's compiler dependent, no? A compiler should be able to check at run-time if the array needs reallocating.

I wonder if for this reason one should never use a = [1, 2, 3] and instead always use a(:) = [1, 2, 3]

What happens if the re-allocation is desired, for example if size(a) == 3 and then you do a(:) = [1, 2, 3, 4]? In this case, I want re-allocation and not a run-time error.

milancurcic avatar Apr 06 '20 18:04 milancurcic

Is the overhead due to a run-time check that the RHS has the same shape as the LHS, or because reallocation on assignment always happens, even if the shape is the same?

Basically

..., allocatable :: a
a = [1, 2, 3]

gets translated to this

logical :: dealloc
integer :: i

dealloc = .not. allocated(a)
if (.not. dealloc) then
  do i = 1, rank(a)                   ! compare the extent in every dimension
    dealloc = size(a, dim=i) /= size(RHS, dim=i)
    if (dealloc) exit
  end do
end if
if (dealloc) then
  if (allocated(a)) deallocate(a)
  allocate(a, mold=RHS)
end if
a(...) = RHS(...)                     ! finally, copy the RHS into a

If the latter, that's compiler dependent, no? A compiler should be able to check at run-time if the array needs reallocating.

I have seen the same behaviour on both Intel and GCC, but yes, compilers could handle this differently.

I wonder if for this reason one should never use a = [1, 2, 3] and instead always use a(:) = [1, 2, 3]

What happens if the re-allocation is desired, for example if size(a) == 3 and then you do a(:) = [1, 2, 3, 4]? In this case, I want re-allocation and not a run-time error.

Well, then there is nothing you can do; a(:) = won't allow reallocation. It isn't allowed by the standard (as far as I remember).
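A small sketch of the two cases (realloc_demo is an illustrative program, assuming a starts with three elements):

program realloc_demo
    implicit none
    real, allocatable :: a(:)

    a = [1., 2., 3.]           ! allocated with size 3
    a = [1., 2., 3., 4.]       ! reallocation on assignment: a now has size 4
    print *, size(a)           ! prints 4

    a = [1., 2., 3.]           ! reallocation on assignment again: back to size 3
    ! a(:) = [1., 2., 3., 4.]  ! nonconforming: a(:) never reallocates, so the shapes
    !                          ! must already match; with bounds checking enabled this
    !                          ! is typically caught at run time
end program realloc_demo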

zerothi avatar Apr 06 '20 18:04 zerothi

@zerothi Okay, got it, that's clear now. So this:

I wonder if for this reason one should never use a = [1, 2, 3] and instead always use a(:) = [1, 2, 3]

applies only when we know reallocation is not intended; I didn't make that connection at first.

milancurcic avatar Apr 06 '20 18:04 milancurcic

applies only when we know reallocation is not intended; I didn't make that connection at first.

exactly.

zerothi avatar Apr 06 '20 18:04 zerothi

I have seen the same behaviour on both Intel and GCC, but yes, compilers could handle this differently.

@zerothi how could compilers handle it differently?

certik avatar Apr 06 '20 19:04 certik

There could be flags that disable such checks and force the array dimensions to match. This is just a guess; I don't know whether such flags exist. Hence "could" ;)

zerothi avatar Apr 06 '20 19:04 zerothi

Both GFortran and Intel have such an option, see, e.g.: https://software.intel.com/en-us/fortran-compiler-developer-guide-and-reference-standard-realloc-lhs. I didn't know if that's what you were talking about, or something else.

certik avatar Apr 06 '20 19:04 certik

Both GFortran and Intel have such an option, see, e.g.: https://software.intel.com/en-us/fortran-compiler-developer-guide-and-reference-standard-realloc-lhs. I didn't know if that's what you were talking about, or something else.

Yes.

zerothi avatar Apr 08 '20 05:04 zerothi

Another important thing regarding loadtxt is that, if the array were returned as the function result, you couldn't overload the function. For example, gfortran would throw an error like this:

Error: Ambiguous interfaces in generic interface 'loadtxt' for 'loadtxt_dp' at (1) and 'loadtxt_sp' ...

The argument signatures would be the same, and the function result does not participate in generic resolution.
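A sketch of why: two specifics that differ only in the kind of their result are ambiguous, while subroutine versions with intent(out) arguments of different kinds are not (the names and bodies below are placeholders, not the actual loadtxt implementation):

module loadtxt_demo
    implicit none
    integer, parameter :: sp = kind(1.0), dp = kind(1.0d0)

    ! Rejected by the compiler: both specifics would take the same arguments
    ! and differ only in the kind of the function result.
    ! interface loadtxt
    !     module procedure loadtxt_sp_fun, loadtxt_dp_fun
    ! end interface loadtxt

    ! Accepted: the kind of the intent(out) dummy argument disambiguates.
    interface loadtxt
        module procedure loadtxt_sp, loadtxt_dp
    end interface loadtxt
contains
    subroutine loadtxt_sp(filename, d)
        character(*), intent(in) :: filename
        real(sp), allocatable, intent(out) :: d(:,:)
        allocate(d(0, 0))       ! placeholder: real code would read the file
    end subroutine loadtxt_sp

    subroutine loadtxt_dp(filename, d)
        character(*), intent(in) :: filename
        real(dp), allocatable, intent(out) :: d(:,:)
        allocate(d(0, 0))       ! placeholder
    end subroutine loadtxt_dp
end module loadtxt_demo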

MuellerSeb avatar Apr 19 '22 19:04 MuellerSeb