
Whither test_network_sync?

rouson opened this issue 2 years ago • 10 comments

After attempting to develop test_network_sync.f90 into a test that reports pass or failure, and after seeking advice from Tom Clune, who develops the pFUnit testing framework, I'm not confident there's a way to write a test that is both meaningful and straightforward. Without doing extensive research myself, I have the sense that the formula employed in the randn functions has been thoroughly studied and there's not much to test in its behavior beyond what is already known. @milancurcic if you have thoughts on a useful test, please let me know. Otherwise, I recommend removing the test, given that it currently just prints the generated samples without any check on their statistical properties.

rouson avatar Apr 20 '22 02:04 rouson

Also, the sync type-bound procedure suffers from an unfortunate ambiguity in its name. I understand the reason for the name, but I'd just call it broadcast or something similar because there's no explicit synchronization (sync all or sync images) and the standard imposes no precise synchronization semantics on collective subroutines other than noting that "the calculations performed by a collective subroutine have some internal synchronizations."
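For context, a broadcast-only implementation needs no explicit image control: co_broadcast copies from the source image, and only the collective's internal synchronization applies. A minimal sketch (the subroutine name and argument names are hypothetical, not neural-fortran's actual API):

```fortran
! Hypothetical sketch: copy network parameters from image 1 to all
! other images using the Fortran 2018 collective co_broadcast.
! Note there is no "sync all" or "sync images" statement here.
subroutine broadcast_parameters(weights, biases)
  real, intent(inout) :: weights(:), biases(:)
  call co_broadcast(weights, source_image=1)
  call co_broadcast(biases,  source_image=1)
end subroutine broadcast_parameters
```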

rouson avatar Apr 20 '22 02:04 rouson

Yes, I wouldn't worry about this one as it will likely go away. I suggest focusing on io, datasets, and activations.

milancurcic avatar Apr 20 '22 03:04 milancurcic

I looked at what sync does to remind myself. It updates the weights and biases on all images > 1 to match those on image 1. A passing test for this would ensure they are indeed exactly the same after the call to sync.
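A sketch of such a check, assuming the parameters are reachable as plain arrays (the names here are illustrative, not neural-fortran's API): each image keeps a copy of its own values, overwrites the copy with image 1's values via co_broadcast, and compares the two.

```fortran
! Hypothetical test sketch: after the real test calls net % sync(1),
! every image's weights should match image 1's exactly.
program test_network_sync
  implicit none
  real :: w(3), w_from_image1(3)
  call random_number(w)   ! stand-in for the network's weights
  ! ... in the real test, call net % sync(1) here ...
  w_from_image1 = w
  call co_broadcast(w_from_image1, source_image=1)
  if (any(w /= w_from_image1)) then
    print *, 'test_network_sync: FAIL on image', this_image()
  else
    print *, 'test_network_sync: PASS on image', this_image()
  end if
end program test_network_sync
```

Bitwise equality (/= rather than a tolerance) is appropriate here, since a broadcast should reproduce image 1's values exactly.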

milancurcic avatar Apr 20 '22 03:04 milancurcic

@milancurcic perfect. Thanks. I'll submit a PR with the suggested test.

rouson avatar Apr 20 '22 17:04 rouson

@milancurcic It still might be that test_network_sync is unnecessary. Fortran 2018 provides an intrinsic subroutine, random_init(repeatable, image_distinct), where both arguments are logical, intent(in), and where passing image_distinct=.false. ensures that each image uses the same sequence of random numbers. At scale, some amount of redundant computation generally outperforms communication, so the code will likely scale better if we replace co_broadcast with a call to random_init with image_distinct=.false..

The GNU, Intel, and Cray compilers support random_init. Some important updates to the behavior of random_init will appear in GCC 12, however, so I'll need to check whether GCC 11 works for our purposes. I just submitted a feature request to NAG, which usually responds quickly, so there's a reasonable chance that the NAG compiler will support random_init soon too.

rouson avatar Apr 20 '22 21:04 rouson

Actually, yet another way to view this is that the test is useful either way, and in fact it should give the same result either way. How that result is produced is an implementation detail that can be considered separately.

rouson avatar Apr 20 '22 23:04 rouson

Yes, in other words, ensuring that the initial state of the network is valid across images.

milancurcic avatar Apr 20 '22 23:04 milancurcic

I have determined that with gfortran 11.2.0 and the head of the main branch of OpenCoarrays, calling random_init with image_distinct=.false. produces the same random numbers on each image without the need for co_broadcast. See below. I haven't tested earlier versions of gfortran or OpenCoarrays.

I recommend this approach for scalability reasons. I'll submit a PR with separate commits that add the test and replace the co_broadcast with random_init. I'm thinking of setting repeatable=.true. so that the test result is deterministic. The current test is non-deterministic, so if you prefer that approach, let me know. I suppose we could also think about having both a deterministic test and a non-deterministic test. I'm not sure it matters much either way, given that our only goal is to ensure the same weights on all images.

% cat random.f90 
program main
  implicit none
  real harvest(5)
  call random_init(.true.,.false.)
  call random_number(harvest)
  print *, this_image(), ":", harvest
end program

% caf random.f90 

% cafrun -n 8 ./a.out
           5 :  0.571258783      0.226831257      0.175722122      0.691333890      0.885308683    
           6 :  0.571258783      0.226831257      0.175722122      0.691333890      0.885308683    
           7 :  0.571258783      0.226831257      0.175722122      0.691333890      0.885308683    
           8 :  0.571258783      0.226831257      0.175722122      0.691333890      0.885308683    
           1 :  0.571258783      0.226831257      0.175722122      0.691333890      0.885308683    
           2 :  0.571258783      0.226831257      0.175722122      0.691333890      0.885308683    
           3 :  0.571258783      0.226831257      0.175722122      0.691333890      0.885308683  

rouson avatar Apr 21 '22 19:04 rouson

Thanks @rouson. Any chance for this to work with GFortran 10?

milancurcic avatar Apr 25 '22 17:04 milancurcic

Most likely, yes. Certainly the current approach should work with GFortran 10. If random_init works with 10, I'll submit that approach as a separate PR.

rouson avatar Apr 25 '22 19:04 rouson

Closing in favor of #78.

milancurcic avatar Jul 14 '23 17:07 milancurcic