
NVIDIA/Portland compiler issue

Open lstorchi opened this issue 1 year ago • 22 comments

Describe the bug A clear and concise description of what the bug is. If you just want help, see the CosmoCoffee help forum and help assistant.

When compiling with the NVIDIA/Portland compiler (i.e., nvfortran), compilation fails with:

    NVFORTRAN-S-1254-dtauda is use associated with results and cannot be redeclared. (../equations.f90: 10)
    NVFORTRAN-S-0038-Symbol, dtauda, has not been explicitly declared (../equations.f90: 4)
    0 inform, 0 warnings, 2 severes, 0 fatal for dtauda

A possible solution, at least to compile and link camb, is to move the dtauda definition into results.f90.

To Reproduce Steps to reproduce (and platform if relevant). e.g. a complete python notebook.

lstorchi avatar Feb 25 '25 14:02 lstorchi

There may be a compatibility option that would allow this. However, I doubt it will work with that compiler in general: the Python interface has only been tested with ifort and gfortran (though it would be interesting to know if it does work; it is coded in a fairly general way).

cmbant avatar Feb 25 '25 14:02 cmbant

Dear Antony, thanks for your prompt feedback. I know all three compilers (GNU, Intel, NVIDIA) and their basic differences, as I am using them in a "similar" Python/Fortran project (i.e., BERTHA / PyBERTHA), although there I am using a C wrapper to make it all easier; at least that was my original idea. Yes, I guess porting the Python API won't be easy in the CAMB case. That is why, at the moment, I am using just the Fortran executable; still, there are issues I am currently investigating. I'll try to keep you posted. Thanks again for the quick feedback.

lstorchi avatar Feb 26 '25 08:02 lstorchi

While I keep debugging the code, I am sharing here some results from the NVFORTRAN build, which clearly fails somewhere; the code is fine when built with GNU. I am using my forks:

https://github.com/lstorchi/CAMB
https://github.com/lstorchi/forutils

Valgrind returns the following:

    ==8118== Process terminating with default action of signal 4 (SIGILL)
    ==8118== Illegal opcode at address 0x478AEF
    ==8118== at 0x478AEF: camb_camb_getresults_ (camb.f90:101)
    ==8118== by 0x48BCE8: camb_camb_runfromini_ (camb.f90:1065)
    ==8118== by 0x48FAC5: camb_camb_commandlinerun_ (camb.f90:1157)
    ==8118== by 0x404F31: MAIN_ (inidriver.f90:15)
    ==8118== by 0x404A72: main (in /home/redo/CAMB/fortran/camb)
    ==8118==

In principle I can easily work around the P%Max_eta_k=max(min(P%max_l,3000)*2.5_dl,P%Max_eta_k) line, and I did, but the problem is clearly elsewhere; indeed, a debugging session with gdb shows the following:

    Program received signal SIGSEGV, Segmentation fault.
    0x0000000000490a22 in mathutils::integrate_romberg (obj=..., fin=-3.2473995081498787e-304, a=0, b=1, tol=3.678794411714424e-08, maxit=<error reading variable: Cannot access memory at address 0x0>, minsteps=<error reading variable: Cannot access memory at address 0x0>, abs_tol=<error reading variable: Cannot access memory at address 0x0>) at ../MathUtils.f90:46
    46 gmax=h*(f(obj,a)+f(obj,b))

And here the backtrace:

    #0 0x0000000000490a22 in mathutils::integrate_romberg (obj=..., fin=-3.2473995081498787e-304, a=0, b=1, tol=3.678794411714424e-08, maxit=<error reading variable: Cannot access memory at address 0x0>, minsteps=<error reading variable: Cannot access memory at address 0x0>, abs_tol=<error reading variable: Cannot access memory at address 0x0>) at ../MathUtils.f90:46
    #1 0x000000000040d7bc in results::cambdata_deltatime (this=..., a1=0, a2=1, in_tol=<error reading variable: Cannot access memory at address 0x0>) at ../results.f90:629
    #2 0x000000000040d922 in results::cambdata_timeofz (this=..., z=0, tol=<error reading variable: Cannot access memory at address 0x0>) at ../results.f90:654
    #3 0x000000000040b172 in results::cambdata_setparams (this=..., p=..., error=<error reading variable: Cannot access memory at address 0x0>, doreion=<error reading variable: Cannot access memory at address 0x0>, call_again=<error reading variable: Cannot access memory at address 0x0>, background_only=<error reading variable: Cannot access memory at address 0x0>) at ../results.f90:517
    #4 0x0000000000478ba3 in camb::camb_getresults (outdata=..., params=..., error=<error reading variable: Cannot access memory at address 0x0>, onlytransfer=<error reading variable: Cannot access memory at address 0x0>, onlytimesources=<error reading variable: Cannot access memory at address 0x0>) at ../camb.f90:108
    #5 0x000000000048bce9 in camb::camb_runfromini (ini=..., inputfile=..., errmsg=<error reading variable: value requires 4294952496 bytes, which is more than max-value-size>) at ../camb.f90:1065
    #6 0x000000000048fac6 in camb::camb_commandlinerun (inputfile=...) at ../camb.f90:1157
    #7 0x0000000000404f32 in driver () at ../inidriver.f90:15
    #8 0x0000000000404a73 in main ()
    #9 0x00007ffff602e083 in __libc_start_main (main=0x404a40, argc=2, argv=0x7fffffffe1b8, init=, fini=, rtld_fini=, stack_end=0x7fffffffe1a8) at ../csu/libc-start.c:308
    #10 0x000000000040496e in _start ()

As I said, I will keep debugging; still, maybe you can guess a possible solution more easily than I can.

lstorchi avatar Feb 26 '25 15:02 lstorchi

Do the forutils unit tests pass? If not, probably compiler bug/difference.

Or it's interpreting the dummy object pointer argument to dverk in subroutines.f90 differently.

cmbant avatar Feb 26 '25 15:02 cmbant

Let me just add that, while the GNU-compiled executable does not produce any errors, valgrind still detects some issues there as well.

lstorchi avatar Feb 26 '25 15:02 lstorchi

Which version of gcc, and what error? I have not tested with valgrind recently, but the camb unit tests do include a test for memory leaks.

cmbant avatar Feb 26 '25 15:02 cmbant

Thanks for the quick feedback, and apologies for the delay in responding; I only found extra time for this project today. The gfortran I was testing was version 11.4.0, but I need to double-check this issue.

Meanwhile, I have just had the chance to look further into the NVIDIA/Portland compiler issue, which seems quite clearly related to the following:

call C_F_PROCPOINTER(c_funloc(fin), f) in MathUtils.f90 is the source of the problem.

Apparently the function pointer assignment does not work properly. I will try to track the problem further, and maybe ask on the NVIDIA forum as well. I'll keep you posted.

lstorchi avatar Feb 28 '25 15:02 lstorchi

Not super surprising; what the code is doing here is very hacky.

cmbant avatar Feb 28 '25 16:02 cmbant

An easy workaround is to duplicate some code, which is reasonable for now given the final goal (i.e., testing on GPU); indeed, it is working. Still, there are other issues, for instance the following unsound condition:

    if (.not. allocated(ajl) .or. any(ubound(ajl) < [num_xx, max_ix])) then

in bessels.f90 (line 90).

I am trying to track all the issues; maybe it will be useful as a future reference.

lstorchi avatar Mar 05 '25 14:03 lstorchi

Clearly, simply splitting the clause using an else if does the job.

lstorchi avatar Mar 05 '25 15:03 lstorchi

For the ubound issue, as the NVIDIA engineer states, "this is an issue with the code. Unlike C, Fortran does not enforce left-to-right evaluation nor short circuiting, so you can’t rely on this behavior.", so I guess that specific line of code should be fixed.

lstorchi avatar Mar 06 '25 09:03 lstorchi

This is true, though the optimizer will normally apply short-circuit evaluation. Easy enough to change e.g.

    if (allocated(ajl)) then
        if (any(ubound(ajl) < [num_xx, max_ix])) deallocate(ajl, ajlpr, ddajlpr)
    end if
    if (.not. allocated(ajl)) then
        allocate(ajl(1:num_xx,1:max_ix), ajlpr(1:num_xx,1:max_ix), &
            ddajlpr(1:num_xx,1:max_ix))
    end if

cmbant avatar Mar 06 '25 09:03 cmbant
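
For reference, the nested-test pattern above can be tried in isolation. The following is only a minimal standalone sketch, not CAMB code: the program name, the helper `ensure_size`, and the array sizes are hypothetical, and only a single array is handled. It shows that `ubound` is evaluated only after `allocated` has been checked, so the result does not depend on the compiler short-circuiting `.or.`.

```fortran
program ensure_alloc_demo
    ! Standalone illustration (hypothetical names, not taken from CAMB).
    implicit none
    real, allocatable :: ajl(:,:)

    ! First call: array is unallocated, so it is simply allocated.
    call ensure_size(ajl, 3, 4)
    if (any(shape(ajl) /= [3, 4])) error stop "unexpected shape after first call"

    ! Smaller request: the existing array is large enough and is kept.
    call ensure_size(ajl, 2, 2)
    if (any(shape(ajl) /= [3, 4])) error stop "array should have been kept"

    ! Larger request: the array is deallocated and allocated again.
    call ensure_size(ajl, 5, 4)
    if (any(shape(ajl) /= [5, 4])) error stop "array should have been reallocated"

    print *, "ok"

contains

    subroutine ensure_size(arr, n1, n2)
        real, allocatable, intent(inout) :: arr(:,:)
        integer, intent(in) :: n1, n2

        ! ubound is only referenced once arr is known to be allocated.
        if (allocated(arr)) then
            if (any(ubound(arr) < [n1, n2])) deallocate(arr)
        end if
        if (.not. allocated(arr)) allocate(arr(1:n1, 1:n2))
    end subroutine ensure_size

end program ensure_alloc_demo
```

The single-line form `.not. allocated(arr) .or. any(ubound(arr) < ...)` may reference `ubound` of an unallocated array, since the standard does not fix the evaluation order of the operands; the nested form avoids that on any conforming compiler.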

Dear Antony, thanks again for the quick reply. Yes, this is indeed what I've done in my local fork, and the code, which also includes some duplicated Integrate_Romberg functions, compiles and runs on CPU using the nvfortran compiler. The results seem to be plausibly within the expected numerical error when compared to those obtained with the GNU compiler.

lstorchi avatar Mar 06 '25 10:03 lstorchi

Good to know. I don't know if there's something better than duplication, e.g. maybe something like TClassDverk declaration I use for the dverk function (which I assume doesn't cause issues).

cmbant avatar Mar 06 '25 10:03 cmbant

Just to keep track of the current issue, here is an update from NVIDIA:

"Unfortunately the typical work around for this, i.e. adding an interface for the argument, didn’t work, so there’s more too this one, but highly likely related to this limitation. I added a problem report, TPR#37184.

We are in the process of replacing the current nvfortran with a new flang based Fortran compiler being jointly developed with the LLVM community. While it’s still in development, I was able to successfully build and run your code with this new compiler."

Thus the quick-and-dirty code duplication remains a possible, if quick-and-dirty, workaround.

lstorchi avatar Mar 10 '25 13:03 lstorchi

Just to add a quick update: after some time I got a first version of CAMB not only compiling with the NVIDIA compiler but also running on GPU. I still need to work on performance, but at least all the compilation issues seem to be solved.

lstorchi avatar May 05 '25 08:05 lstorchi

There are still issues related to the OpenMP part of the code; it seems that there is "something wrong in how the polymorphic types are getting set up in a few of the OpenMP regions". These issues are related to the current version of the NVIDIA/PGI compiler.

Specifically, when I compile the CAMB code with OpenMP, with or without OpenACC directives, by adding one of the following flag combinations: -mp -fopenmp; -mp; -fopenmp; -mp=multicore -fopenmp; -mp=multicore, the code crashes:

    0x00007ffff61af615 in pgf90_extends_type_of_i8 () from /opt/nvidia/hpc_sdk/Linux_x86_64/25.1/compilers/lib/libnvf.so
    (cuda-gdb) bt
    #0 0x00007ffff61af615 in pgf90_extends_type_of_i8 () from /opt/nvidia/hpc_sdk/Linux_x86_64/25.1/compilers/lib/libnvf.so

This is related to the following:

recfast.f90 line 522 select type(State)

lstorchi avatar May 16 '25 09:05 lstorchi

Sounds like a compiler bug? I had to report lots of bugs to gfortran before they were eventually fixed some years ago.

cmbant avatar May 16 '25 09:05 cmbant

Yes, it seems so, and yes, I reported the issue. I hope they can solve it.

lstorchi avatar May 16 '25 10:05 lstorchi

As the results of the GPU offload seem interesting, I started trying to mix OpenMP and OpenACC, and I found some NVIDIA compiler problems. While waiting for them to be fixed, hopefully, I am trying to rewrite some parts of the code. I started with the following, in recfast.f90:

    subroutine TRecfast_init(this,State, WantTSpin)
    class(CAMBdata), target :: State

instead of

    class(TCAMBdata), target :: State

so that I can avoid the

    select type(State)
    class is (CAMBdata)

But in that case I get the following:

    #0 0x00000000004360cd in cambdata_timeofz (this=..., z=20.030856658965735, tol=0.001) at ../results.f90:1220
    #1 0x00000000004480a3 in __nv_results_thermo_init__F1L2377_9 () at ../results.f90:2405

That is:

    CAMBdata_TimeOfz= this%DeltaTime(0._dl,1._dl/(z+1._dl), tol)

Any suggestions would be much appreciated.

lstorchi avatar May 19 '25 13:05 lstorchi

Sorry no idea.

cmbant avatar May 19 '25 15:05 cmbant

In principle, the problem seems quite obvious:

    (cuda-gdb) print this
    $2 = ( tcambdata

So here:

    #0 0x00000000004360cd in cambdata_timeofz (this=..., z=20.030856658965735, tol=0.001) at ../results.f90:1220

this is a TCAMBdata, whereas the function is expecting a CAMBdata.

lstorchi avatar May 21 '25 08:05 lstorchi
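
To make the type mismatch described above concrete, here is a minimal standalone sketch in standard Fortran. It is not CAMB code: `TBase` and `TDerived` are hypothetical stand-ins for TCAMBdata and CAMBdata, and `Init` is a hypothetical stand-in for TRecfast_init. The dummy argument stays declared as the base class and is narrowed with `select type`, with a `class default` branch that reports a mismatch instead of calling a method bound to the extended type. Whether this form avoids the nvfortran OpenMP crash reported earlier in the thread is not established here.

```fortran
module type_guard_demo
    ! Hypothetical stand-ins: TBase plays the role of TCAMBdata,
    ! TDerived the role of CAMBdata. Not taken from CAMB.
    implicit none

    type :: TBase
    end type TBase

    type, extends(TBase) :: TDerived
        real :: scale = 2.0
    contains
        procedure :: TimesScale
    end type TDerived

contains

    function TimesScale(this, x) result(y)
        class(TDerived), intent(in) :: this
        real, intent(in) :: x
        real :: y
        y = this%scale * x
    end function TimesScale

    subroutine Init(State, ok)
        ! The dummy is the base class; the extended-type method is only
        ! called once the dynamic type has been checked.
        class(TBase), target, intent(in) :: State
        logical, intent(out) :: ok

        select type (State)
        class is (TDerived)
            ok = abs(State%TimesScale(1.5) - 3.0) < 1.0e-6
        class default
            ok = .false.   ! dynamic type is not TDerived: report, do not call
        end select
    end subroutine Init

end module type_guard_demo

program type_guard_main
    use type_guard_demo
    implicit none
    type(TDerived) :: d
    type(TBase) :: b
    logical :: ok

    call Init(d, ok)
    if (.not. ok) error stop "TDerived argument should be accepted"

    call Init(b, ok)
    if (ok) error stop "TBase argument should be rejected"

    print *, "ok"
end program type_guard_main
```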