Two-stage compile reports 0 files at the second stage
When two-stage compilation is enabled, the LFRic gungho build currently fails at the link stage with the error: /scratch/hc46/hc46_gitlab/lfric_fab/gungho_model-mpif90-ifort/build_output/_prebuild/physics_mappings_alg_mod.cb6a16b9a.o not found.
The following messages are logged during the compile_fortran step:
Starting two-stage compile: mod files, multiple passes
...
Finalising two-stage compile: object files, single pass
...
stage 2 compiled 0 files
A number of compile commands are actually executed between Finalising two-stage compile: object files, single pass and stage 2 compiled 0 files. So I am wondering whether there is a bug in updating the build tree.
The build log is here: https://git.nci.org.au/bom/ngm/lfric/lfric_atm-fab/-/jobs/80826
The number of files reported in `stage 2 compiled ** files` is incorrect, and I have a fix. But compilation in two stages works for me (the mechanism works, it is just printing the wrong number).
Looking at the log file, I can't see a problem - the file is compiled twice (first only for interfaces, then fully), used in the right way in the linking command. My feeling is that this crash is unrelated? Don't we have an issue with the CI that if two of them are running at the same time they'll overwrite each other's data?
There seem to be two additional problems with two-stage compilation (besides reporting the wrong number):
- [ ] Compilation errors in stage 2 are not picked up and appear to be just ignored.
- [ ] The compilation errors triggered in stage 2 seem to be related to finding module files. I believe this happens when two files are compiled at about the same time and one of them needs the mod file of the other. Since the new mod file is still being written at that time, the file using it can't compile and aborts.
The first one needs some debugging (there seems to be error handling??). For the second one, my current idea is to redirect the module files of stage 2 to a different (temporary) directory, so the mod files from stage 1 are not overwritten.
The error messages do indeed say that it can't read a module file ('Error in reading the compiled module file'), which would fit the above suspicion.
It also looks as if the exception for the error is somehow lost. The return code from the compilation is confirmed to be 1, which means Tool.run raises an exception.
I've added the following logging:
try:
    logger.debug(f'CompileFortran compiling {analysed_file.fpath}')
    compile_file(analysed_file.fpath, flags,
                 output_fpath=obj_file_prebuild,
                 mp_common_args=mp_common_args)
    logger.debug(f"No error for {analysed_file.fpath} ---------------------")
except Exception as err:
    logger.debug(f'CompileFortran compiling {analysed_file.fpath} ERROR {err.value}')
    return Exception(f"Error compiling {analysed_file.fpath}:\n"
                     f"{err}"), None
And grepping for the filename, I see:
CompileFortran compiling /home/903/jxh903/fab-workspace/gungho_model-mpif90-ifort/build_output/algorithm/physics/physics_mappings_alg_mod.f90
run_command: mpif90 -warn all -gen-interfaces nosource -O2 -fp-model=strict -stand f08 -c -qopenmp -warn all -gen-interfaces nosource -O2 -fp-model=strict -stand f08 -g -traceback -module /home/903/jxh903/fab-workspace/gungho_model-mpif90-ifort/build_output physics_mappings_alg_mod.f90 -o /home/903/jxh903/fab-workspace/gungho_model-mpif90-ifort/build_output/_prebuild/physics_mappings_alg_mod.9eead9a44.o
Running 1 <---------------RETURN CODE is 1
STDERR IS:
'physics_mappings_alg_mod.f90(10): error #7005: Error in reading the compiled module file. [GALERKIN_PROJECTION_ALGORITHM_MOD]
...
physics_mappings_alg_mod.f90(159): catastrophic error: Too many errors, exiting
compilation aborted for physics_mappings_alg_mod.f90 (code 1)
That's it. I see neither the "No error" nor the "CompileFortran compiling ... ERROR" message at all??? Debug prints do indeed confirm that Tool.run raises the exception.
I tried to add error capturing to the compile_file function of the compile_fortran step, which calls the compiler's compile_file function, as shown below. Judging by the latest lfric_baf build, this does not seem to work: https://git.nci.org.au/bom/ngm/lfric/lfric_atm-fab/-/jobs/81152. I don't know why the error is suppressed. Should we add the error capturing to compiler.compile_file instead?
try:
    compiler.compile_file(input_file=analysed_file, output_file=output_fpath,
                          openmp=config.openmp,
                          add_flags=flags,
                          syntax_only=mp_common_args.syntax_only)
except Exception as err:
    return Exception(f"Error compiling {analysed_file.fpath}:\n"
                     f"{err}"), None
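Both snippets in this thread return the exception as the first element of a tuple instead of raising it, so the caller has to inspect every result or failures are silently dropped. A minimal sketch of such a check follows. It assumes results are `(result_or_exception, artefacts)` tuples; `fake_compile` and `check_for_errors` are hypothetical names for illustration, not Fab's API.

```python
from typing import Any, List, Tuple


def fake_compile(path: str) -> Tuple[Any, Any]:
    """Stand-in for a per-file compile worker (hypothetical, not Fab's API)."""
    try:
        if path.endswith("bad.f90"):
            # Simulates the compiler wrapper raising on a non-zero return code.
            raise RuntimeError("compiler returned 1")
        return f"{path}.o", [f"{path}.o"]
    except Exception as err:
        # Same pattern as in the thread: return the exception, don't raise it.
        return Exception(f"Error compiling {path}:\n{err}"), None


def check_for_errors(results: List[Tuple[Any, Any]]) -> None:
    """Raise if any worker returned an exception as its first element."""
    errors = [res for res, _ in results if isinstance(res, Exception)]
    if errors:
        msg = "\n\n".join(str(err) for err in errors)
        raise RuntimeError(f"{len(errors)} compile error(s):\n{msg}")


results = [fake_compile(p) for p in ["good.f90", "bad.f90"]]
try:
    check_for_errors(results)
except RuntimeError as err:
    print(err)
```

If a check like this is skipped for the stage 2 results, a failed compile would only show up later, for example as a missing object file at link time.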
OK, I found the error, it is indeed a bug in fab. Why my debug logging messages did not show up ... no idea. My best idea is that apparently python logging uses syslog, and that has a limit of 2b messages (and I added quite a bit of logging, command line parameters, compiler output). Maybe it was just coincidence that the messages I added exceeded the buffer length (and then got chopped off).
Unfortunately, my solution for ifort doesn't work :cry: Parallel compilation in stage 2 with ifort still crashes now and again because it reads an incomplete .mod file. I added a scratch directory as the module output path in stage 2, and an explicit include path to the original stage 1 directory, e.g.:
mpif90 ... -I my_fab_work/build_output -module my_fab_work/build_output/modules_second_stage create_wthetamask_lbc_kernel_mod.f90 -o my_fab_work/build_output/_prebuild/create_wthetamask_lbc_kernel_mod.43e4c407d.o
So it adds build_output using -I, then build_output/modules_second_stage as -module (to store the newly created modules in stage 2).
Problem seems to be that ifort always searches in the -module path first:
Simple reproducer, where the directories gfortran and ifort each contain a mod1.mod compiled with the corresponding compiler:
$ ifort -c -I ./ifort/ -module gfortran/ ./mod2.f90
./mod2.f90(2): error #7013: This module file was not generated by any release of this compiler. [MOD1]
use mod1, only: mod1_a
------------^
Removing the -module fixes the failure:
$ ifort -c -I ./ifort/ ./mod2.f90
$
I'll try to ask Intel. For now, the best solution is to let each compile process in phase 2 write to its own directory. Then no compile process will ever read anything else from its module path. That means a lot of directories, each with one file (using the source code filename as the unique directory name).
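The per-file directory idea could look roughly like the sketch below. It is only an illustration: `per_file_module_flags` and the `modules_second_stage/<source name>` layout are hypothetical, and it assumes source file names are unique across the build (otherwise something like a hash of the full path would be needed).

```python
import tempfile
from pathlib import Path
from typing import List


def per_file_module_flags(source: Path, build_output: Path) -> List[str]:
    """Flags for one stage-2 compile: read .mod files from build_output,
    write new .mod files into a directory used by this source file only."""
    # Hypothetical layout: build_output/modules_second_stage/<source name>/
    own_dir = build_output / "modules_second_stage" / source.name
    own_dir.mkdir(parents=True, exist_ok=True)
    return ["-I", str(build_output), "-module", str(own_dir)]


with tempfile.TemporaryDirectory() as tmp:
    build_output = Path(tmp) / "build_output"
    flags_a = per_file_module_flags(Path("algorithm/a_mod.f90"), build_output)
    flags_b = per_file_module_flags(Path("kernel/b_mod.f90"), build_output)
    # Each compile gets its own -module directory, so nothing another
    # process is still writing can be found there.
    distinct = flags_a[-1] != flags_b[-1]
    print(distinct)
```

Since ifort appears to search the `-module` path first, a directory that only ever holds the output of one compile should leave every `use` statement to be resolved from the stage 1 files via `-I`.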
The best option is to change the message and disable two-stage compilation for the Intel compiler.
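A minimal sketch of such a switch, assuming the compiler can be identified by name. `use_two_stage` and the `INTEL_COMPILERS` set are hypothetical and not part of Fab; a wrapper such as mpif90 would first need to be mapped to the compiler it wraps.

```python
import logging

logger = logging.getLogger(__name__)

# Assumption: names that identify Intel Fortran compilers.
INTEL_COMPILERS = {"ifort", "ifx"}


def use_two_stage(requested: bool, compiler_name: str) -> bool:
    """Return whether two-stage compilation should actually be used."""
    if requested and compiler_name in INTEL_COMPILERS:
        logger.warning(
            "Two-stage compilation is disabled for %s: parallel stage 2 "
            "compiles can read incomplete .mod files.", compiler_name)
        return False
    return requested


print(use_two_stage(True, "ifort"))
print(use_two_stage(True, "gfortran"))
```

Logging a warning rather than failing keeps existing build scripts working while making the fallback to single-stage compilation visible.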