Out-of-memory error on Derecho with `FWHIST` `f09_f09_mg17` and a lot of `fincl2` elements
What happened?
I'm not sure whether this is a bug or whether I'm simply trying to output too much data. A simulation failed, and an out-of-memory error appeared in the CESM log file.
The failure only happens when I try to output the photolysis rate constants and species concentrations for the TS1 mechanism (~300 fields). A simulation with an unmodified user_nl_cam file runs to completion, as do runs that split this output into three sets.
Have I pushed fincl2 past the breaking point?
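For anyone trying the split workaround, here is a minimal sketch of dividing one long field list into several `fincl` entries. It assumes the three sets go to consecutive history streams (`fincl2`, `fincl3`, `fincl4`) within a single run. That is an assumption, since the report above only says the output was split into three sets. The helper names `split_fields` and `format_fincl` are hypothetical, and the short field list is just the first few entries of the full one.

```python
def split_fields(fields, n_sets):
    """Split a list of field names into n_sets contiguous, nearly equal chunks."""
    if n_sets < 1:
        raise ValueError("n_sets must be at least 1")
    base, extra = divmod(len(fields), n_sets)
    chunks, start = [], 0
    for i in range(n_sets):
        # The first `extra` chunks each take one additional field.
        size = base + (1 if i < extra else 0)
        chunks.append(fields[start:start + size])
        start += size
    return chunks


def format_fincl(chunks, first_stream=2):
    """Render each chunk as a namelist line: fincl2 = 'A', 'B', ..."""
    lines = []
    for offset, chunk in enumerate(chunks):
        quoted = ", ".join(f"'{name}'" for name in chunk)
        lines.append(f"fincl{first_stream + offset} = {quoted}")
    return "\n".join(lines)


# Illustrative subset of the full list used in this report.
fields = ["jh2o_b", "jh2o_a", "jh2o_c", "jh2o2", "jo2_a", "jo2_b", "jo3_a"]
print(format_fincl(split_fields(fields, 3)))
```

The printed lines could be pasted into `user_nl_cam` in place of the single `fincl2` entry. Whether splitting across streams in one run avoids the error, as opposed to splitting across separate runs, is untested here.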
What are the steps to reproduce the bug?
git clone -b cam_development https://github.com/ESCOMP/CAM.git cam-development
cd cam-development
./manage_externals/checkout_externals
cd cime/scripts
export CASE_DIR=/glade/work/$USER/my-troubled-case
./create_newcase --compset FWHIST --res f09_f09_mg17 --case $CASE_DIR --project P12345678
cd $CASE_DIR
./case.setup
At this point, I modified the user_nl_cam file to output the photolysis rate constants and species concentrations by adding:
fincl2 = 'jh2o_b', 'jh2o_a', 'jh2o_c', 'jh2o2', 'jo2_a', 'jo2_b', 'jo3_a', 'jo3_b', 'jhno3', 'jho2no2_a', 'jho2no2_b', 'jn2o', 'jn2o5_a', 'jn2o5_b', 'jno', 'jno2', 'jno3_b', 'jno3_a', 'jalknit', 'jalkooh', 'jbenzooh', 'jbepomuc', 'jbigald', 'jbigald1', 'jbigald2', 'jbigald3', 'jbigald4', 'jbzooh', 'jc2h5ooh', 'jc3h7ooh', 'jc6h5ooh', 'jch2o_b', 'jch2o_a', 'jch3cho', 'jacet', 'jmgly', 'jch3co3h', 'jch3ooh', 'jch4_b', 'jch4_a', 'jco2', 'jeooh', 'jglyald', 'jglyoxal', 'jhonitr', 'jhpald', 'jhyac', 'jisopnooh', 'jisopooh', 'jmacr_a', 'jmacr_b', 'jmek', 'jmekooh', 'jmpan', 'jmvk', 'jnc4cho', 'jnoa', 'jnterpooh', 'jonitr', 'jpan', 'jphenooh', 'jpooh', 'jrooh', 'jtepomuc', 'jterp2ooh', 'jterpnit', 'jterpooh', 'jterprd1', 'jterprd2', 'jtolooh', 'jxooh', 'jxylenooh', 'jxylolooh', 'jbrcl', 'jbro', 'jbrono2_b', 'jbrono2_a', 'jccl4', 'jcf2clbr', 'jcf3br', 'jcfcl3', 'jcfc113', 'jcfc114', 'jcfc115', 'jcf2cl2', 'jch2br2', 'jch3br', 'jch3ccl3', 'jch3cl', 'jchbr3', 'jcl2', 'jcl2o2', 'jclo', 'jclono2_a', 'jclono2_b', 'jcof2', 'jcofcl', 'jh2402', 'jhbr', 'jhcfc141b', 'jhcfc142b', 'jhcfc22', 'jhcl', 'jhf', 'jhobr', 'jhocl', 'joclo', 'jsf6', 'jh2so4', 'jocs', 'jso', 'jso2', 'jso3', 'jsoa1_a1', 'jsoa1_a2', 'jsoa2_a1', 'jsoa2_a2', 'jsoa3_a1', 'jsoa3_a2', 'jsoa4_a1', 'jsoa4_a2', 'jsoa5_a1', 'jsoa5_a2', 'ALKNIT', 'ALKOOH', 'AOA_NH', 'bc_a1', 'bc_a4', 'BCARY', 'BENZENE', 'BENZOOH', 'BEPOMUC', 'BIGALD', 'BIGALD1', 'BIGALD2', 'BIGALD3', 'BIGALD4', 'BIGALK', 'BIGENE', 'BR', 'BRCL', 'BRO', 'BRONO2', 'BRY', 'BZALD', 'BZOOH', 'C2H2', 'C2H4', 'C2H5OH', 'C2H5OOH', 'C2H6', 'C3H6', 'C3H7OOH', 'C3H8', 'C6H5OOH', 'CCL4', 'CF2CLBR', 'CF3BR', 'CFC11', 'CFC113', 'CFC114', 'CFC115', 'CFC12', 'CH2BR2', 'CH2O', 'CH3BR', 'CH3CCL3', 'CH3CHO', 'CH3CL', 'CH3CN', 'CH3COCH3', 'CH3COCHO', 'CH3COOH', 'CH3COOOH', 'CH3OH', 'CH3OOH', 'CH4', 'CHBR3', 'CL', 'CL2', 'CL2O2', 'CLO', 'CLONO2', 'CLY', 'CO', 'CO2', 'COF2', 'COFCL', 'CRESOL', 'DMS', 'dst_a1', 'dst_a2', 'dst_a3', 'E90', 'EOOH', 'F', 'GLYALD', 'GLYOXAL', 'H', 
'H2', 'H2402', 'H2O2', 'H2SO4', 'HBR', 'HCFC141B', 'HCFC142B', 'HCFC22', 'HCL', 'HCN', 'HCOOH', 'HF', 'HNO3', 'HO2NO2', 'HOBR', 'HOCL', 'HONITR', 'HPALD', 'HYAC', 'HYDRALD', 'IEPOX', 'ISOP', 'ISOPNITA', 'ISOPNITB', 'ISOPNO3', 'ISOPNOOH', 'ISOPOOH', 'IVOC', 'MACR', 'MACROOH', 'MEK', 'MEKOOH', 'MPAN', 'MTERP', 'MVK', 'N', 'N2O', 'N2O5', 'NC4CH2OH', 'NC4CHO', 'ncl_a1', 'ncl_a2', 'ncl_a3', 'NH3', 'NH4', 'NH_5', 'NH_50', 'NO', 'NO2', 'NO3', 'NOA', 'NTERPOOH', 'num_a1', 'num_a2', 'num_a3', 'num_a4', 'O', 'O3', 'O3S', 'OCLO', 'OCS', 'ONITR', 'PAN', 'PBZNIT', 'PHENO', 'PHENOL', 'PHENOOH', 'pom_a1', 'pom_a4', 'POOH', 'ROOH', 'S', 'SF6', 'SO', 'SO2', 'SO3', 'so4_a1', 'so4_a2', 'so4_a3', 'soa1_a1', 'soa1_a2', 'soa2_a1', 'soa2_a2', 'soa3_a1', 'soa3_a2', 'soa4_a1', 'soa4_a2', 'soa5_a1', 'soa5_a2', 'SOAG0', 'SOAG1', 'SOAG2', 'SOAG3', 'SOAG4', 'ST80_25', 'SVOC', 'TEPOMUC', 'TERP2OOH', 'TERPNIT', 'TERPOOH', 'TERPROD1', 'TERPROD2', 'TOLOOH', 'TOLUENE', 'XOOH', 'XYLENES', 'XYLENOOH', 'XYLOL', 'XYLOLOOH', 'NHDEP', 'NDEP', 'ACBZO2', 'ALKO2', 'BCARYO2VBS', 'BENZO2', 'BENZO2VBS', 'BZOO', 'C2H5O2', 'C3H7O2', 'C6H5O2', 'CH3CO3', 'CH3O2', 'DICARBO2', 'ENEO2', 'EO', 'EO2', 'HO2', 'HOCH2OO', 'ISOPAO2', 'ISOPBO2', 'ISOPO2VBS', 'IVOCO2VBS', 'MACRO2', 'MALO2', 'MCO3', 'MDIALO2', 'MEKO2', 'MTERPO2VBS', 'NTERPO2', 'O1D', 'OH', 'PHENO2', 'PO2', 'RO2', 'TERP2O2', 'TERPO2', 'TOLO2', 'TOLUO2VBS', 'XO2', 'XYLENO2', 'XYLEO2VBS', 'XYLOLO2', 'H2O'
Then, I built and ran with:
qcmd -A P12345678 -- ./case.build
./case.submit
The CESM log file contains this:
dec1757.hsn.de.hpc.ucar.edu 897: MPICH ERROR [Rank 897] [job id 5b58b464-ae86-48c0-92b5-3457e8f7c803] [Mon Nov 20 15:01:57 2023] [dec1757] - Abort(403837199) (rank 897 in comm 0): Fatal error in PMPI_Type_create_struct: Other MPI error, error stack:
dec1757.hsn.de.hpc.ucar.edu 897: PMPI_Type_create_struct(166)....: MPI_Type_create_struct(count=138880140, array_of_blocklengths=0x14f8cc7fb010, array_of_displacements=0x14f88a33b010, array_of_types=0x14f8690db010, newtype=0x7ffef5e38fac) failed
dec1757.hsn.de.hpc.ucar.edu 897: MPIR_Type_create_struct_impl(64):
dec1757.hsn.de.hpc.ucar.edu 897: MPIR_Datatype_set_contents(551).: Out of memory
dec1757.hsn.de.hpc.ucar.edu 897:
dec1757.hsn.de.hpc.ucar.edu 897: aborting job:
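As a rough sanity check on the log above, the size of the three argument arrays passed to `MPI_Type_create_struct` can be estimated from `count` alone. This sketch assumes a typical 64-bit MPICH layout: 4-byte `int` block lengths, 8-byte `MPI_Aint` displacements, and 4-byte `MPI_Datatype` handles. Those element sizes are an assumption, not something the log states, although the spacing between the pointer addresses in the message is roughly consistent with them.

```python
# Element sizes in bytes; assumed for a 64-bit MPICH build, not taken from the log.
BYTES_BLOCKLENGTH = 4   # C int
BYTES_DISPLACEMENT = 8  # MPI_Aint
BYTES_TYPE_HANDLE = 4   # MPI_Datatype handle


def struct_argument_bytes(count):
    """Bytes held by the three input arrays for a struct type with `count` blocks."""
    per_block = BYTES_BLOCKLENGTH + BYTES_DISPLACEMENT + BYTES_TYPE_HANDLE
    return count * per_block


count = 138_880_140  # value reported in the PMPI_Type_create_struct error line
total = struct_argument_bytes(count)
print(f"{total} bytes, about {total / 2**30:.2f} GiB on a single rank")
```

Under these assumptions the input arrays alone take about 2 GiB on rank 897, before MPICH stores its own copy of the type contents, which is the step (`MPIR_Datatype_set_contents`) where the log reports running out of memory.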
What CAM tag were you using?
cam6_3_136
What machine were you running CAM on?
CISL machine (Derecho)
What compiler were you using?
Intel
Path to a case directory, if applicable
/glade/derecho/scratch/mattdawson/FWHIST-f09_f09_mg17-cam6_3_136
Will you be addressing this bug yourself?
No
Extra info
I don't mind helping to address this bug, but I may not be the most efficient person to do so.