Copyright 2007. Los Alamos National Security, LLC. This material was produced under U.S. Government contract DE-AC52-06NA25396 for Los Alamos National Laboratory (LANL), which is operated by Los Alamos National Security, LLC for the U.S. Department of Energy. The U.S. Government has rights to use, reproduce, and distribute this software. NEITHER THE GOVERNMENT NOR LOS ALAMOS NATIONAL SECURITY, LLC MAKES ANY WARRANTY, EXPRESS OR IMPLIED, OR ASSUMES ANY LIABILITY FOR THE USE OF THIS SOFTWARE. If software is modified to produce derivative works, such modified software should be clearly marked, so as not to confuse it with the version available from LANL.

Additionally, this program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. Accordingly, this program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

           Sandia/LANL DNS code
                Mark Taylor
             [email protected]

=================================================================== README: updated 3/19/2013:
added 'pseudospectral.pdf' to svn repo

README: updated 2/20/2009:

Added support for the SDSC P3DFFT parallel FFT library.
P3DFFT is up to 2x faster than the our internal transpose + FFTW code.
To use P3DFFT, build the "dnsp3" model.
This obsoletes are previous "dnsp" optimized model.

README: updated 2/5/2009:

Added new optimizations to dnsp code to skip on processor transpose_to/from_y operations. To enable, padding must be "0 2 0" and ref decomp must be y-pencil. The vorticity3 trick uses some ffts with this optimization and some ffts without, so one should benchmark with and without "-nov3" To verify that this optimization is enabled, the code will print this message to stdout:

"Ref decomp is y-pencil decomp: skipping transpose_to_y calls"

README: updated 9/26/2008:

Updated to note that for the pure x-y slab decomposition for a problem of size N^3, NCPUS can be as large as N. (we used to have to switch to a pencil decomposition for NCPUS>N/2 )

README: updated 7/22/2008:
minor edits

README: updated 1/20/2008:

We now support the exact dealiasing based on phase shifting + spherical truncation. To measure/test dealiasing error, see testing/dealias.sh.

README: updated 8/16/2007:

For pencil decompositions, we now (8/15/2006) have a more efficient model "dnsp" that should be used instead of "dns" (but dnsp does not yet allow for passive tracers) To compile, type "make dnsp" instead of "make dns", and the executable will be named "dnsp".

===================================================================

This directory and its subdirectories contain the Sandia/LANL DNS code. This code solves viscous fluid dynamics equations in a periodic rectangular domain (2D or 3D) with a pseudo-spectral method or 4th order finite differences and with the standard RK4 time stepping scheme.

The code is documented in [1] Taylor, Kurien and Eyink, Phys. Rev. E 68, 2003. [2] Kurien and Taylor, Los Alamos Science 29, 2005. [3] 'pseudospectral.pdf' included with the source code

The code has options to solve the following equations:

Navier-Stokes (primitive variables) 2D and 3D
Navier-Stokes (stream function/vorticity) 2D only
Lagrangian Averaged Navier Stokes (The Alpha Model) 2D and 3D
Shallow water and Shallow water alpha 2D only
Boussinesque (with stratification) 3D only

In can optionally allow for rotation, arbitrary aspect ratio and an arbitrary number of passive scalars. It is an MPI code and can use a 3D domain decomposition, (allowing for slab decomposition, pencil decomposition or cube decomposition). For a grid of size N^3, it can run on up to N^3/8 processors. It has been run on as many as 18432 processors (with grids up to 4096^3)

This README file contains instructions to compile and run the purest form of the code: Navier-Stokes with deterministic low waver number forcing, use RK4 and a pseudo-spectral method. The equations are solved in a triply periodic cube of side length 1 with no passive scalars or rotation. The deterministic low wave number forcing is a simplified version of Overholt and Pope, Comput. Fluids 27 1998, and is documented in detail in [1]. Some more, but incomplete documentation for this code are in the files: pseudospectral.tex and dns.doc Documentation for running the Boussinesq version is in rotation_doc/rotation.tex.

The steps required to compile and run the code are:

Choose the resolution
set up the grid dimensions
compile the code
edit the input file, forcing12.inp or forcing12-bench.inp
run
validate the results
analysis of output

Step 0. Choose the resolution and resolution requirements.

The parameters set by the user are the grid resolution and viscosity coefficieint. The forcing we use is such that epsilon=3.58 and KE=1.89, and so the eddy turnover time is 1.05. (These values are from N=1024^3. At other values of N, they may change slightly) But if we assume these values of KE and epsilon, we can then determine the correct viscosity to use for a given N:

grid of size: N^3 resolution constraint: G = eta*kmax

Suggested value of G for this problem (forced low wave number turbulence) G=1.0. (For improved resolution in the viscous regime, some people use the more restrictive G=1.5)

The code supports several types of dealasing:

partial spherical (the most efficient, not fully dealiased)
2/3 rule (the most efficient in terms of retained modes)
phase shifting (the most efficient in terms of maximum spherical wave number)

with phase shifting or partial spherical, kmax = 2piNsqrt(2)/3 with 2/3 dealising: kmax = 2pi*N/3

(the 2pi shows up in the wave number because our box is of side length 1).

Given the choice of G and N, we can then determine the viscosity.

Using that eta = (mu^3/epsilon)^.25, we have:

mu = epsilon**(1/3) * [G/kmax]**4/3

The resulting Taylor Reynolds number:

Rl = KE sqrt(20/(3muepsilon))

Example 1 1024^3, with 2/3 dealiasing and G=1. The viscosity coefficient used should be: mu = 5.5e-5.
The expected Rl = 347

Example 2: 1024^3, with spherical dealiasing and G=1. The viscosity coefficient used should be: mu = 3.5e-5 The expected Rl = 437

Example 3: 64^3, with spherical dealiasing and G=1. The viscosity coefficient used should be: mu = .0014 The expected Rl = 69

Step 1. Set up the grid dimensions.

I use a python script, 'gridsetup.py' which will create the needed 'params.h' file. To run an N^3 problem, with a domain decompostion of nxnynz (so NCPUS=nxnynz):

  ./gridsetup.py  nx ny nz  N N N 0 0 2

The 0,0,2 specifies how the arrays are padded in x,y,z directions. 0,0,2 is required when using P3DFFT. When using the internal parallel fft, 2,0,0 is optimal. P3DFFT requires that nx=1.

Here are some examples: For P3DFFT, which only supports slab and pencil decompositions:

  A 32x32x32 grid, to run using just 1 cpu:
  % cd dns/src
  % ./gridsetup.py 1 1 1 32 32 32 0 0 2

  A 64x64x64 grid, on 4 cpu:
  (parallel decomposition: 1x1x4, so 4 hyperslabs in the z-direction)
  % cd dns/src
  % ./gridsetup.py 1 1 4 64 64 64 0 0 2

  A 64x64x64 grid, on 64 cpu:
  (parallel decomposition: 2x1x32, so a pensil decomposition)
  % cd dns/src
  % ./gridsetup.py 1 2 32 64 64 64 0 0 2

The DNS code's internal parallel FFT supports cube decompositions and also supports slab decompositions up to size N: % ./gridsetup.py 1 1 4 64 64 64 2 0 0 % ./gridsetup.py 2 2 4 64 64 64 2 0 0 % ./gridsetup.py 1 1 64 64 64 64 2 0 0

For a grid N^3, N must be a power of 2,3,5, and N/nx, N/ny and N/nz must be an even integers (with one exception for the case nx=1, ny=1, nz=N, which is allowed).

There are some other restrictions because of how we wrote our transpose routines. gridsetup.py will issue warnings if they are violated, and, if the code is run it will print error messages and stop.

If you do not have python installed on your system, you can instead copy the file 'params.h.test' to params.h' and then edit params.h by hand. By default, it is set up to run a 32^3 simulation on 1 cpu.

NOTE: The most efficient configuration for large processor counts will be with ny=1 and nx<nz. For example, to run a 4096^3 simulation on 8192 processors, I would modify the instructions below to:

./gridsetup.py 1 4 2048 4096 4096 4096 0 0 2 Other possibilities which might be more efficient: (we need a performance model :-) ./gridsetup.py 1 8 1024 4096 4096 4096 0 0 2 ./gridsetup.py 1 16 512 4096 4096 4096 0 0 2 ./gridsetup.py 1 32 256 4096 4096 4096 0 0 2 ./gridsetup.py 1 64 128 4096 4096 4096 0 0 2

Step 2. Compilation

A makefile is included which should run on Linux, SGI, OSF1 (Compaq) AIX and SunOS, but some editing will probably be required. On Linux, the makefile by default used the Intel F90 compiler, but you can edit the file and switch this to Lahey or PGI. For the other systems, it uses the vendor supplied F90 compiler.

P3DFFT: For the best performance with high resolution runs on large processor counts, use the version of the code that used the SDSC p3dfft() parallel FFT. Both P3DFFT and FFTW must be built and installed in advance. Currently we require that P3DFFT be built double precision and with the -DSTRIDE1 option. You must also edit the makefile to build with -DUSE_P3DFFT -DUSE_FFTW and configure appropriate include and lib paths. Then:

  % cd dns/src
  % make dnsp3

To use the DNS code's internal parallel FFT:

  % cd dns/src
  % make dns

There is also an optimized version of dns, "dnsp", with uses the DNS code's internal parallel FFT and is but limited to pencil decompositions like P3DFFT. But "dnsp" is no longer supported since "dnsp3" is faster.

For low resolutions on moderate numbers of processors, there is almost no difference between dns, dnsp and dnsp3.

Step 3. Edit the input file

There are two choices for input files for this forced case in
the src directory:

  forcing12.inp         runs 1 eddy turnover time, with output

  forcing12-bench.inp   runs 5 timesteps, only diagnostic output
                        reports cpu time per timestep, averaged
                        over the last 4 timesteps.
                         
Parameters of interest:

viscosity coefficient mu (line 9)    change to value computed above
                                     in step 0.
derivative method (line 14)          set to "fft-dealias" for 2/3 rule 
                                     (exact dealiasing)
                                     or "fft-sphere" for spherical dealising
                                     (partial dealiasing, suitable for k^-5/3 or steeper spectra)
                                     or "fft-phase" for phase shifted + spherical dealising
                                     (exact dealiasing, 2x more expensive) 
                                        
time to run (line 26)                time is measured in dimensional units
                                     For this problem, 1 eddy turnover time
                                     (after the code equilibriates) is 
                                     close to 1 dimensional time, so set
                                     this to 1.0

Step 4. Run the code: To run the code on a single processor:

    ./dnsp3  -i forcing12-bench.inp output_name  

using the input file 'forcing12-bench.inp'.  This runs a triply 
periodic forced turbulence problem.  The forcing is in
wave numbers 1 and 2. 

The output files are:

    output_name0000.0000.u         u component of velocity
    output_name0000.0000.v         v component of velocity
    output_name0000.0000.w         w component of velocity

    output_name0000.0000.spec      power spectrum (1D and spherical)
    output_name0000.0000.spect     transfer spectrum

    output_name0000.0000.scalars         KE, dissipation rates, etc...
    output_name0000.0000.scalars-turb    skewness, other scalars...

where 0000.0000 is the time of the snapshot.  So for t=0.25,
the filename would be:  output_name0000.2500.u.


For a parallel run:
    mpirun -np X  ./dnsp3 -mio -i forcing12.inp output_name  

The "-mio" option will turn on MPI-IO.  Without MPI-IO, all output
will be funneled through processor 0.  With MPI-IO, the code will
still produce identical files, but will have up to M
processors write data with asynchronious non-overlapping
writes.  The default value of M is 32, but it can be tweaked by
editing subroutine "mpi_io_init".   M can be set to be equal to
the number of processes if your MPI-IO library and/or parallel file
system has good I/O aggregation.


These output files are also used as restart files.  
The restart is exact. To do a restart run copy (or create links)
the snapshot to used to restart.u, restart.v and restart.w,
and then add the "-r" option when running the code.

Step 5. Validate the results.

There are several test cases and scripts which run short problems and make sure the output is identical to the reference solutions. (These are not yet documented.)

A quick test: run the 64^3 problem, with the forcing12.inp input file, with mu=.0014, "fft-sphere", and time=1.0.

To run this test on 2 processors: cd src ./gridsetup.py 1 1 2 64 64 64 2 2 0 make dnsp3 ./dnsp3 -i forcing12.inp

To run on more processors, see scripts/readme.job

At simulation time 1.0, the data (given in stdout) should be approximatly:

*** dns code w/ FFT99 ******************************************************** AMD Athlon PGI compiler g95 AMD X2 FC4 FC6 pentium M powerPC F7 gfortran
intel gfortran Linux RHEL4 MAC OS10.4 2.1GHZ DDR2-800 2cpu 1 core 2 core kmax*eta = .9990 1.0214 .9839 .9626 1.0214 same
KE =1.6800 1.6265 1.6828 1.7197 1.6265
R_lambda =61.47 62.22 59.7 58.4 62.22
run time 20min 22.8 66min 12.65 8.41 4.96

2.4ghz quad core, DDR2-666, gfortran, 4 cores: 1.30min

*** dnsp code w/ FFT99 ******************************************************

        intel compiler  PGI compiler    g95         intel on Thunderbird
          AMD Athlon    pentium M      2ghz powerPC   intel x86_64
          Linux FC4     Linux RHEL4    MAC OS10.4     Linux

CPUS = 1cpu 2cpu 1x1x32 2x1x32 4x1x32 kmax*eta = .9626 same .9990 SAME SAME KE = 1.7197 1.6801 R_lambda = 58.4 61.47 d/dt vis = -4.1263 -3.5569
d/dt f = 4.3034 4.5351 run time = 14.5m 11.4m .42m

Other collected run times: 2.4ghz Xeon Linux, ifort, 1 cpu: 13.81 2.4ghz quad core, DDR2-666, gfortran, 1 cores: 2.97 2.4ghz quad core, DDR2-666, gfortran, 4 cores: 1.26min

*** dnsp3 code (P3DFFT) ************************************************ 2.4ghz quad core, DDR2-666, gfortran, 4 cores: 1.23min

kmax*eta = 0.9641 KE = 1.6525 R_lambda = 58.9 d/dt vis = -3.747 d/dt f = 4.122

The initial condition has random phases, and so these results are compiler and OS dependent. I haven't done detailed sensitivity study, but you can get a sence of the fluctuations by looking at the above numbers.

Step 6. Looking at the data.

Not yet documented - contact Mark Taylor ([email protected]) for help.

There are matlab scripts in the dns/matlab directory for reading and processing all the output produced by the code.

Some more complex analysis, such as computing structure functions and PDFs is done with fortran programs in the dns/src directory.

dns
dns copied to clipboard

Metadata

← Metadata

Owner

Metadata

dns dns copied to clipboard

Metadata

← Metadata

Owner

Metadata

dns
dns copied to clipboard