
Gather scatter

pmarguinaud opened this issue 5 years ago · 13 comments

This is a proposal for improving the current gather/scatter methods.

Some minor modifications have to be applied to Atlas:

  • fields & views of type char
  • atlas::vector dimensioned and indexed by size_t
  • a size method for Distribution

I have assumed that for a distributed field, jloc1 < jloc2 <=> jglo1 < jglo2, jglo being the global index and jloc the local index (on the current task), but I can do without this assumption.

The proposal, although not finalized yet:

  • can process fields of any type and any rank
  • distributes fields with multiple dimensions (multiple levels or variables) evenly
  • could be extended to handle fields with NPROMA, NGPBLKS dimensions
  • is much faster than the current implementation; on my reference test case (2.5 km global, 300 fields), the gather takes 25 s vs 3 minutes on 4 Rome nodes

The proposal includes two new features (these should be implemented as new methods for views):

  • convert any view to a view of byte elements, simply by creating a view with an extra dimension
  • drop a dimension of a view, as in Fortran: if ZFLD is an array of rank 4, then it is possible to derive arrays of rank 3 (e.g. ZFLD(:,:,4,:)), simply by fixing the value of one dimension

These templates are included in the file arrayViewHelpers.h, and make it possible to recursively build a list of views of type atlas::array::ArrayView<byte,2> from any view (any type, any rank). The first dimension of the atlas::array::ArrayView<byte,2> is the fastest varying dimension of the original view, and the second has a length which corresponds to the size of the element type.
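To make the byte-view idea concrete, here is a minimal self-contained sketch of the principle; ByteView2 and asBytes are toy stand-ins, not the actual arrayViewHelpers.h templates:

#include <cstddef>
#include <iostream>
#include <vector>

using byte = unsigned char;

// Toy 2D byte view over a contiguous run of elements: the first index is the
// element (the fastest varying dimension of the original view), the second is
// the byte within the element (length == sizeof of the element type).
struct ByteView2 {
    byte*       data;
    std::size_t nelem;
    std::size_t esize;
    byte& operator()(std::size_t i, std::size_t j) { return data[i * esize + j]; }
};

// Reinterpret a contiguous 1D run of T as an (nelem, sizeof(T)) byte view.
template <typename T>
ByteView2 asBytes(T* p, std::size_t nelem) {
    return ByteView2{reinterpret_cast<byte*>(p), nelem, sizeof(T)};
}

int main() {
    std::vector<double> fld = {1.0, 2.0, 3.0};
    ByteView2 bv = asBytes(fld.data(), fld.size());
    for (std::size_t j = 0; j < bv.esize; ++j)  // copy element 0 into element 2,
        bv(2, j) = bv(0, j);                    // byte by byte, type-agnostically
    std::cout << fld[2] << "\n";                // prints 1
}

Because elements are addressed byte by byte, a single packing routine can serve double, long and char fields alike.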

The GatherScatter class handles only atlas::array::ArrayView<byte,2> views, which makes the code quite simple, as all type and rank issues have already been resolved. This could be even simpler if we could assume that the fastest varying dimension of Atlas fields has a stride == 1 (this should always be the case); in that case, we could restrict ourselves to atlas::array::ArrayView<byte,1> views.

Some further possible improvements:

  • overlap buffer packing/unpacking with communications
  • multiple dimensions are involved (number of MPI tasks, number of fields, size of global and distributed fields); choosing the best dimension for OpenMP parallelism depends on the test case (see the sketch below)
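As an illustration of that dimension choice, here is a sketch of a packing loop parallelized over the field dimension (all names are hypothetical); with few fields and many grid points, putting the OpenMP pragma on an inner loop would balance better:

#include <cstddef>
#include <vector>

// Pack per-field contiguous byte runs into one send buffer, parallelizing
// over the field dimension with OpenMP.
void pack(const std::vector<const unsigned char*>& fld,  // per-field raw data
          const std::vector<std::size_t>& nbytes,        // bytes per field
          const std::vector<std::size_t>& offset,        // field offsets in buf
          unsigned char* buf) {
    const int nfld = static_cast<int>(fld.size());
    #pragma omp parallel for
    for (int jfld = 0; jfld < nfld; ++jfld)
        for (std::size_t j = 0; j < nbytes[jfld]; ++j)
            buf[offset[jfld] + j] = fld[jfld][j];
}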

pmarguinaud · May 22 '20

Codecov Report

Merging #52 into develop will decrease coverage by 1.13%. The diff coverage is 16.20%.


@@             Coverage Diff             @@
##           develop      #52      +/-   ##
===========================================
- Coverage    74.45%   73.32%   -1.14%     
===========================================
  Files          522      530       +8     
  Lines        34852    35734     +882     
===========================================
+ Hits         25949    26201     +252     
- Misses        8903     9533     +630     
Impacted Files                                         Coverage Δ
src/atlas/array/native/NativeArray.cc                  64.32% <0.00%> (-0.71%) ↓
src/atlas/array/native/NativeArrayView.cc              100.00% <ø> (ø)
src/atlas/array/native/NativeMakeView.cc               71.42% <ø> (ø)
src/atlas/grid/Distribution.h                          50.00% <0.00%> (-50.00%) ↓
.../atlas/grid/detail/distribution/DistributionImpl.h  63.63% <0.00%> (-6.37%) ↓
src/tests/gatherscatter/GatherScatter.h                0.00% <0.00%> (ø)
src/tests/gatherscatter/GatherScatter.cc               0.53% <0.53%> (ø)
src/tests/gatherscatter/test_gatherscatter.cc          3.68% <3.68%> (ø)
src/atlas/array/DataType.h                             72.22% <27.27%> (-6.53%) ↓
src/tests/gatherscatter/ioFieldDesc.cc                 33.78% <33.78%> (ø)
... and 30 more

Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update f66e439...6724f9f.

codecov-commenter · May 22 '20

Hi Philippe, thank you for this. It will be a great addition when ready.

Were you, however, aware of the ArraySlicer in Atlas? It allows you to create these "dropped dimension" views:

using namespace atlas::array;
ArrayView<double,4> ZFLD = ...;

To create a slice such as ZFLD(:,:,4,:):

auto slice = ZFLD.slice( Range::all(), Range::all(), 4, Range::all() );

Slice arguments can be a Range type or an integral value. To add an extra dummy dimension of size 1, which could be at any position (innermost in the following example):

auto expanded = ZFLD.slice(Range::all(),Range::all(),Range::all(),Range::dummy());

Check the following for the Range functions: https://github.com/ecmwf/atlas/blob/develop/src/atlas/array/Range.h#L97-L100

wdeconinck · May 26 '20

A use case with a non-stride-1 inner dimension occurs when Atlas is used to wrap existing allocated data, e.g. from the IFS GFL container.

! Fortran
type(atlas_Field) :: field_GFL_4
field_GFL_4 = atlas_Field( GFL( 1:NPROMA, 1:NLEV, 4, 1:NBLK ) )

It does not copy, but wraps the data. Upon field destruction, the data is left untouched.

You are of course perfectly right that in general this won't be the case, and usually the two use cases will not co-exist in the same FieldSet, so we could probably branch between the two cases if it makes a performance difference... (did you check how much can be gained?)

Another use case could be the following, where you want to gather/scatter only one component of a distributed wind field to a global u-component field.

FunctionSpace structured_columns = functionspace::StructuredColumns( grid ,
                                        option::levels(nlev));
Field wind = structured_columns.createField<double>("wind",option::variables(2));
    // shape={ ngp(partition), nlev=100, nvar=2} ,  rank=3
Field u_global = structured_columns.createField<double>("u",option::global());
   // shape={ ngp(GLOBAL), nlev=100 }  , rank=2

gather( wind.slice( Range::all(), Range::all(), 0 ).data(), .... , u_global.data(), ... );
    // ugly API that needs to manually provide strides and shapes etc...

In principle we should also be able to read in global fields from GRIB, where the level dimension is outermost and the grid-point (parallel) dimension innermost, and scatter to distributed fields where the levels are innermost.

auto T_global = Field( "T", array::make_datatype<double>(), array::make_shape(nlev, ngp_global));
auto T_local  = Field( "T", array::make_datatype<double>(), array::make_shape(ngp_local, nlev));

scatter(T_global, T_local);
gather(T_local,T_global);

We need to somehow be able to add metadata to the fields (or functionspace) on what constitutes the (possibly multiple) parallel dimensions.

Is this metadata essentially what the ioFieldDescriptors are? I had attempted something like that with the parallel::Field type (only used within GatherScatter).

wdeconinck · May 26 '20

Hello Willem,

I have looked at the slicer function, which is nice for people working with views of a known rank, but I cannot see how I could use it to reduce a view with N dimensions to a list of views of dimension 1. I have to do that recursively, but I cannot write code such as:

auto slice = ZFLD.slice( Range::all(), Range::all(), 4, Range::all() );

because I need to handle a view of arbitrary rank, and the code above deals with views of rank 4 only.

Hence I thought I could create a function which removes a dimension from a view, whatever its rank: auto w = dropDimension (v, 4, 2);

Please look at dropDimension (arrayViewHelpers.h) and reconsider this point, or tell me how to address my basic need, which is to create a list of views of dimension 1 from a view of any rank. A toy sketch of that recursion follows.
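Here is a minimal self-contained illustration of what I mean; View, dropOuter and collect1D are stand-ins for illustration only, not the actual arrayViewHelpers.h code: fix the outermost index, recurse, and stop at rank 1.

#include <cstddef>
#include <iostream>
#include <vector>

// Toy strided view of doubles; the rank is shape.size().
struct View {
    double*                  data;
    std::vector<std::size_t> shape;
    std::vector<std::size_t> strides;  // in elements
};

// Fix index i of the outermost dimension -> a view whose rank is one lower.
View dropOuter(const View& v, std::size_t i) {
    View w;
    w.data    = v.data + i * v.strides[0];
    w.shape   = std::vector<std::size_t>(v.shape.begin() + 1, v.shape.end());
    w.strides = std::vector<std::size_t>(v.strides.begin() + 1, v.strides.end());
    return w;
}

// Recursively reduce a view of arbitrary rank to a flat list of rank-1 views.
void collect1D(const View& v, std::vector<View>& out) {
    if (v.shape.size() == 1) { out.push_back(v); return; }
    for (std::size_t i = 0; i < v.shape[0]; ++i)
        collect1D(dropOuter(v, i), out);
}

int main() {
    std::vector<double> data(2 * 3 * 4);
    View v{data.data(), {2, 3, 4}, {12, 4, 1}};  // C-ordered 2x3x4 view
    std::vector<View> list;
    collect1D(v, list);
    std::cout << list.size() << " rank-1 views of length "
              << list[0].shape[0] << "\n";       // 6 rank-1 views of length 4
}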

Philippe

pmarguinaud · May 26 '20

Hello again,

This is the code you wrote:

! Fortran
type(atlas_Field) :: field_GFL_4
field_GFL_4 = atlas_Field( GFL( 1:NPROMA, 1:NLEV, 4, 1:NBLK ) )

In this case, the fastest varying dimension is 1:NPROMA; it is indeed contiguous, and has a stride of 1. This is what I would like to assume: that fields involved in I/O have a fastest varying dimension which is contiguous.

Is it reasonable to make this assumption or not?

Philippe

pmarguinaud · May 26 '20

Hello again,

What you ask for:

FunctionSpace structured_columns = functionspace::StructuredColumns( grid ,
                                        option::levels(nlev));
Field wind = structured_columns.createField<double>("wind",option::variables(2));
    // shape={ ngp(partition), nlev=100, nvar=2} ,  rank=3
Field u_global = structured_columns.createField<double>("u",option::global());
   // shape={ ngp(GLOBAL), nlev=100 }  , rank=2

gather( wind.slice( Range::all(), Range::all(), 0 ).data(), .... , u_global.data(), ... );
    // ugly API that needs to manually provide strides and shapes etc...

This looks very feasible to me with what I propose. In fact the ioFieldDescriptor lets the user distribute the fields in any possible way. It would be possible to choose a target processor for any of the 1:NGPTOT or 1:NPROMA/1:NGPBLKS fields.

Philippe

pmarguinaud · May 26 '20

You are completely right about the slicing :) I was considering this as well after I wrote my reply, but still wondered whether it could be useful or whether you were already aware of it. It will be good to have. Note however that the return type of a slice is a LocalView. The semantic difference is that an ArrayView gets created from an Array, which contains either a NativeDataStore or a GridToolsDataStore. The latter is GPU-aware, and the ArrayView can then be either on host or on device. A LocalView is more straightforward, created as a slice from an ArrayView, but with exactly the same API.

This will perhaps mess up the API of gather/scatter, which expects an ArrayView. Perhaps that can be templated to accept both LocalView and ArrayView.

wdeconinck · May 26 '20

I have to admit that I did not have this in mind at all:

auto T_global = Field( "T", array::make_datatype<double>(), array::make_shape(nlev, ngp_global));
auto T_local  = Field( "T", array::make_datatype<double>(), array::make_shape(ngp_local, nlev));

scatter(T_global, T_local);
gather(T_local,T_global);

I do not think this happens in ARPEGE/IFS for grid-point fields; we only do that kind of thing for spectral data, but spectral data requires a slightly different treatment.

Achieving something like that is probably possible, I need to think about it. It is clear we would need to specify for each field which dimension is the grid index.

In any case, I think the assumption I suggested making (i.e. a contiguous fastest varying dimension) would not survive it.

Philippe

pmarguinaud · May 26 '20

Hello Willem,

I can make this code work:

auto T_global = Field( "T", array::make_datatype<double>(), array::make_shape(nlev, ngp_global));
auto T_local  = Field( "T", array::make_datatype<double>(), array::make_shape(ngp_local, nlev));

scatter(T_global, T_local);
gather(T_local,T_global);

It is sufficient that the user explicitly states which dimension is the grid index; I can then drop all other dimensions while reducing the view to a list of 1D views. The same goes for NPROMA/NGPBLKS dimensions; I can easily add a blocking dimension. Note that fields without NPROMA (that is, with a dimension 1:NGPTOT) are just a special case (a single block, with NPROMA = NGPTOT), as the small sketch below shows.
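A minimal sketch of the block arithmetic (0-based indices, assuming the usual convention that consecutive grid points fill blocks of length NPROMA):

#include <cstdio>

int main() {
    const int NGPTOT = 10, NPROMA = 4;              // last block partially filled
    const int NGPBLKS = (NGPTOT + NPROMA - 1) / NPROMA;
    for (int jloc = 0; jloc < NGPTOT; ++jloc) {
        const int jblk = jloc / NPROMA;             // block index
        const int jrof = jloc % NPROMA;             // position within the block
        std::printf("jloc=%d -> (jrof=%d, jblk=%d)\n", jloc, jrof, jblk);
    }
    std::printf("NGPBLKS=%d\n", NGPBLKS);
    // With NPROMA == NGPTOT there is a single block: the unblocked 1:NGPTOT case.
}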

To be more explicit, the ioFieldDesc objects are essentially this:

  atlas::array::ArrayView<byte,2> _v;  // pointer to a grid section (1:NGPTOT) x (1:sizeof (double) or sizeof (long), etc.)
  const std::vector<atlas::idx_t> _ind;  // Values of dimensions we had to drop to build this view
  const atlas::Field & _f;             // Reference to the original field (access to its metadata)

ioFieldDesc objects therefore hold a chunk of field data and associated metadata (for fields with a level, the _ind member would hold the level index).

I do not know a lot about the difference between ArrayView and LocalView, but I am pretty sure we can sort out the problem using templates.

I have also made this assumption:

I have assumed that for a distributed field, jloc1 < jloc2 <=> jglo1 < jglo2, jglo being the global index and jloc the local index (on the current task), but I can do without this assumption.

This simplifies things (I do not need a functionspace anymore, but only the distribution object). I would like you to comment on that.

You also address the issue of metadata. These are small and of unpredictable size; therefore, it would be better not to pack them with the field data, but rather to distribute them using an allgather, as sketched below. The situation would be different for an IO server, but we can look at that later.
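To illustrate, here is a sketch of that pattern in plain MPI (allgatherMetadata is a hypothetical helper; real code would go through the Atlas/eckit mpi layer): first allgather the sizes, then allgatherv the payload bytes.

#include <mpi.h>
#include <string>
#include <vector>

// Gather one variable-length metadata blob (a string here) from every task
// onto every task.
std::vector<std::string> allgatherMetadata(const std::string& mine, MPI_Comm comm) {
    int nproc;
    MPI_Comm_size(comm, &nproc);

    int mysize = static_cast<int>(mine.size());
    std::vector<int> sizes(nproc);
    MPI_Allgather(&mysize, 1, MPI_INT, sizes.data(), 1, MPI_INT, comm);

    std::vector<int> displs(nproc, 0);
    for (int i = 1; i < nproc; ++i) displs[i] = displs[i - 1] + sizes[i - 1];
    std::vector<char> buf(displs[nproc - 1] + sizes[nproc - 1]);

    MPI_Allgatherv(mine.data(), mysize, MPI_CHAR,
                   buf.data(), sizes.data(), displs.data(), MPI_CHAR, comm);

    std::vector<std::string> out(nproc);
    for (int i = 0; i < nproc; ++i) out[i].assign(buf.data() + displs[i], sizes[i]);
    return out;
}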

Please also keep in mind that we are breaking the MPI standard (packing double/long data as bytes), but this allows much more flexibility.

pmarguinaud · May 27 '20

  • Very nice that the transposition could work! A use case would be, for example, to distribute horizontal levels obtained from GRIB into a different transposed or blocked (NPROMA) configuration.

  • The assumption that jloc1 < jloc2 <=> jglo1 < jglo2 is not guaranteed, especially as I have added mesh renumbering algorithms (Hilbert space-filling curve).

  • It would be nice if we did not need to pass a functionspace. Currently one GatherScatter object exists per functionspace. Could we not pass the extra information contained within the functionspace to the GatherScatter setup? We can assume that all fields will share the same functionspace.

  • Why is casting to MPI_CHAR breaking the standard? We are at least not doing any reductions, only message passing.

wdeconinck · May 27 '20

I will provide some tests/examples with NPROMA/NGPBLKS and NGPTOT/NFLEVG. It will take a few days.

pmarguinaud · May 27 '20

I will provide some tests/examples with NPROMA/NGPBLKS and NGPTOT/NFLEVG. It will take a few days.

Thank you Philippe, having this functionality will be really great.

wdeconinck · May 27 '20

Hello Willem,

I have added two new test cases (I use Fortran order to describe them):

  • (1:NFLEVG,1:NGPTOT,1:NFLDS) <-> list of (1:NGPTOTG) fields

  • (1:NPROMA,1:NFLEVG,1:NGPBLKS) <-> list of (1:NGPTOTG) fields

Both of them are included in test_gatherscatter.cc. Please look at them and see whether you find them sensible.

There are other points we should sort out:

  • the LocalView/ArrayView distinction; it is not 100% clear to me, but you said I have to do something about it
  • the assumption jloc1 < jloc2 <=> jglo1 < jglo2, and how I can do without it; maybe an example where it does not hold would help me

Regards,

Philippe

pmarguinaud · Jun 04 '20