Design output strategy for RBCs
Some considerations:
- Needs to grow with both the number of timesteps and the number of cells.
- Can be XDR or HDF5; both are built on top of MPI I/O.
- Must be able to switch between more or less output (e.g. barycentre/normal only, or all facets).
Create a wiki page for discussion and share it with Mayeul and Miguel.
I'm personally leaning towards using NetCDF, possibly with the new C++ API.
All formats seem to use HDF5 underneath, or at least something very close to it. They use a hierarchical structure to access data, similar to the way the Unix filesystem works (e.g. a full path to a dataset would be /Group/Subgroup/Dataset). What remains to be decided is how to arrange the RBC data into groups and datasets.

Since the classic NetCDF data model only allows one dimension to be unlimited (HDF5 itself, and NetCDF-4 built on it, allow any number), the safe choice for the expandable dimension is the timestep, which satisfies the first part of point 1 above. This means that to store any number of cells we would need to create a separate dataset/group for each cell, each with the same timestep data, e.g.:
/RedBloodCells/<UUID1>
/RedBloodCells/<UUID2>
...
/RedBloodCells/<UUIDn>
If, instead of one dataset per cell, we used one group per cell, that would also let us satisfy the last point (more or less output per cell), e.g.:
/RedBloodCells/<UUID1>/barycenter
/RedBloodCells/<UUID1>/normal
/RedBloodCells/<UUID1>/facets
/RedBloodCells/<UUID1>/...
...
/RedBloodCells/<UUIDn>/barycenter
/RedBloodCells/<UUIDn>/normal
/RedBloodCells/<UUIDn>/facets
/RedBloodCells/<UUIDn>/...
@mdavezac , @mobernabeu does this seem reasonable? Attach any comments/proposals to this bug and I'll update the wiki page once we decide on a format.
Comment by miguel:
Thanks for compiling the information about the file formats. That was very useful. A few comments below:
- I didn't know there existed two implementations of HDF5, one that supports MPI-IO and one that doesn't. I wouldn't recommend doing IO with a single MPI process, as I expect this would generate too much comms.
- It occurred to me that the number of time steps is known at the beginning of the simulation, so in principle this could be a fixed dimension. When I used HDF5 in the past, I had to support checkpointing and restart, which is why my time step number wasn't a fixed dimension.
- I like the idea of a group per cell, which contains more or less datasets (barycentre, facets, etc.) depending on the sim configuration. In your design, it is not clear to me how the time step is handled. Could you please clarify?
I think the new NetCDF C++ interface can write HDF5 files using MPI-IO. Either that, or we can use the plain C interface to standard HDF5 (with HDF5 itself you have to choose between the C++ API and MPI-IO). With multiple datasets per cell, each would need its own copy of the timesteps as a column in the dataset. Alternatively, there could be a "global" 1D dataset containing the timesteps, and a lookup could be done using the row indices in the cell datasets (every dataset would have the same number of rows, one per timestep). As the timesteps are strictly increasing, HDF5 and formats built on it can look up a specific timestep value quickly.
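To make the "global 1D timestep dataset" idea concrete, here is a plain-Python sketch of the lookup it implies, using the stdlib `bisect` module; the names (`steps`, `find_row`) are illustrative, not HemeLB or HDF5 API. In the real file, `steps` would be the shared 1D dataset and the returned row index would be used to slice every per-cell dataset.

```python
# Sketch only: binary search over a strictly increasing timestep array,
# mimicking the fast lookup the file format would give us.
import bisect

def find_row(steps, query):
    """Return the row index of `query` in the strictly increasing
    `steps` sequence, or None if that timestep was never written."""
    i = bisect.bisect_left(steps, query)
    if i < len(steps) and steps[i] == query:
        return i
    return None

steps = [0, 100, 200, 300, 400]   # one entry per output timestep
row = find_row(steps, 300)        # row 3 indexes into every cell dataset
```

The strict monotonicity of the timesteps is what makes a binary search valid here; unsorted timesteps would force a linear scan.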
Comment by miguel:
What do you mean by "same number of rows"? We don't know a priori for how many time steps a given cell will exist in the domain. Am I misunderstanding this?
I guess the performance of this operation is key when choosing the design. I would expect that the Python script that generates Paraview visualisations based on HemeLB output files will have to interrogate the HDF5 file to obtain "all cells in the domain at time step X". Do you see any issues with this?
No. I had forgotten that each cell would have different timesteps that it is active in the simulation.
None currently. For such a query the Python script would have to iterate through each cell group in the HDF5 file and check whether the queried timestep falls within the range for which that cell is active.
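A rough stdlib-Python sketch of that query (not HemeLB code): each cell group is modelled as a sorted list of the steps it was active for, and the scan keeps the UUIDs whose step list contains the queried step. The function name and the dict representation are assumptions for illustration only.

```python
# Illustrative sketch of the query "all cells in the domain at time step X":
# walk the per-cell groups and keep those whose recorded steps contain X.
import bisect

def cells_at_step(cells, step):
    """cells: {uuid: sorted list of output steps}; return UUIDs active at `step`."""
    found = []
    for uuid, steps in cells.items():
        i = bisect.bisect_left(steps, step)
        if i < len(steps) and steps[i] == step:
            found.append(uuid)
    return found

cells = {
    "cell-a": [0, 100, 200],    # leaves the domain early
    "cell-b": [100, 200, 300],  # enters the domain later
}
```

The cost is linear in the number of cells, which is the point Miguel raises: this scan is what the Paraview script would pay per queried timestep.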
@schmie pointed out this project which seems to do the same as we are trying to do here.
After discussing with Mayeul yesterday, we have decided on the following: the output format will follow H5MD as far as it makes sense to do so. The only violation of the specification is that the author subgroup within the top-level h5md group is omitted, as HemeLB does not know about the user running the simulation. The file format will look like this (using the same notation as the H5MD specification):
H5MD root
\-- h5md
| +-- version: Integer[2]
| \-- creator
| +-- name: String[]
| +-- version: String[]
\-- redbloodcells
| \-- <UUID>
| | \-- box (may be hard link to box group in /redbloodcells if box is the same for all RBCs)
| | | +-- origin: Float[3]
| | | +-- extents: Float[6]
| | \-- moduli (may be hard link to moduli group in /redbloodcells if moduli is the same for all RBCs)
| | | +-- bending: Float
| | | +-- surface: Float
| | | +-- volume: Float
| | | +-- dilation: Float
| | | +-- strain: Float
| | \-- template: (hard link to template subgroup under /templates)
| | \-- barycentre
| | | \-- step: Integer[variable]
| | | \-- time: Float[variable]
| | | \-- value: Float[variable][D]
| | \-- mesh
| | \-- step: Integer[variable]
| | \-- time: Float[variable]
| | \-- value: Float[variable][N][D]
\-- templates
\-- <UUID>
\-- vertices: Float[N][D]
\-- facets: Integer[N][D]
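As a sanity check on the layout above, here is a hypothetical stdlib-Python helper (`expected_paths` is my name, not HemeLB's) that enumerates the group/dataset paths the tree implies for a given set of cell and template UUIDs; something like this could back a test that validates a written file.

```python
# Hypothetical helper: list the HDF5 paths implied by the agreed layout
# for given cell and template UUIDs. Hard links (box, moduli, template)
# still appear as paths under each cell, which is how HDF5 exposes them.
def expected_paths(cell_uuids, template_uuids):
    paths = ["/h5md", "/h5md/creator"]
    for uuid in cell_uuids:
        base = "/redbloodcells/%s" % uuid
        paths += [base + "/box", base + "/moduli", base + "/template"]
        for group in ("barycentre", "mesh"):
            for dataset in ("step", "time", "value"):
                paths.append("%s/%s/%s" % (base, group, dataset))
    for uuid in template_uuids:
        base = "/templates/%s" % uuid
        paths += [base + "/vertices", base + "/facets"]
    return paths
```

This deliberately leaves out the attributes (version, name, moduli entries), which live on the groups rather than as separate paths.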