Complex Numbers: define tuple of two as integral?
It might be useful to allow tuples of two numbers to be defined as an integral type for complex components of records.
Alternatively, one would need a consistent naming convention for the components.
numpy complex types - how do h5py (FAQ: HDF5 struct) and the official HDF5 standard handle these?
@ax3l
Careful: HDF5 Fortran support for compound data types is pretty bad.
Additionally, other formats may not directly support complex numbers at all. To maximize portability, I would vote for the standard mandating complex storage as real and imaginary datasets or as amplitude and phase datasets. If using amplitude and phase, should the conversion factor to SI convert to radians or to radians/2pi?
Side note: ADIOS2 has first-class complex support.
> To maximize portability, I would vote for the standard mandating complex storage as real and imaginary datasets or amplitude and phase datasets.
I am not entirely decided on that yet; maybe some formats just don't support complex and I am really intrigued to push those to improve instead. We don't have "required" complex records or attributes yet besides the geometry `thetaMode`, where we split into real and imaginary so far. Amplitude and phase is a good point as well... actually, a phase normalized to -0.5:0.5 would be best from an IEEE floating-point point of view ;-)
> I am not entirely decided on that yet, maybe some formats just don't support complex and I am really intrigued to push those to improve instead.
I'm not sure what you mean by "intrigued". One way to handle this would be to say that if the format supports complex numbers then that should be used, and if not, then use real/imaginary or amp/phase sub-datasets.
> We don't have "required" complex records or attributes yet besides the geometry `thetaMode`
Actually for the beam physics extension there is the photon polarization which is complex.
@ax3l And HDF5 does not directly support complex numbers, so this is a portability problem for the files I will be creating. So how about this for the standard:
If the format supports complex numbers as a native type then use the native type; otherwise store the complex numbers in sub-groups as real and imaginary with labels `real` and `imaginary`, or as amplitude and phase with labels `amplitude` and `phase`. If using phase, the phase numbers when multiplied by `unitSI` should give radians.
I would not mandate that the phase be in any specific interval like [-pi, pi] since the integer part of the phase may be important.
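A minimal h5py sketch of the two proposed layouts; the record path `fields/Ex` and file name are hypothetical, the group and label names follow the proposal above:

```python
import h5py
import numpy as np

z = np.array([1.+2.j, 3.+4.j])  # complex record data

with h5py.File('example.h5', 'w') as f:
    # real/imaginary variant
    g = f.create_group('fields/Ex')        # hypothetical record path
    g['real'] = z.real
    g['imaginary'] = z.imag

    # amplitude/phase variant
    p = f.create_group('fields/Ex_polar')  # hypothetical record path
    p['amplitude'] = np.abs(z)
    phase = p.create_dataset('phase', data=np.angle(z))  # np.angle returns radians
    phase.attrs['unitSI'] = 1.0  # stored phase * unitSI must give radians
```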
Question: If the format supports complex numbers, should a writer have the option to not use the native complex type and instead store with re/im or amp/phase sub-groups?
Note: Since the beam physics extension is using complex numbers in more than one place, if a convention for complex numbers does not get into the main standard, then a convention will have to be put into the extension.
It looks like pytables and h5py both use 'r' and 'i' as names for the compound dataset. Actually, FORTRAN has had a complex datatype since at least FORTRAN 77. I don't like also specifying phase and amplitude - the user should convert.
> It looks like pytables and h5py both use 'r' and 'i' as names for the compound dataset.
Yes using 'r' and 'i' to be compatible with pytables and h5py would be a good idea.
> I don't like also specifying phase and amplitude - the user should convert.
Fine with me.
Yes, we should also stay by default with what C11 and C++ do: a simple POD datatype of two `float`, `double`, or `long double` values in the order real, then imaginary. We can easily add this to the C++ and Python frontend APIs. Fortran seems to use the same layout.
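For illustration, numpy's `complex128` already uses exactly this layout, so reinterpreting the bytes as `float64` exposes the real/imaginary interleaving:

```python
import numpy as np

z = np.array([1.+2.j, 3.+4.j], dtype=np.complex128)
# each element is two contiguous float64 values, real part first,
# matching the C11/C++/Fortran storage convention
print(z.view(np.float64))
# [1. 2. 3. 4.]
```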
The problem is less how to lay out the complex and more how to describe it on the low level in file formats without introducing too many conventions we have to carry. ADIOS' `.bp` format supports complex, which covers us for the data "round-trip" (writing it will allow us to identify it again). For HDF5 we have to decide on one of the various annotations because they did not standardize it... I guess we will just do whatever `h5py` does, since it is the most popular implementation and, as Chris said, `pytables` seems to follow the same convention.
The Fortran issue with querying complex numbers mentioned above remains. Have you by any chance tried reading/writing an `h5py` complex from/to Fortran?
Chris has been using h5py. I have not.
@ChristopherMayes did writing a data set and/or attribute with complex numbers in `h5py`, and then identifying it as complex and reading it via Fortran - and vice versa - work for you? Just looking for someone with experience in this so we can find a programmatically compatible solution.
As far as C++ and Python is concerned, we will just add it here: https://github.com/openPMD/openPMD-api/issues/203 and relax potential wording in the standard.
How would this new standard definition affect ℂ³ datasets as used in PIConGPU?
@PrometheusPi
> How would this new standard definition affect ℂ³ datasets as used in PIConGPU?
Here there are three complex numbers (x, y, z). PIConGPU uses a single level to store all 6 (real and imaginary) arrays. The convention for openPMD would be to have x, y, and z groups and then store "r" and "i" arrays as datasets of these groups. That is, there would be two levels.
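A short h5py sketch of that two-level convention; the record path `fields/E` and file name are hypothetical:

```python
import h5py
import numpy as np

# a small C^3 record: one complex array per component
components = {c: np.array([1.+2.j, 3.+4.j]) for c in ('x', 'y', 'z')}

with h5py.File('c3.h5', 'w') as f:
    record = f.create_group('fields/E')   # hypothetical record path
    for name, z in components.items():
        comp = record.create_group(name)  # first level: x, y, z
        comp['r'] = z.real                # second level: real part
        comp['i'] = z.imag                # second level: imaginary part
```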
@ax3l @ChristopherMayes Since the beam physics extension is actively being used, and since the question of complex storage has come up, I added a note to the extension on how complex numbers should be stored in line with this discussion. The idea is that when the 2.0 version is finalized, this text can be transferred to the base standard.
For a complex array Z, having a group `Z/` with arrays `re` and `im` or `r` and `i` seems consistent with the rest of the standard.
@ChristopherMayes So right now the note I wrote mandates "r" and "i" to be consistent with h5py.
> How would this new standard definition affect ℂ³ datasets as used in PIConGPU?
You could then store them as a regular record with record components of type ℂ. Note to the reader: so far it is an in situ plugin of PIConGPU that uses this, not the main I/O (plugin) routines.
> having a group Z/ with arrays re and im or r and i seems consistent with the rest of the standard.
I think generally extensions can do this, but it makes parsing by a pure base-standard reader more complicated. My idea is actually, like the title of the issue, to just allow integral complex types (just like `float`), since I think HDF5 and ADIOS support this well for pairs of single and double precision. Yes, compatible with the C11 (and `h5py`), C++, and Fortran standard types for complex; one would only mandate the order and that both are stored as a plain-old-data (POD) struct in r-i order (no padding, aliasing, etc.).
That's at least something that one can implement in base-standard readers and writers, and it's exactly what `h5py` is doing as well. Although extensions can play with sub-groups as they like, adding another group for such a fundamental type adds complexity to the base standard that might not be needed with an integral type, and it would complicate further extensions, such as MR storage groups later on.
This is how this is handled in HDF5 by the `h5py` convention:
Example code:
```python
import h5py as h5
import numpy as np

x = np.array([1.+2j, 3.+4j, 5.+6j], np.cdouble)
x
# array([1.+2.j, 3.+4.j, 5.+6.j])
x.dtype
# dtype('complex128')

f = h5.File('complex.h5', 'w')
f['/mydata'] = x
f.close()
```
This creates an `H5T_COMPOUND` with `r` and `i` labels:
```console
$ h5ls complex.h5
mydata                   Dataset {3}
$ h5ls -d -r complex.h5
/                        Group
/mydata                  Dataset {3}
    Data:
        (0) {1, 2}, {3, 4}, {5, 6}
$ h5dump complex.h5
HDF5 "complex.h5" {
GROUP "/" {
   DATASET "mydata" {
      DATATYPE  H5T_COMPOUND {
         H5T_IEEE_F64LE "r";
         H5T_IEEE_F64LE "i";
      }
      DATASPACE  SIMPLE { ( 3 ) / ( 3 ) }
      DATA {
      (0): {
            1,
            2
         },
      (1): {
            3,
            4
         },
      (2): {
            5,
            6
         }
      }
   }
}
}
```
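Reading the file back with `h5py` maps the compound straight to a numpy complex dtype, and a generic reader can recognize the convention by inspecting the compound member names. A sketch, assuming the `complex.h5` file from above:

```python
import h5py as h5

with h5.File('complex.h5', 'r') as f:
    ds = f['/mydata']
    print(ds[:].dtype)  # complex128 -- h5py maps the {r, i} compound back

    # a generic reader can detect the convention from the compound members
    tid = ds.id.get_type()
    names = [tid.get_member_name(k) for k in range(tid.get_nmembers())]
    print(names)        # [b'r', b'i']
```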
ADIOS1 and ADIOS2 have native type support for single and double complex.
@ax3l So how would you word this in the standard? What happens if someone is using a format that supports neither complex nor compound types?
Also, with your example above, do you have non-h5py code that can read this in? I would be curious to see this (or how to write this using non-h5py code).
@ChristopherMayes Have you tried to read in grid field files with complex fields generated by Bmad using h5py?
> So how would you word this in the standard? What happens if someone is using a format that supports neither complex nor compound types?
I would propose to just allow complex floating point types as integral types.
File formats that do not support complex integral types simply cannot store complex types, I would say (aka some file formats are not suited for all scientific data). We intentionally do not try to re-invent file formats, and a complex is literally just a `value[2]` or equivalently `struct{ T r; T i; }`, so one has quite some leverage to be creative in implementations.
HDF5, ADIOS1, and ADIOS2 support at least `complex<float>` and `complex<double>`, which is very good coverage. And for JSON we just write them out as an integral pair type.
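For JSON, one plausible pair encoding (a sketch only, not necessarily the exact layout the openPMD-api JSON backend settled on) would serialize each value as a `[real, imaginary]` pair:

```python
import json
import numpy as np

z = np.array([1.+2.j, 3.+4.j])
payload = {'mydata': [[v.real, v.imag] for v in z]}  # one [re, im] pair per value
print(json.dumps(payload))
# {"mydata": [[1.0, 2.0], [3.0, 4.0]]}
```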
> Also, with your example above, do you have non-h5py code that can read this in? I would be curious to see this (or how to write this using non-h5py code).
I am trying to draft this as an example implementation in openPMD-api right now. It's similar to how we (and `h5py`) already handle bools in HDF5: one checks that it is a compound type and that the label names for the components are as defined above. (Note: `h5py` bools are, by a similar convention, labeled enums.)
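A sketch of what such a check could look like on the numpy level; the helper name is hypothetical:

```python
import numpy as np

def looks_like_h5py_complex(dt):
    """Heuristic: a compound of two identical floats labeled 'r' and 'i'."""
    return (
        dt.names == ('r', 'i')
        and dt['r'] == dt['i']
        and dt['r'].kind == 'f'
    )

print(looks_like_h5py_complex(np.dtype([('r', '<f8'), ('i', '<f8')])))  # True
print(looks_like_h5py_complex(np.dtype('<f8')))                         # False
```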
@ax3l
> We intentionally do not try to re-invent file formats, and a complex is literally just a `value[2]` or equivalently `struct{ T r; T i; }`, so one has quite some leverage to be creative in implementations.
Actually I would say that the purpose of any standard is to improve portability at the expense of creativity.
So how about this for the standard: 1) If a file format supports complex, that is to be used. 2) If a file format does not support complex but supports structs, then store complex as a struct with component labels "r" and "i". 3) If a file format supports neither complex nor structs, then store complex numbers in two sub-datasets labeled "r" and "i".
> HDF5, ADIOS1, and ADIOS2 support at least `complex<float>` and `complex<double>`, which is very good coverage.
Hmmm. As far as I know HDF5 does not support complex numbers. Do you have a reference for this?
Yes, there are ways to express complex numbers in HDF5 which are broadly enough supported to be used (aka `h5py`). I'll show you an example implementation with the HDF5 C API in https://github.com/openPMD/openPMD-api/pull/639, but I will need a bit more time to work on this and just started yesterday. It's literally just the creation of an HDF5 compound type of two identical floats, with the two labels `r` and `i` as shown here. It will look a bit like this.
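Sketched with h5py's low-level API, which wraps the same HDF5 C calls (`H5Tcreate`/`H5Tinsert`) the linked PR would use:

```python
import h5py

# a compound of two identical float64s: "r" at offset 0, "i" at offset 8,
# 16 bytes total, with no padding
tid = h5py.h5t.create(h5py.h5t.COMPOUND, 16)
tid.insert(b'r', 0, h5py.h5t.IEEE_F64LE)
tid.insert(b'i', 8, h5py.h5t.IEEE_F64LE)
```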
Update: HDF5 types added (look for isolated commit "HDF5: Complex Support").
Why not specify fallbacks: it complicates the parsing a bit, which makes auto-upgrades harder in the future and slows us down. We do not really have a format right now that does not support complex numbers, nor a workflow that is blocked by using a non-scientific data format that would need that. Therefore, better to leave fallbacks unstandardized, because we should not standardize things we do not already have concrete, working examples for. KISS principle.
@ax3l
Actually it does not look like the h5py way is broadly supported. For example see: https://mathematica.stackexchange.com/questions/63983/hdf5-and-complex-numbers
> HDF5 does not directly support complex numbers. Programs that do seem to be able to export complex numbers (like armadillo) to HDF5 will in reality split them into real and imaginary part and use their own non-standard convention for storing these. This means that while they can sometimes read back their own data, there is no compatibility between different software regarding storing complex values in HDF5.
So, to be portable, the standard needs to lay out some guidelines on how complex numbers are to be stored. How would you word this in the standard? How about something like: 1) If a file format supports complex, that is to be used. 2) For HDF5, be compatible with h5py and store complex as a struct with component labels "r" and "i". 3) For other formats, contact the openPMD maintainers. :-)
Another thing: I don't see how the fallbacks complicate parsing. Can you provide an example?
@ax3l @ChristopherMayes
Another issue I thought of: in the case where, say, the imaginary component is zero, if you follow the h5py way then you double the storage size needed compared to using two sub-datasets. Is this penalty really worth it?
> How about something like: 1) If a file format supports complex, that is to be used. 2) For HDF5, be compatible with h5py and store complex as a struct with component labels "r" and "i". 3) For other formats, contact the openPMD maintainers. :-)
I think generally we can add a note for file formats. Such as "HDF5 bool and complex type representations must be h5py compatible."
> For other formats, contact the openPMD maintainers. :-)
I think that's generally the easiest way: for other formats, contact the reference implementation in `openPMD-api` and/or open a public issue.
> Another thing: I don't see how the fallbacks complicate parsing. Can you provide an example?
If we implement sub-groups then we have to implement this sub-group parsing in all readers and upgrade scripts, which I try to avoid unless we have a strong use case.
> Another issue I thought of: in the case where, say, the imaginary component is zero, if you follow the h5py way then you double the storage size needed compared to using two sub-datasets. Is this penalty really worth it?
I think there are several solutions to that without a premature optimization that complicates the standard. First, if a user knows an openPMD record component is purely real (or imaginary), then they can use a real data type and name the component accordingly. Second, any light-weight compression that can be opaquely added on the file-format layer (HDF5, ADIOS1/2) will strongly compress zeroed-out components. This falls under premature optimization, I think.
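For example, in h5py such a format-level filter is a single extra keyword; repetitive data with all-zero imaginary halves then compresses well on disk (a sketch, with a hypothetical file and dataset name):

```python
import h5py
import numpy as np

z = np.full(1_000_000, 1.0 + 0.0j)  # purely real data: imaginary halves all zero

with h5py.File('compressed.h5', 'w') as f:
    # opaque, file-format-level compression; the zeroed imaginary parts
    # shrink to a small fraction of their raw size
    f.create_dataset('mydata', data=z, compression='gzip')
```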
(Sorry, clicked wrong button.)
@ax3l @ChristopherMayes
OK, but let me write something first since I have a new proposal...
@ax3l @ChristopherMayes
Upon further reflection I propose this for the standard:
Complex data types are to be stored in a group with datasets (or constant record components if the values are constant) labeled "re" for the real part and "im" for the imaginary part. Using "re" and "im" storage is mandated even if the storage format supports native complex numbers. (A minimal sketch follows the list below.)
Motivation. The advantages are:
- There is no degradation in portability.
- Simplicity.
- Using "re" and "im" simplifies programming since reading or writing complex data is equivalent to reading or writing two real data arrays. Since reading or writing real data must be implemented in any case, the extra work in implementing reading/writing complex is just a few lines (essentially two lines for reading and two lines for writing).
- Data storage is minimized if all the real and/or imaginary parts are the same.
- We don't have to worry about new formats and how to handle them.
- No fallbacks to complicate things.
- No worrying about whether there is opaque storage-size optimization of complex data types.
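As referenced above, a minimal h5py sketch of the proposed layout; the record path and file name are hypothetical:

```python
import h5py
import numpy as np

z = np.array([1.+2.j, 3.+4.j])

with h5py.File('proposal.h5', 'w') as f:
    g = f.create_group('particles/bunch/polarization')  # hypothetical record path
    g['re'] = z.real  # real part
    g['im'] = z.imag  # imaginary part
    # if, say, the imaginary part were constant, "im" could instead be
    # stored as a constant record component
```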
As reasoned above, we do not have a file format that does not support complex in our production workflows. Due to the reasons above (maintenance, reference implementations, upgrades, etc.) I would not standardize this until someone has a concrete example workflow where this is needed.
You are of course welcome to provide PRs to the reference implementations (viewer, API, validator scripts, upgrader, VisIt, ...) with concrete challenges that cannot be addressed otherwise, and we can move the priority up. But I do not see the need right now with the tools and file formats we are using. I think this is a file-format task, and we intentionally rely on file formats for such things, because it's the scalable and maintainable thing to do.
We can have a VC in the new year if you would like to talk more about this.
> As reasoned above, we do not have a file format that does not support complex in our production workflows.
This is not true. Let me quote again from https://mathematica.stackexchange.com/questions/63983/hdf5-and-complex-numbers
> HDF5 does not directly support complex numbers. Programs that do seem to be able to export complex numbers (like armadillo) to HDF5 will in reality split them into real and imaginary part and use their own non-standard convention for storing these. This means that while they can sometimes read back their own data, there is no compatibility between different software regarding storing complex values in HDF5.
In terms of maintenance, reference implementations, upgrades, etc., my proposal is actually superior to using a struct since there is less code to write.
I agree with @DavidSagan's post above on labeling the 're' and 'im' datasets.