hdf5
hdf5 copied to clipboard
Something is still introspecting `sizeof (long double)` at compile time, breaking macOS Universal Binary builds
Even after I made various fixes, I suspect something is still introspecting sizeof (long double) at compile time.
The problem occurs when I build HDF5 as a Universal Binary on macOS (that is, for both x86_64 and arm64 architectures). On x86_64 sizeof (long double) is 16. On arm64 it is 8. If I build on and Intel Mac, then run the resulting library on a arm64 Mac, it fails at runtime here:
https://github.com/HDFGroup/hdf5/blob/3707675c0376e3866bc6eb7b926c9e7e2f9b3bec/src/H5Tconv.c#L735-L736
which evaluates to 4 != 4 || 16 != 8. i.e. dt->shared->size is 16 when it should be 8. I suspect this is because the machine that built the library is x86_64 where long double is 16.
So far I haven't found where this could be happening... could anyone help point me where it might be coming from?
Thanks.
I made some progress. H5detect creates a file H5Tinit.c and its contents differ based on build machine introspection.
I made some progress.
H5detectcreates a fileH5Tinit.cand its contents differ based on build machine introspection.
Yes, that is its purpose. :-)
Indeed. :)
So I guess there are two ideas:
-
Re-evalutate the whole idea of introspecting the build machine. Such introspection of course makes cross-compilation hard/impossible (and Universal Binaries are basically cross-compiling). Perhaps build-time introspection can be replaced with compile-time instead.
-
Detect that there are two architectures listed in
CMAKE_OSX_ARCHITECTURESand create twoH5Tinit.cfiles. Or similarly create a singleH5Tinit.cthat has preprocessor switches that give a final result like:
#if __x86_64__
dt->shared->type = H5T_FLOAT;
dt->shared->size = 16;
#elif __arm64__
dt->shared->type = H5T_FLOAT;
dt->shared->size = 8;
#endif
My feeling is that 1 would be better, but 2 would be easier. Thoughts?
Is H5detect the only thing that introspects the building machine?
As a test, I modified H5Tinit.c by hand to have the #if / #elif as above, and now everything seems to be finally working for me as a Universal Binary. So this at least seems to be my last problem!
Sorry for the delay, this has been a tough week. :-/
H5detect is indeed the only introspection tool that HDF5 runs. We do support cross-compiling on other systems, usually by running H5detect by hand and then including the generated H5Tinit.c in the build. I definitely would like to avoid putting #ifdef's into H5Tinit.c, so we should explore something else, like how the introspection is performed. I don't think that compile-time introspection will perform well, but it might be OK, with some judicious use of function pointers, etc.
We do support cross-compiling on other systems, usually by running H5detect by hand and then including the generated H5Tinit.c in the build.
That's essentially what I've done for the moment. Except in the case of Universal Binaries, it's not one H5Tinit.c file but two. I basically took the two of them, and pasted them together into one file with this structure:
#if __x86_64__
contents of H5Tinit.c for Intel
#elif __arm64__
contents of H5Tinit.c for ARM
#endif
(The files are identical except for the long double parts.)
H5detect is indeed the only introspection tool that HDF5 runs.
OK, good. And indeed I haven't found any other issues after hacking past H5Tinit.c.
I don't think that compile-time introspection will perform well, but it might be OK, with some judicious use of function pointers, etc.
Did you mean 'run-time' in that sentence? I'm not proposing run-time checks, but rather compile-time. Something analogous to parts of https://github.com/HDFGroup/hdf5/pull/470/commits/9403e7d47d1e2878d6a4a1a9b7b84caa6df3ab8f ex: why build an executable to calculate H5_SIZEOF_UINT32_T when you can just use sizeof(uint32_t) directly and let the compiler evaluate it at compile-time? Maybe something similar could be done to eliminate the need for H5Tinit.c entirely...
@qkoziol any thoughts over the months how to deal with this?
@qkoziol is it possible to contract HDFGroup to fix this? i.e. I'm willing to pay, depending on price of course.
@seanm I'll bring this up with developers. No payment is needed but good ideas are :-)
@epourmal thanks!
@byrnHDF I just checked, and __aarch64__ is also defined on Mac (as is __arm64__), so that code in H5pubconf.h.in is fine and won't benefit from any __aarch64__ to __arm64__ change).
It's been months since I looked at this in detail, but re-reading things here, the issue is more with the H5Tinit.c file created by H5detect. H5detect assumes there is one answer for questions like "how big is long double?" and it puts that one answer into H5Tinit.c.
On Mon, Oct 25, 2021 at 08:40:06AM -0700, Sean McBride wrote:
@epourmal thanks!
@byrnHDF I just checked, and
__aarch64__is also defined on Mac (as is__arm64__), so that code inH5pubconf.h.inis fine and won't benefit from any__aarch64__to__arm64__change).It's been months since I looked at this in detail, but re-reading things here, the issue is more with the
H5Tinit.cfile created byH5detect.H5detectassumes there is one answer for questions like "how big islong double?" and it puts that one answer intoH5Tinit.c.
Is there any reason not to perform the type-layout detection during library initialization instead of during the build? In that way detection will be guaranteed to run on the target architecture.
H5detect does not appear to derive any information that cannot be found out during target configuration (endianness), compilation (alignment), or else from processor manuals (floating-point storage layout). The run-time detection may be useful as a fallback, but it's risky because it may invoke undefined behavior.
Dave
Is there any reason not to perform the type-layout detection during library initialization instead of during the build? In that way detection will be guaranteed to run on the target architecture.
As you say, that should work. I guess an argument against it is that it would be doing at runtime something that can be known at compile time. But as it would be run once, it's not such a waste.
H5detect does not appear to derive any information that cannot be found out during target configuration (endianness), compilation (alignment), or else from processor manuals (floating-point storage layout). The run-time detection may be useful as a fallback, but it's risky because it may invoke undefined behavior.
Indeed I recall H5detect performing undefined behaviour as flagged by UBSan. IMHO, ideally, H5detect should just go away.
Happy New Year all!
Any further thoughts on this issue?
On Thu, Jan 06, 2022 at 08:55:40AM -0800, Sean McBride wrote:
Happy New Year all!
Any further thoughts on this issue?
I think that h5detect should be replaced entirely, as you mentioned last year. In the New C99 Order, type size and natural alignment should be discoverable at compile time. The location and width of floating-point fields (sign, mantissa, exponent) for each target architecture can be derived from processor manuals and added to a table that's selected at configure/compile time. Endianness can be detected at configure/compile time, too. What am I forgetting?
Dave
I think that h5detect should be replaced entirely
Agreed. Who can do this? I'm (well, my employer is) still willing to contribute some $ to get it done.
In the New C99 Order, type size and natural alignment should be discoverable at compile time. The location and width of floating-point fields (sign, mantissa, exponent) for each target architecture can be derived from processor manuals and added to a table that's selected at configure/compile time. Endianness can be detected at configure/compile time, too. What am I forgetting?
That sounds like everything.
This may not work for cross-compiles+cross-execution environment situations, where the build system is different from the targeted runtime system. This is the benefit of h5detect - you are not just cross-compiling, but you are actually running the binary in the environment where the application will run.
For example, h5detect can be built with a cross-compile, but then must be submitted as an actual job to the HPC job scheduler on some systems, to accurately detect the environment where applications execute. It's not uncommon to have HPC systems where the "login nodes" are different from the "compute nodes".
If there was a reliable way to be certain that this continues working, then I'd be happy to remove h5detect also. :-)
As @gnuoyd says, this could be done at runtime, within an application, but the extra overhead of executing it each time would be frustrating for applications to accept, since it could be done once, at configure time.
As @gnuoyd says, this could be done at runtime, within an application, but the extra overhead of executing it each time would be frustrating for applications to accept, since it could be done once, at configure time.
It would take a trivial amount of runtime to determine such things, wouldn't it?
On Thu, Jan 06, 2022 at 09:22:17AM -0800, Quincey Koziol wrote:
This may not work for cross-compiles+cross-execution environment situations, where the build system is different from the targeted runtime system. This is the benefit of h5detect - you are not just cross-compiling, but you are actually running the binary in the environment where the application will run.
It's not necessary to run h5detect. It is necessary to cross-compile for the right target.
If the floating-point format is XYZ in the runtime environment, and I cross-compiled for a target that uses floating-point format PQR, then I chose the wrong target. If the runtime environment does not support an instruction that my binary uses, then I cross-compiled for the wrong target.
Dave
No - the execution environment (i.e. what happens when a floating-point exception occurs, etc) can be different.
On Thu, Jan 06, 2022 at 10:56:36AM -0800, Quincey Koziol wrote:
No - the execution environment (i.e. what happens when a floating-point exception occurs, etc) can be different.
Please give a concrete example.
Dave
I'm inclined to agree with @gnuoyd. It seems to me whatever compiler you are using needs to know intimate details of what it's compiling for, whether cross compiling or not.
But to bring this back to the core issue... fixing the sizeof(long double) case would be sufficient for me. It seems to be the only thing different between Intel vs ARM Mac. Getting rid of h5detect entirely would be a bonus, but not necessary.
@qkoziol can you think of any example where a compiler doesn't know, at compile-time, sizeof(long double) for the target CPU?
Please give a concrete example.
Actually, this issue needs to work the opposite way - the new way needs to prove that it will work in all existing cases. The old way doesn't need to prove anything, it already works correctly for all existing cases.
If you'd like to find out what those are, @brtnfld, Suren Byna (@ LBNL) and the DOE weapons lab people (like Lee Ward, Mark Miller, et al) would be the people to talk to about cases where cross-execution of h5detect is required.
@qkoziol can you think of any example where a compiler doesn't know, at compile-time,
sizeof(long double)for the target CPU?
Agree here, yes.
The old way doesn't need to prove anything, it already works correctly for all existing cases.
Uh, no, the old way does not work for Universal Binary builds on macOS, which is the topic of this ticket :)
Agree here, yes.
OK great. (Indeed I don't see how a compiler could generate assembly for long double use if it didn't know how big it was!)
So let's focus on this then. Can the subset of configure-time introspection that looks at the size of long double be excised? (Removal of other configure-time introspection can be the topic of some other ticket.)
Agree here, yes.
OK great. (Indeed I don't see how a compiler could generate assembly for
long doubleuse if it didn't know how big it was!)So let's focus on this then. Can the subset of configure-time introspection that looks at the size of
long doublebe excised? (Removal of other configure-time introspection can be the topic of some other ticket.)
Agree - focus on fixing this problem separately from the h5detect issue.
On Thu, Jan 06, 2022 at 10:56:36AM -0800, Quincey Koziol wrote:
No - the execution environment (i.e. what happens when a floating-point exception occurs, etc) can be different.
The configuration of floating-point exceptions does not appear to make a material difference to the H5Tinit.c that is generated.
I see h5detect trying to determine the machine's idea of each type's native alignment using experimental accesses. The behavior of those accesses is undefined in C: it's possible for a C compiler just to elide them or to do other nasty things. So the experimental results are not, strictly speaking, even reliable.
For the sake of discussion, call the type alignment found out by experiment the "machine alignment." The machine alignment is recorded in H5Tinit.c. Here and there the library uses a type's machine alignment to decide whether to load/store an unaligned instance "directly" or to bounce it through an aligned buffer. E.g., in macro H5T_CONV in H5Tconv.c:
/* Is alignment required for source or dest? */
\
s_mv = H5T_NATIVE_##STYPE##_ALIGN_g > 1 && \
((size_t)buf % H5T_NATIVE_##STYPE##_ALIGN_g || \
/* Cray */ ((size_t)((ST *)buf) != (size_t)buf) || \
(size_t)s_stride % H5T_NATIVE_##STYPE##_ALIGN_g); \
d_mv = H5T_NATIVE_##DTYPE##_ALIGN_g > 1 && \
((size_t)buf % H5T_NATIVE_##DTYPE##_ALIGN_g || \
/* Cray */ ((size_t)((DT *)buf) != (size_t)buf) || \
(size_t)d_stride % H5T_NATIVE_##DTYPE##_ALIGN_g); \
Thsoe unaligned accesses, too, invoke undefined behavior.
Are there any grounds for bouncing each access through an aligned buffer using memcpy---not H5MM_memcpy---and letting the compiler sweat the details.
Dave
I have been trying to make arm64 wheels for pytables/h5py. To do that I initially made fat wheels by compiling hdf5 fat binaries but ran into issues.
Instead I focused on cross-compiling just for arm64. I was unable to do that neither with camke or configure and neither for 1.12.1 or 1.13.1. I got various errors, but basically it came down to something like in this issue where it was trying to run arm64 code of x86 or it (cmake) recognized it was cross-compiling and that it was not going to be able to run so it errored out.
I was only able to cross-compile after applying these patches from conda-forge and setting the environment variables as shown there. It would be great if these could be applied somehow for 1.14!
So at least for CMake, those patches just codify the process for cross-compiling. I will create a doc based on the small section VI in the release_docs/README_HPC file.
1. HDF5_USE_PREGEN. This option, along with the HDF5_USE_PREGEN_DIR CMake variable would allow the use of an appropriate H5Tinit.c file with type information generated on a compute node to be used when cross compiling for those compute nodes.
I am thinking of creating a folder under config named "crosscompile" with pre-generated H5Tinit.c files.
I would think we could figure out how to get autotools/configure to the same.
There are a few PRs coming that should help with cross-compiling:
https://github.com/HDFGroup/hdf5/pull/1400
https://github.com/HDFGroup/hdf5/pull/1426