hdf5 icon indicating copy to clipboard operation
hdf5 copied to clipboard

Something is still introspecting `sizeof (long double)` at compile time, breaking macOS Universal Binary builds

Open seanm opened this issue 4 years ago • 45 comments

Even after I made various fixes, I suspect something is still introspecting sizeof (long double) at compile time.

The problem occurs when I build HDF5 as a Universal Binary on macOS (that is, for both x86_64 and arm64 architectures). On x86_64 sizeof (long double) is 16. On arm64 it is 8. If I build on and Intel Mac, then run the resulting library on a arm64 Mac, it fails at runtime here:

https://github.com/HDFGroup/hdf5/blob/3707675c0376e3866bc6eb7b926c9e7e2f9b3bec/src/H5Tconv.c#L735-L736

which evaluates to 4 != 4 || 16 != 8. i.e. dt->shared->size is 16 when it should be 8. I suspect this is because the machine that built the library is x86_64 where long double is 16.

So far I haven't found where this could be happening... could anyone help point me where it might be coming from?

Thanks.

seanm avatar Apr 21 '21 17:04 seanm

I made some progress. H5detect creates a file H5Tinit.c and its contents differ based on build machine introspection.

seanm avatar Apr 21 '21 18:04 seanm

I made some progress. H5detect creates a file H5Tinit.c and its contents differ based on build machine introspection.

Yes, that is its purpose. :-)

qkoziol avatar Apr 22 '21 19:04 qkoziol

Indeed. :)

So I guess there are two ideas:

  1. Re-evalutate the whole idea of introspecting the build machine. Such introspection of course makes cross-compilation hard/impossible (and Universal Binaries are basically cross-compiling). Perhaps build-time introspection can be replaced with compile-time instead.

  2. Detect that there are two architectures listed in CMAKE_OSX_ARCHITECTURES and create two H5Tinit.c files. Or similarly create a single H5Tinit.c that has preprocessor switches that give a final result like:

#if __x86_64__
    dt->shared->type = H5T_FLOAT;
    dt->shared->size = 16;
#elif __arm64__
    dt->shared->type = H5T_FLOAT;
    dt->shared->size = 8;
#endif

My feeling is that 1 would be better, but 2 would be easier. Thoughts?

seanm avatar Apr 22 '21 20:04 seanm

Is H5detect the only thing that introspects the building machine?

As a test, I modified H5Tinit.c by hand to have the #if / #elif as above, and now everything seems to be finally working for me as a Universal Binary. So this at least seems to be my last problem!

seanm avatar Apr 27 '21 22:04 seanm

Sorry for the delay, this has been a tough week. :-/

H5detect is indeed the only introspection tool that HDF5 runs. We do support cross-compiling on other systems, usually by running H5detect by hand and then including the generated H5Tinit.c in the build. I definitely would like to avoid putting #ifdef's into H5Tinit.c, so we should explore something else, like how the introspection is performed. I don't think that compile-time introspection will perform well, but it might be OK, with some judicious use of function pointers, etc.

qkoziol avatar Apr 30 '21 22:04 qkoziol

We do support cross-compiling on other systems, usually by running H5detect by hand and then including the generated H5Tinit.c in the build.

That's essentially what I've done for the moment. Except in the case of Universal Binaries, it's not one H5Tinit.c file but two. I basically took the two of them, and pasted them together into one file with this structure:

#if __x86_64__
    contents of H5Tinit.c for Intel
#elif __arm64__
    contents of H5Tinit.c for ARM
#endif

(The files are identical except for the long double parts.)

H5detect is indeed the only introspection tool that HDF5 runs.

OK, good. And indeed I haven't found any other issues after hacking past H5Tinit.c.

I don't think that compile-time introspection will perform well, but it might be OK, with some judicious use of function pointers, etc.

Did you mean 'run-time' in that sentence? I'm not proposing run-time checks, but rather compile-time. Something analogous to parts of https://github.com/HDFGroup/hdf5/pull/470/commits/9403e7d47d1e2878d6a4a1a9b7b84caa6df3ab8f ex: why build an executable to calculate H5_SIZEOF_UINT32_T when you can just use sizeof(uint32_t) directly and let the compiler evaluate it at compile-time? Maybe something similar could be done to eliminate the need for H5Tinit.c entirely...

seanm avatar May 03 '21 18:05 seanm

@qkoziol any thoughts over the months how to deal with this?

seanm avatar Aug 13 '21 18:08 seanm

@qkoziol is it possible to contract HDFGroup to fix this? i.e. I'm willing to pay, depending on price of course.

seanm avatar Oct 01 '21 16:10 seanm

@seanm I'll bring this up with developers. No payment is needed but good ideas are :-)

epourmal avatar Oct 25 '21 12:10 epourmal

@epourmal thanks!

@byrnHDF I just checked, and __aarch64__ is also defined on Mac (as is __arm64__), so that code in H5pubconf.h.in is fine and won't benefit from any __aarch64__ to __arm64__ change).

It's been months since I looked at this in detail, but re-reading things here, the issue is more with the H5Tinit.c file created by H5detect. H5detect assumes there is one answer for questions like "how big is long double?" and it puts that one answer into H5Tinit.c.

seanm avatar Oct 25 '21 15:10 seanm

On Mon, Oct 25, 2021 at 08:40:06AM -0700, Sean McBride wrote:

@epourmal thanks!

@byrnHDF I just checked, and __aarch64__ is also defined on Mac (as is __arm64__), so that code in H5pubconf.h.in is fine and won't benefit from any __aarch64__ to __arm64__ change).

It's been months since I looked at this in detail, but re-reading things here, the issue is more with the H5Tinit.c file created by H5detect. H5detect assumes there is one answer for questions like "how big is long double?" and it puts that one answer into H5Tinit.c.

Is there any reason not to perform the type-layout detection during library initialization instead of during the build? In that way detection will be guaranteed to run on the target architecture.

H5detect does not appear to derive any information that cannot be found out during target configuration (endianness), compilation (alignment), or else from processor manuals (floating-point storage layout). The run-time detection may be useful as a fallback, but it's risky because it may invoke undefined behavior.

Dave

gnuoyd avatar Oct 25 '21 17:10 gnuoyd

Is there any reason not to perform the type-layout detection during library initialization instead of during the build? In that way detection will be guaranteed to run on the target architecture.

As you say, that should work. I guess an argument against it is that it would be doing at runtime something that can be known at compile time. But as it would be run once, it's not such a waste.

H5detect does not appear to derive any information that cannot be found out during target configuration (endianness), compilation (alignment), or else from processor manuals (floating-point storage layout). The run-time detection may be useful as a fallback, but it's risky because it may invoke undefined behavior.

Indeed I recall H5detect performing undefined behaviour as flagged by UBSan. IMHO, ideally, H5detect should just go away.

seanm avatar Oct 25 '21 17:10 seanm

Happy New Year all!

Any further thoughts on this issue?

seanm avatar Jan 06 '22 16:01 seanm

On Thu, Jan 06, 2022 at 08:55:40AM -0800, Sean McBride wrote:

Happy New Year all!

Any further thoughts on this issue?

I think that h5detect should be replaced entirely, as you mentioned last year. In the New C99 Order, type size and natural alignment should be discoverable at compile time. The location and width of floating-point fields (sign, mantissa, exponent) for each target architecture can be derived from processor manuals and added to a table that's selected at configure/compile time. Endianness can be detected at configure/compile time, too. What am I forgetting?

Dave

gnuoyd avatar Jan 06 '22 17:01 gnuoyd

I think that h5detect should be replaced entirely

Agreed. Who can do this? I'm (well, my employer is) still willing to contribute some $ to get it done.

In the New C99 Order, type size and natural alignment should be discoverable at compile time. The location and width of floating-point fields (sign, mantissa, exponent) for each target architecture can be derived from processor manuals and added to a table that's selected at configure/compile time. Endianness can be detected at configure/compile time, too. What am I forgetting?

That sounds like everything.

seanm avatar Jan 06 '22 17:01 seanm

This may not work for cross-compiles+cross-execution environment situations, where the build system is different from the targeted runtime system. This is the benefit of h5detect - you are not just cross-compiling, but you are actually running the binary in the environment where the application will run.

For example, h5detect can be built with a cross-compile, but then must be submitted as an actual job to the HPC job scheduler on some systems, to accurately detect the environment where applications execute. It's not uncommon to have HPC systems where the "login nodes" are different from the "compute nodes".

If there was a reliable way to be certain that this continues working, then I'd be happy to remove h5detect also. :-)

qkoziol avatar Jan 06 '22 17:01 qkoziol

As @gnuoyd says, this could be done at runtime, within an application, but the extra overhead of executing it each time would be frustrating for applications to accept, since it could be done once, at configure time.

qkoziol avatar Jan 06 '22 17:01 qkoziol

As @gnuoyd says, this could be done at runtime, within an application, but the extra overhead of executing it each time would be frustrating for applications to accept, since it could be done once, at configure time.

It would take a trivial amount of runtime to determine such things, wouldn't it?

seanm avatar Jan 06 '22 17:01 seanm

On Thu, Jan 06, 2022 at 09:22:17AM -0800, Quincey Koziol wrote:

This may not work for cross-compiles+cross-execution environment situations, where the build system is different from the targeted runtime system. This is the benefit of h5detect - you are not just cross-compiling, but you are actually running the binary in the environment where the application will run.

It's not necessary to run h5detect. It is necessary to cross-compile for the right target.

If the floating-point format is XYZ in the runtime environment, and I cross-compiled for a target that uses floating-point format PQR, then I chose the wrong target. If the runtime environment does not support an instruction that my binary uses, then I cross-compiled for the wrong target.

Dave

gnuoyd avatar Jan 06 '22 18:01 gnuoyd

No - the execution environment (i.e. what happens when a floating-point exception occurs, etc) can be different.

qkoziol avatar Jan 06 '22 18:01 qkoziol

On Thu, Jan 06, 2022 at 10:56:36AM -0800, Quincey Koziol wrote:

No - the execution environment (i.e. what happens when a floating-point exception occurs, etc) can be different.

Please give a concrete example.

Dave

gnuoyd avatar Jan 06 '22 19:01 gnuoyd

I'm inclined to agree with @gnuoyd. It seems to me whatever compiler you are using needs to know intimate details of what it's compiling for, whether cross compiling or not.

But to bring this back to the core issue... fixing the sizeof(long double) case would be sufficient for me. It seems to be the only thing different between Intel vs ARM Mac. Getting rid of h5detect entirely would be a bonus, but not necessary.

@qkoziol can you think of any example where a compiler doesn't know, at compile-time, sizeof(long double) for the target CPU?

seanm avatar Jan 06 '22 19:01 seanm

Please give a concrete example.

Actually, this issue needs to work the opposite way - the new way needs to prove that it will work in all existing cases. The old way doesn't need to prove anything, it already works correctly for all existing cases.

If you'd like to find out what those are, @brtnfld, Suren Byna (@ LBNL) and the DOE weapons lab people (like Lee Ward, Mark Miller, et al) would be the people to talk to about cases where cross-execution of h5detect is required.

qkoziol avatar Jan 06 '22 20:01 qkoziol

@qkoziol can you think of any example where a compiler doesn't know, at compile-time, sizeof(long double) for the target CPU?

Agree here, yes.

qkoziol avatar Jan 06 '22 20:01 qkoziol

The old way doesn't need to prove anything, it already works correctly for all existing cases.

Uh, no, the old way does not work for Universal Binary builds on macOS, which is the topic of this ticket :)

Agree here, yes.

OK great. (Indeed I don't see how a compiler could generate assembly for long double use if it didn't know how big it was!)

So let's focus on this then. Can the subset of configure-time introspection that looks at the size of long double be excised? (Removal of other configure-time introspection can be the topic of some other ticket.)

seanm avatar Jan 06 '22 20:01 seanm

Agree here, yes.

OK great. (Indeed I don't see how a compiler could generate assembly for long double use if it didn't know how big it was!)

So let's focus on this then. Can the subset of configure-time introspection that looks at the size of long double be excised? (Removal of other configure-time introspection can be the topic of some other ticket.)

Agree - focus on fixing this problem separately from the h5detect issue.

qkoziol avatar Jan 06 '22 20:01 qkoziol

On Thu, Jan 06, 2022 at 10:56:36AM -0800, Quincey Koziol wrote:

No - the execution environment (i.e. what happens when a floating-point exception occurs, etc) can be different.

The configuration of floating-point exceptions does not appear to make a material difference to the H5Tinit.c that is generated.

I see h5detect trying to determine the machine's idea of each type's native alignment using experimental accesses. The behavior of those accesses is undefined in C: it's possible for a C compiler just to elide them or to do other nasty things. So the experimental results are not, strictly speaking, even reliable.

For the sake of discussion, call the type alignment found out by experiment the "machine alignment." The machine alignment is recorded in H5Tinit.c. Here and there the library uses a type's machine alignment to decide whether to load/store an unaligned instance "directly" or to bounce it through an aligned buffer. E.g., in macro H5T_CONV in H5Tconv.c:

                /* Is alignment required for source or dest? */ 
                         \
                s_mv = H5T_NATIVE_##STYPE##_ALIGN_g > 1 && \
                       ((size_t)buf % H5T_NATIVE_##STYPE##_ALIGN_g || \
                        /* Cray */ ((size_t)((ST *)buf) != (size_t)buf) || \
                        (size_t)s_stride % H5T_NATIVE_##STYPE##_ALIGN_g); \
                d_mv = H5T_NATIVE_##DTYPE##_ALIGN_g > 1 && \
                       ((size_t)buf % H5T_NATIVE_##DTYPE##_ALIGN_g || \
                        /* Cray */ ((size_t)((DT *)buf) != (size_t)buf) || \
                        (size_t)d_stride % H5T_NATIVE_##DTYPE##_ALIGN_g); \

Thsoe unaligned accesses, too, invoke undefined behavior.

Are there any grounds for bouncing each access through an aligned buffer using memcpy---not H5MM_memcpy---and letting the compiler sweat the details.

Dave

gnuoyd avatar Jan 06 '22 22:01 gnuoyd

I have been trying to make arm64 wheels for pytables/h5py. To do that I initially made fat wheels by compiling hdf5 fat binaries but ran into issues.

Instead I focused on cross-compiling just for arm64. I was unable to do that neither with camke or configure and neither for 1.12.1 or 1.13.1. I got various errors, but basically it came down to something like in this issue where it was trying to run arm64 code of x86 or it (cmake) recognized it was cross-compiling and that it was not going to be able to run so it errored out.

I was only able to cross-compile after applying these patches from conda-forge and setting the environment variables as shown there. It would be great if these could be applied somehow for 1.14!

matham avatar Mar 18 '22 05:03 matham

So at least for CMake, those patches just codify the process for cross-compiling. I will create a doc based on the small section VI in the release_docs/README_HPC file. 1. HDF5_USE_PREGEN. This option, along with the HDF5_USE_PREGEN_DIR CMake variable would allow the use of an appropriate H5Tinit.c file with type information generated on a compute node to be used when cross compiling for those compute nodes.

I am thinking of creating a folder under config named "crosscompile" with pre-generated H5Tinit.c files.

I would think we could figure out how to get autotools/configure to the same.

byrnHDF avatar Mar 18 '22 13:03 byrnHDF

There are a few PRs coming that should help with cross-compiling:

https://github.com/HDFGroup/hdf5/pull/1400

https://github.com/HDFGroup/hdf5/pull/1426

seanm avatar Mar 18 '22 14:03 seanm