Use a two-step Fortran and C++ dependency scanner
This series does a bit of cleanup and optimization as well, but the main goal is to have a scanner that outputs P1689r5-compatible JSON scanning information, and then a second step that reads in multiple JSON files and produces a dyndep.
This has a couple of advantages:
- the JSON format is what both MSVC and Clang (through clang-scan-deps) produce for C++ modules, which means that we can reuse the same accumulator for our internal scanner and for MSVC and clang-scan-deps
- by splitting the steps up we provide the JSON to dependees, which allows for more accurate dependency information than we can easily generate at configure time and for greater parallelism (see the sketch after this list)
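To make the shape of the hand-off concrete, here is a minimal sketch of the kind of per-TU record the first step writes out. The file names and the exact key set are illustrative (abbreviated from P1689r5 as I understand it), not the code in this series:

```python
import json

# One TU's scan result: src8.cxx provides module 'src8' and imports 'src9'.
# Keys follow the P1689r5 "rules" layout, trimmed to what the accumulator
# needs (primary output, provided modules, required modules).
scan_result = {
    'version': 1,
    'revision': 0,
    'rules': [{
        'primary-output': 'src8.cxx.o',
        'provides': [{'logical-name': 'src8', 'source-path': 'src8.cxx'}],
        'requires': [{'logical-name': 'src9'}],
    }],
}

with open('src8.cxx.o.json', 'w') as f:
    json.dump(scan_result, f, indent=2)
```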
This is really about getting to the point of having reliable C++ module support for both MSVC and Clang using their provided tools (GCC has gone down a very strange path that differs from MSVC, Clang, and Gfortran), and this series is a clean first step toward that.
The downside of two-phase scanning is that you need to invoke two processes per source file just to do the scanning. This is ... unfortunate. I'm working on a blog post/project that I hope will prompt at least some sort of tooling change. I'd like to get that out before thinking about merging this. Hopefully it only takes a few days.
I've read both of your blog posts, this one and the one about the dynamic command lines. Clang does have the option you want, it's called -fprebuilt-module-path. I started working on using that with the clang-scan-deps work I'm doing.
Second, Meson's scanner is actually broken, since it relies on hacks that are emitted at configure time (although it looks like the hack hasn't been implemented for C++ yet). This can easily be demonstrated by hacking the module test to put src9.cxx in a static library and then attempting to build gcc/modtest.p/src8.cxx; with my patches here it correctly attempts to compile src9.cxx first to generate its ifc file (although with gcc 12 it runs into the "inputs can't also have inputs" problem). The hack works by giving each target an order-only dependency on the targets it links with. I've removed the Fortran hack as well in this series. But thinking about it, I'm not sure that even that is sufficient, since you could add another module to an existing source file and we wouldn't properly wait on that, so it actually needs to be a full dependency.
Third, to accurately get this information and have the order be correct, you must read the module outputs of all of the dependencies of a target and account for those, so that the bmi (I'm going to call it that from now on) is up to date. You also need to recalculate your dynamic outputs in that event; this is also necessary if a new module name is added. You can do this in one step, but the dyndep doesn't actually have enough information for it, so you'd have to have a sideband (i.e., generate the JSON and the dyndep in one step).
Having dep information on a per-TU basis has advantages compared to a per-target basis, at the cost of some simplicity. It means that we can start more compilations in parallel, and do less work for incremental builds.
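As a rough illustration of the accumulation step and of why the JSON sideband is needed, here is a sketch (function and path names are made up, and it assumes the per-TU JSON shown earlier): it reads the scan JSON of the target and of all its dependencies, maps module names to bmi paths, and writes a dyndep whose entries make each object depend on the bmis it imports. The dyndep format itself never carries module names, which is exactly the missing information:

```python
import json
from pathlib import Path

def write_dyndep(own_json_files, dep_json_files, bmi_dir, out_dd):
    # Module name -> bmi path, learned from every rule that *provides* a
    # module, in this target and in all of its dependencies.
    bmi = {}
    for jf in list(own_json_files) + list(dep_json_files):
        for rule in json.loads(Path(jf).read_text())['rules']:
            for p in rule.get('provides', []):
                bmi[p['logical-name']] = f"{bmi_dir}/{p['logical-name']}.pcm"

    lines = ['ninja_dyndep_version = 1']
    for jf in own_json_files:
        for rule in json.loads(Path(jf).read_text())['rules']:
            obj = rule['primary-output']
            outs = ' '.join(bmi[p['logical-name']] for p in rule.get('provides', []))
            ins = ' '.join(bmi[r['logical-name']] for r in rule.get('requires', [])
                           if r['logical-name'] in bmi)
            entry = f'build {obj}'
            if outs:
                entry += f' | {outs}'       # implicit outputs: the bmis produced
            entry += ': dyndep'
            if ins:
                entry += f' | {ins}'        # implicit inputs: the bmis imported
            lines.append(entry)
    Path(out_dd).write_text('\n'.join(lines) + '\n')
```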
So, we're basically making tradeoffs:
- with the current one-step approach we must make each target have a full dependency on the previous target (currently it's order-only, but that isn't sufficient, and I can write a test to show that). This could create a significant bottleneck
- with the per-TU approach in my series we'd run more processes, but we only scan source files when they've actually changed, only accumulate the JSON data into a dyndep when the JSON data has changed, and get accurate per-TU information, which allows more work to be done in parallel. We also do less work on incremental builds, since we only need to scan and rebuild targets that have actually changed.
I've added the necessary order-only -> full dep changes, but I haven't gotten to the tests yet. They'll require a Python unittest, because we're going to have to compile -> change the source -> recompile a specific target to prove that the bug is fixed, and I'm out of time for today.
And I just realized that there's another case that has to be handled here: recursive dependencies. After a quick test I found that that too is broken.
Clang does have the option you want, it's called -fprebuilt-module-path. I started working on using that with the clang-scan-deps work I'm doing.
Yes, but does it have a compiler flag for "write module files, whatever their name, in directory X"? All the examples that I see do not have that and instead use an explicit output file. It is needed to avoid having to generate command line arguments during compilation. That is the biggest problem FWICT; the others we can fix and/or work around.
Second, Meson's scanner is actually broken,
I would have been amazed if it were not. :D I only implemented enough of it to get things going. Sadly they have been stagnant for a few years.
with the current one-step approach we must make each target have a full dependency on the previous target
Assuming a target A that depends on B, we only need to add a dependency of the type "all compilations of A must wait until all compilations of B are done". You'd need a pseudo-target for this, but it's doable. The delay would be at most the longest compilation in A, and on average a fraction of that. Is that too much? Maybe. Maybe not. I'm hesitant to throw around performance estimates without actual measurements.
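For what it's worth, a minimal sketch of that pseudo-target (the rule and target names here are invented): a phony edge collects every object of B, and each compile of A depends on it. `||` would make it order-only; per the discussion above it probably has to be a real dependency:

```python
def add_compile_barrier(ninja_lines, a_compiles, b_name, b_objects):
    """a_compiles: list of (object, source) pairs for target A.
    b_objects: every object file produced by target B."""
    barrier = f'{b_name}_compiles_done'
    # Pseudo-target that is "done" once every compilation of B is done.
    ninja_lines.append(f'build {barrier}: phony {" ".join(b_objects)}')
    for obj, src in a_compiles:
        # '| barrier' is an implicit dependency; '|| barrier' would be
        # order-only, which the thread argues is not sufficient.
        ninja_lines.append(f'build {obj}: cxx_compile {src} | {barrier}')
```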
Yes, but does it have a compiler flag for "write module files, whatever their name, in directory X"? All the examples that I see do not have that and instead use an explicit output file. It is needed to avoid having to generate command line arguments during compilation. That is the biggest problem FWICT; the others we can fix and/or work around.
I just did:

```sh
echo 'export module speech; export const char * say() { return "Hello, World!"; }' > speech.cppm
mkdir priv
clang++ -std=c++20 -fmodule-output -c speech.cppm -o priv
ls priv
```
and it wrote out a test.o and a speech.pcm. Which is the sanest implementation of the big 3 (with GCC's plan being absolutely bonkers). We could pretty easily have a rule of "if the name of your file == the name of your exported module, you're good to go; otherwise you must pass the cpp_module_name parameter and Meson will map that to -fmodule-output=priv/<cpp_module_name>.pcm", which is not perfect, but makes the common case obvious (foo.cpp has the module foo) and isn't too bad in the less common case.
This is much closer to what we actually want than GCC's server plan, or than MSVC, which requires you to declare whether a source file exports an interface or a partition. AFAICT Clang might have the only implementation with which we can easily get non-trivial examples working without either scanning sources at configure time or dynamically generating command line options at compile time.
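A sketch of how that naming rule could look on the producer side; the function and the cpp_module_name parameter are the proposal above, not existing Meson API:

```python
import os

def clang_module_output_args(src_path, cpp_module_name=None, private_dir='priv'):
    """If the file stem already matches the exported module name, a bare
    -fmodule-output is enough and the .pcm lands next to the object.
    Otherwise the user-supplied cpp_module_name pins the output path, so
    nothing has to be generated on the command line at compile time."""
    stem = os.path.splitext(os.path.basename(src_path))[0]
    if cpp_module_name is None or cpp_module_name == stem:
        return ['-fmodule-output']
    return ['-fmodule-output=' + os.path.join(private_dir, cpp_module_name + '.pcm')]

# e.g. clang_module_output_args('speech.cppm')        -> ['-fmodule-output']
#      clang_module_output_args('impl.cpp', 'speech') -> ['-fmodule-output=priv/speech.pcm']
```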
We could pretty easily have a rule of "if the name of your file == the name of your exported module, you're good to go; otherwise you must pass the cpp_module_name parameter and Meson will map that to -fmodule-output=priv/<cpp_module_name>.pcm", which is not perfect, but makes the common case obvious (foo.cpp has the module foo) and isn't too bad in the less common case.
But in that case cpp_module_name would need to be per-file, right? One would imagine it to be pretty common that one target has many source files, each of which provides one module, with most of them being internal modules that are not exported (though, obviously, there are going to be libs that export multiple module files).
Right, that would need to be per file rather than per target. Which, I admit, is not ideal, but at least it's closer than what GCC and MSVC have, and the consumer side is what we'd like.
I ran the last 4 commits through time to see what the actual run time for the "8 module names" test is, to demonstrate the difference in compile time. I ran `rm -rf builddir; ../../../meson.py setup builddir --wipe &> /dev/null; time ninja -C builddir` at each step, and took the best of 3 runs:
- "Fortran targets need to -I...": ninja -C builddir 0.58s user 0.12s system 123% cpu 0.563 total
- "depfile generation needs ...": ninja -C builddir 0.60s user 0.11s system 125% cpu 0.566 total
- "fix cross module dependencies": ninja -C builddir 0.57s user 0.12s system 99% cpu 0.692 total
- "use a two step process ...": ninja -C builddir 0.82s user 0.16s system 188% cpu 0.515 total
What I found was that the two-step process consistently took less overall time and had higher CPU utilization. This seems consistent with the idea that ninja is able to do better scheduling, since it doesn't have to wait until any linking is done, only for each .mod or .smod file to be produced before starting more work.