ccache icon indicating copy to clipboard operation
ccache copied to clipboard

Re-add a unify mode using -P

Open Meinersbur opened this issue 4 years ago • 14 comments

This is a proof-of-concept of using the -P preprocessor flag in cpp2 mode (i.e. only use preprocessor output for has lookup) as outlined in https://github.com/ccache/ccache/issues/812#issuecomment-786519591. Since ccache uses line directives inserted by the preprocessor to determine which files are to be hashed in direct mode, we have to get this information another way. It is done by instructing the preprocessor to emit a .d file and which is parsed the same way as depend mode does.

The motivation of this patch is to increase the chances of a cache hit after only whitespace changes. In particular, removing the line directives may ignore added/removed empty or comment comment lines from the source input. It was possible to ignore even more whitespace changes with CCACHE_UNIFY option, that unfortunately was removed in ccache 3.7.7 (https://github.com/ccache/ccache/commit/0774bfe4c71b342822170057df0aef15d371d892).

What kinds of whitespace changes can be ignored depends on the compiler: gcc's -E -P mode removes all empty lines, but keeps indention and only normalizes horizontal whitespace to at most a single space. clang's -E -P mode even only discards empty lines if there are more than 8 of them. However, I implemented a -fnormalize-whitespace mode in clang (https://github.com/meinersbur/llvm-project/tree/normalize-whitespace) that basically does what ccache's unify did, ignoring almost all formatting changes. If it finds use I would upstream this to llvm project.

This approach holds several advantages:

  • Preprocessed file does not need to be parsed, just to be hashed.
  • All compilers with a gcc-compatible command line interface have to emit a make-compatible dependency file, while the line-directive is compiler-specific.
  • Smaller .i/.ii temporary file, less to hash
  • Ignores more irrelevant source changes, such as adding/removing empty lines or comments, indention
  • Can handle -P being passed as an argument (#812)

Most problems of CCACHE_UNIFY do not apply, only the following disadvantages remain:

  • Compiler diagnostics may be off if e.g. only an empty line is added or removed from the source.
  • Also, debug info line can be off.
  • __LINE__ macros (e.g. in an assert) will cause a miss if the number of lines before changed.

The last is probably what you want, but may reduce the usefulness of this mode. One might want to use an assert that does not print the line. The first two can be mitigated by not using this mode when -g is pass and not caching if there is any diagnostic output. Direct mode hits are still possible. A sloppiness option could still enable -P-mode nonetheless. Even though -fnormalize-whitespace is only supported with clang, the mode could still by uses by pipeing through clang -fpreprocessed -fnormalize-whitespace or by linking libclang into ccache.

This patch is a draft, I only tested it with my workflow. Missing changes in the documentation, statistics, configuration and different compilers and modes (such as invoking only the preprocessor without compilation) may not work. If the project is interested in merging such a patch, I'd appreciate hints on what still needs to be done or if the maintainers take over from here. It helps me being less hesitant to clang-format or edit comments in central header files.

Meinersbur avatar Mar 07 '21 22:03 Meinersbur

Thanks for the idea.

I would suggest changing the title to something like "Re-add a unify mode using -P" so that it's easier to understand the intent.

I'm OK with the idea in principle if the implementation can be made correct without too much added complexity. The devil's in the details as can be witnessed by many tests failing with the current POC.

Preprocessed file does not need to be parsed, just to be hashed.

That's not quite true. The preprocessed file should still be scanned for .incbin and log messages from the distcc-pump wrapper.

All compilers with a gcc-compatible command line interface have to emit a make-compatible dependency file, while the line-directive is compiler-specific.

Right. Note though that ccache supports compilers without 100% GCC-like behavior for dependency files, most notably EDG-based compilers (such as the Green Hills compiler, GHS). I think that the mode has to be opt in.

Can handle -P being passed as an argument (#812)

As noted in https://github.com/ccache/ccache/issues/812#issuecomment-797035331, I currently don't think it's terribly important to support -P.

jrosdahl avatar Mar 13 '21 09:03 jrosdahl

I would suggest changing the title to something like "Re-add a unify mode using -P" so that it's easier to understand the intent.

I changed the title to you suggestion but I think it is misleading because without -fnormalize-whitespace the unification is incomplete.

Preprocessed file does not need to be parsed, just to be hashed.

That's not quite true. The preprocessed file should still be scanned for .incbin and log messages from the distcc-pump wrapper.

Incbin is non-standard and I don't think a lot of projects are using it. The more compatible options would be something like xxd -i, bin2c or ld -b binary. C++23 is considering adding std::embed or #embed to the standard, and they want to ensure that -MD/-MMD will be able to include the filename.

However, for safety and distcc-pump I agree we may still need to parse the file, although I would add a sloppiness option to disable that scanning for projects that know they don't use incbin. I could try implementing detection of .incbin dependency in clang. Note that the current detection is not 100% safe since one can hide the .incbin command like __asm__(".inc" "bin"), depends-mode will miss it, and while the blake3 implementation is efficient and vectorized, the parsing code is not.

Right. Note though that ccache supports compilers without 100% GCC-like behavior for dependency files, most notably EDG-based compilers (such as the Green Hills compiler, GHS). I think that the mode has to be opt in.

OK, I don't know these compilers. What if gcc/clang was detected?

Meinersbur avatar Mar 13 '21 14:03 Meinersbur

What should the option enabling this be called? unify_mode?

Meinersbur avatar Mar 15 '21 13:03 Meinersbur

I changed the title to you suggestion but I think it is misleading because without -fnormalize-whitespace the unification is incomplete.

Even if it's not functionally equivalent to the old unify mode, I guess it still has to be called something (perhaps not in the PR but at least in the documentation) and I thought that "a unify mode" would be fluffy enough to hint that it's not "the unify mode". Better name suggestions are of course welcome.

Incbin is non-standard and I don't think a lot of projects are using it.

I think that the Linux kernel project motivates the detection code just fine even though it's only one project (see #136).

The more compatible options would be something like xxd -i, bin2c or ld -b binary.

If you convince all ccache users that they should use something else then we can for sure drop the detection. :wink:

[...] However, for safety and distcc-pump I agree we may still need to parse the file, although I would add a sloppiness option to disable that scanning for projects that know they don't use incbin.

That would be perfectly fine.

I could try implementing detection of .incbin dependency in clang.

Not sure I follow here. If you suggest depending on a potential future implementation in Clang I think I would need more sales pitches.

Note that the current detection is not 100% safe since one can hide the .incbin command like __asm__(".inc" "bin")

Right, and then the user deserves to get a false cache hit.

depends-mode will miss it

That's a good point. I would consider that a bug in the depend mode. For the record, I'm not convinced that I made the right choice to accept the depend mode into ccache since it adds complexity and is a bit hard to reason about, but as long as it's opt-in it doesn't do too much harm.

and while the blake3 implementation is efficient and vectorized, the parsing code is not.

Of course.

OK, I don't know these compilers. What if gcc/clang was detected?

Doesn't it need to be opt in anyway because of the issues with line numbers in compiler diagnostics, assert macros, etc being potentially off?

What should the option enabling this be called? unify_mode?

I don't have any better suggestion at the moment.

jrosdahl avatar Mar 18 '21 20:03 jrosdahl

The patch is ready for review. If preferred, I could extract parts of it into separate pull requests. To speed-up the review, please feel free to directly edit this PR.

This assumes that support for the -fnormalize-whitespace will be accepted before the Clang 13 release (https://reviews.llvm.org/D104601). I suggest to wait until that before merging.

Meinersbur avatar Jun 20 '21 06:06 Meinersbur

Thanks.

I just wanted to say that I haven't forgotten about this PR, but I'm currently focusing on storage backends so I'll have to come back to this at some later time.

jrosdahl avatar Jul 01 '21 19:07 jrosdahl

OK, usefulness is limited before we get -fnormalize-whitespace anyway.

Meinersbur avatar Jul 01 '21 19:07 Meinersbur

The feature was accepted in clang from clang 14.0.0 (NOT the next release) as -fminimize-whitespace.

Note I rebase only to the commit before accd21e452da13f12877571f90dae7e34feca23b. That commit makes unittests read the user's configuration in .ccache/ccache.conf and cause it fail if it diverges from the default values.

Meinersbur avatar Aug 07 '21 07:08 Meinersbur

Note I rebase only to the commit before accd21e. That commit makes unittests read the user's configuration in .ccache/ccache.conf and cause it fail if it diverges from the default values.

Thanks for noting this. Fixed in a2e6316eae064cb283beb481d2135f5188d0bd9a.

jrosdahl avatar Aug 07 '21 11:08 jrosdahl

I rebased everything to the current master branch. I would appreciate a review.

Meinersbur avatar Aug 16 '21 17:08 Meinersbur

ping

Meinersbur avatar Sep 07 '21 17:09 Meinersbur

ping

Meinersbur avatar Sep 22 '21 05:09 Meinersbur

The status is still that I haven't forgotten about this PR. No ping noise needed or appreciated. Since this PR has become more complex than I initially thought it would I haven't had time yet to think things through enough to give good feedback.

jrosdahl avatar Sep 26 '21 19:09 jrosdahl

The status is still that I haven't forgotten about this PR, but I consider it low priority (sorry) and I don't have time to engage with the topic.

jrosdahl avatar Feb 15 '22 20:02 jrosdahl