
Cache parsing and preprocessing results

Open · Pennycook opened this issue 9 months ago · 1 comment

Feature/behavior summary

Running codebasin, cbi-cov or cbi-tree can take a long time for large projects, because each file must be parsed and preprocessed each time the command is run.

We could make things significantly faster (and more easily enable new use-cases) if we cached parsing and preprocessing results between command invocations.

Request attributes

  • [ ] Would this be a refactor of existing code?
  • [ ] Does this proposal require new package dependencies?
  • [ ] Would this change break backwards compatibility?

Related issues

No response

Solution description

To cache parse results:

  • [ ] Introduce a way to serialize specialization trees
  • [ ] Cache the specialization tree, using something like a hash of the file contents as a key
  • [ ] Skip parsing if a specialization tree already exists in the cache
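The three steps above could be sketched as follows. This is a minimal illustration, not the real codebasin API: the `parse_file` callable, the JSON serialization, and the on-disk layout are all assumptions.

```python
import hashlib
import json
from pathlib import Path

# Hypothetical default; see the "Additional notes" discussion below.
DEFAULT_CACHE = Path(".cbi") / "cache"

def content_key(path: Path) -> str:
    """Key the cache on file contents, so any edit invalidates stale entries."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def load_or_parse(path: Path, parse_file, cache_dir: Path = DEFAULT_CACHE) -> dict:
    """Return a cached specialization tree, or parse the file and cache it."""
    entry = cache_dir / f"{content_key(path)}.json"
    if entry.exists():
        return json.loads(entry.read_text())  # cache hit: skip parsing
    tree = parse_file(path)  # assumed to return a JSON-serializable tree
    cache_dir.mkdir(parents=True, exist_ok=True)
    entry.write_text(json.dumps(tree))
    return tree
```

Keying on a content hash (rather than path plus mtime) means moved or copied files still hit the cache, at the cost of reading every file to hash it.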

To cache preprocessing results:

  • [ ] Cache coverage JSON files, using a combination of the specialization tree and the PreprocessorConfiguration as the key
  • [ ] Skip preprocessing if coverage for a specialization tree already exists in the cache
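One way to derive such a combined key is sketched below. The config fields and their shapes are assumptions for illustration; the real PreprocessorConfiguration may serialize differently.

```python
import hashlib
import json

def coverage_key(tree: dict, config: dict) -> str:
    """Derive one cache key from a specialization tree plus a preprocessor config.

    sort_keys makes the key stable under dict-ordering differences, but list
    order is preserved, since e.g. include-path order affects preprocessing.
    """
    h = hashlib.sha256()
    h.update(json.dumps(tree, sort_keys=True).encode())
    h.update(json.dumps(config, sort_keys=True).encode())
    return h.hexdigest()
```

With this scheme, the same file preprocessed under two different sets of defines produces two distinct cache entries, which is what per-platform coverage requires.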

Additional notes

There are a few options for where we could save these files, with different trade-offs.

Using .cbi/cache would result in a per-project cache, which may help developers notice that a cache exists and investigate how it is being used.

Using .cache/cbi/ would result in a per-user cache, which might be easier to handle (because we wouldn't have to worry about concurrent updates to the cache from multiple users), and which would allow for a single cache to store the results of common files (e.g., library headers).

I'm leaning towards .cache/cbi, with an option to allow developers to specify a different cache location.
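That resolution order could look something like the sketch below. The `--cache-dir` option and `CBI_CACHE_DIR` environment variable are hypothetical names, not existing codebasin features; the per-user default follows the XDG convention.

```python
import os
from pathlib import Path
from typing import Optional

def resolve_cache_dir(override: Optional[str] = None) -> Path:
    """Prefer an explicit override, then an env var, then the XDG cache dir."""
    if override:  # e.g. from a hypothetical --cache-dir command-line option
        return Path(override)
    env = os.environ.get("CBI_CACHE_DIR")  # hypothetical variable name
    if env:
        return Path(env)
    xdg = os.environ.get("XDG_CACHE_HOME", str(Path.home() / ".cache"))
    return Path(xdg) / "cbi"
```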

Pennycook avatar Mar 25 '25 14:03 Pennycook

I built a small proof-of-concept today that dumps the coverage files for the HACC case-study, just to see how big the cache would be. It's about 11 MB (uncompressed), and the code base is about 350,000 lines of code.

Writing this proof-of-concept also made it clear to me that this is not going to be a quick and easy thing to implement, at least not before implementing #100 and/or #135. All of the analysis functions expect a setmap, but it doesn't make sense to cache one of those. In order to cache the individual coverage information associated with a specific invocation of the preprocessor, we'll need to refactor things so that functionality can be pipelined in the following way:

  1. ArgumentParser converts a CompileCommand into a PreprocessorConfiguration.
  2. The Preprocessor is invoked with a specific PreprocessorConfiguration to produce Coverage information.
  3. The Coverage information for each platform is used for analysis.
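The pipelined flow above might look roughly like this. All class and function names here are stand-ins for the real codebasin types, with placeholder logic in place of actual parsing and preprocessing.

```python
from dataclasses import dataclass, field

@dataclass
class CompileCommand:
    file: str
    arguments: list

@dataclass
class PreprocessorConfiguration:
    file: str
    defines: list = field(default_factory=list)

@dataclass
class Coverage:
    file: str
    used_lines: frozenset

def parse_arguments(cmd: CompileCommand) -> PreprocessorConfiguration:
    """Step 1: turn a CompileCommand into one PreprocessorConfiguration."""
    defines = [a[2:] for a in cmd.arguments if a.startswith("-D")]
    return PreprocessorConfiguration(cmd.file, defines)

def preprocess(config: PreprocessorConfiguration) -> Coverage:
    """Step 2: one Preprocessor invocation yields one Coverage result.

    This per-configuration result is the unit a cache entry would hold."""
    lines = frozenset(range(1, 1 + len(config.defines)))  # placeholder logic
    return Coverage(config.file, lines)

def analyze(per_platform: dict) -> dict:
    """Step 3: analysis consumes per-platform Coverage, not one merged setmap."""
    return {name: len(cov.used_lines) for name, cov in per_platform.items()}
```

Because each stage has a single input and a single output, the result of step 2 can be cached and replayed independently per platform, instead of being merged into one shared data structure.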

The current implementation invokes the Preprocessor for multiple PreprocessorConfigurations at once, and all the coverage information from different platforms gets mixed together in one data structure.

Pennycook avatar Mar 26 '25 15:03 Pennycook