Move the semanticdb generation logic to `compile` and make sure it's shared where possible
This PR optimizes and reworks the compilation pipeline, with regards to SemanticDB generation.
Pre-PR situation
- there were separate
compileandsemanticDbDetailedtasks. compileperformed the compilation without semantic DB plugin.semanticDbDetailedperformed the compilation with the semantic DB plugin, but did not reuse thecompile's output.
Which meant that if any of these happened:
- mill cli invoked
compileand thensemanticDbData. - mill BSP invoked
compileand then any other task that required the semantic db, or vice-versa.
The compilation would have been performed twice, wasting CPU cycles and worsening the developer experience.
Post PR situation
Mill now smartly chooses whether compile produces semanticdb data or not. semanticdb is produced if:
compilewas directly invoked by a task that needs semanticdb.- there is at least one BSP client that requires semanticdb to be produced.
Implementation details
Introduction of MILL_BSP_OUTPUT_DIR
Previously you could use MILL_OUTPUT_DIR environment variable to set both regular and bsp mill's output directory to a certain folder.
Because regular mill now needs to know the location of the BSP folder, having one variable is problematic:
- you run
MILL_OUTPUT_DIR=out_bsp ./mill --bsp ... - you want to run regular mill, but provide it a changed path for bsp mill.
MILL_OUTPUT_DIR=out_bsp ./mill ...changes the regular mill out folder.
Thus MILL_BSP_OUTPUT_DIR is introduced, which allows you to:
MILL_BSP_OUTPUT_DIR=out_bsp ./mill --bsp ...MILL_BSP_OUTPUT_DIR=out_bsp ./mill ... # this still uses the regular out/ folder, but knows where bsp mill out/ folder is located
BuildCtx.bspSemanticDbSessionsFolder
Folder in the filesystem where Mill's BSP sessions that require semanticdb store an indicator file (name = process PID, contents are irrelevant) to communicate to main Mill daemon and other BSP sessions that there is at least one Mill session that will need the semanticdb.
The reasoning is that if at least one of Mill's clients requests semanticdb, then there is no point in running regular compile without semanticdb, as eventually we will have to rerun it with semanticdb, and thus we should compile with semanticdb upfront to avoid paying the price of compling twice (without semanticdb and then with it).
CompilationResult.semanticDbFiles
Because we can't change the return type of compile due to binary compatibility, semanticDbFiles field was added to CompilationResult and compile fills it in if the compilation happened with semanticdb enabled.
Removal of CompileFor and related tasks
These are not needed anymore with the single compile task.
Removal of separate SemanticDbJavaModule.semanticDbDataDetailed task
It's functionality was merged to compileInternal, which takes a compileSemanticDb parameter.
There also was a lot of code duplication between compile and semanticDbDataDetailed tasks, for both java and scala modules.
Replace the implicit Task.dest usage in ZincWorker with explicit compileTo argument
This makes it clearer what the parameter is used for and allows to reuse the same value in SemanticDbJavaModule.enhanceCompilationResultWithSemanticDb invocation.
Misc changes
- Moved
JavaModule#resolveRelativeToOutinstance method toUnresolvedPath.resolveRelativeToOut. - Improved
Serverto provide better debugging output if the server cannot be launched. testScala212Versionupdated to2.12.20because new semanticdb plugin is not provided for the ancient2.12.6version that was used.
Fixes: https://github.com/com-lihaoyi/mill/issues/5744
TBH, the current state (of this PR) looks way to complicated and makes too much assumptions for my taste. But I admit, the current codebase (in this context) is already far from being simple.
I'd like to split the problem and provide clear solutions separated from each other.
Issue 1: Bad compilation performance
- we currently compile too much, since the
semanticDbDatatask duplicates compilation work already done in thecompiletask. - by always compiling with the semanticDB generator enabled, we could optimize the compilation for the BSP use case and would also ensure sync'ed results, but we potenially leak unwanted semanticDb data downstream.
Issue 2: Decide when we need semanticDB data
- explicit: user enabled semanticDB in the module via
scalacOptions- thecompileresult will contain the semanticDB data files and we should not not apply any extra processing - implicit: user uses Metals as the IDE - the
compiletask should not contain any semanticDB data files, as these are considered unwanted results (e.g. they should not appear downstream on classpaths or in jars) - no: we don't need semanticDB data at all - we should not generate it
Proposal for Issue 1:
Disclaimer: proposed task names are not final but choosen to make the concept clear
- create a new persistent
compileWithMaybeSemanticDbtask which does the actual compilation, and include semanticDB data if we, for some reasons, need them. - Let the
compiletask use the result ofcompileWithMaybeSemanticDbbut filter out semanticDB data, iff it was not explicitly requested by the module configuration, e.g. viascalacOptions. - Let the
semanticDbDatatask use the result ofcompileWithMaybeSemanticDband filter out any non-semanticDB data. - We keep the current concept of well-separated tasks with well-defined results. All downstream users, esp. the BSP client or
mill-scalafixplugin keep as-is, but better performing.
Ideas for Issue 2:
-
Maybe too simple, but we could always generate semanticDB data in
compileWithMaybeSemanticDband just don't use it downstream if nobody is interested in it. This has an overhead of up to 20 percent in case nobody is going to need it. (Very unlikely, but it may also conflict with other compiler plugins and fail the compilation that otherwise would succeed.) -
Smart-decision for semanticDB data need. Either project use of the
semanticDbDatatask or any BSP use should permanently enable it. This must be a bullet-proof design, well-documented and users need a way to disable it (opt-out or opt-in). To detect BSP, we should just use the fact that a Mill-generated.bsp/mill-bsp.jsonfile is present, since this won't require any extra book keeping. Users, who don't want to use BSP can also safely remove that file. We could also write an extra file next to this location, so we can check its age, for example. Once the semanticDbData task is used/planned, we may record that fact under the namespace of a dedicated module-specific persistent tasksemanticDbDataGenerationWanted. This has the issue that thesemanticDbDatatask is a downstream dependency ofcompileWithMaybeSemanticDbtask and we currently can't know if the initial value is correct (so factually we always need to guess "enabled"). It would be really cool to have some way to detect early what the user is going to run. E.g. if we could expose the current execution plan via theTaskCtx. That way we could have an early persistent tasksemanticDbDataGenerationWantedthat decides based on it's persistent state and the fact whether thesemanticDbDatatask is requested. Then we could conservatively default to disabled, unless there is more evidence.
@arturaz I think we should try a bit to simplify this. Some feature-flagging on and off is inevitable, but some of the interactions and dependencies introduced here are a bit obscure and unusual
Because regular mill now needs to know the location of the BSP folder, having one variable is problematic:
Why exactly does regular Mill need to know the location of the BSP folder? This is a bit of a weird interaction/dependency that I'd like to cut if at all possible
The reasoning is that if at least one of Mill's clients requests semanticdb, then there is no point in running regular compile without semanticdb, as eventually we will have to rerun it with semanticdb, and thus we should compile with semanticdb upfront to avoid paying the price of compling twice (without semanticdb and then with it).
Rather than than trying to detect this dynamically, let's just hard-code the needs-semanticdb-or-not decision on process start: Metals would spawn a BSP process needing semanticdb, and IntelliJ would spawn a BSP process not needing it. If IntelliJ re-uses a metals process or vice versa, worst case is there may be some inefficiency (either running compile unnecessarily, or generating semanticdbs unnnecessarily), but I think the simplicity is worth the tradeoff.
Last thing is how was this tested. Manual testing and observing a speedup? Add some integration test to make sure that some BSP workflow doesn't call semanticDbDataDetailed? Whatever it is, it should be mentioned in the PR description
I think those are the three last things before we merge this