
Bloop server reports incorrect compile errors when client requests every target be compiled.

Open jackkoenig opened this issue 2 months ago • 12 comments

Suppose Project B depends on Project A, and my client sends compile requests in order, e.g.

  1. Compile A
  2. Wait for compile response
  3. Compile B

The compilation will fail [incorrectly] in a way that clearly shows something wrong with the classpath from A. The two failure modes I've seen are:

  • Null pointer exception in scalac looking at the classpath directories
  • Simple compile error that shows something is missing (e.g. foo is not a member of Bar where Bar comes from A and foo is definitely a val on the class)

Now this doesn't happen for most things, but does happen consistently in a very large codebase with lots of build units.

A similar issue I'm seeing that happens much more frequently is if we issue the compile requests out-of-order. I can't be sure it's the same issue but the errors manifest in similar ways so I assume this is just a quicker way to hit the same problem:

  1. Compile B
  2. Compile A

If I just compile B, it compiles both no problem, but sometimes (reproducibly in my code base but not on every pair of dependent projects) following up quickly with the request to compile A will cause the B compilation to fail with an error suggesting that A is missing from the classpath (or at least is partially missing).

Other potentially relevant information:

  • I use incrementing TaskIds for each request with no parent ids (I don't yet understand what the child-parent relationship is for)
  • I use different OriginIds for each request

I'm trying to debug to get more information and reproduce on a smaller, open-source code base. Will report back with anything I figure out.

Note that for each of these if I reissue the Compile B request after the initial failure, it will compile successfully. I think there's something going on with Bloop moving or deleting the classes of A while B is compiling.

jackkoenig avatar Oct 20 '25 18:10 jackkoenig

I've been debugging this a bit and have a more detailed assessment:

I can confirm that the issue is that Bloop is deleting internal classes directories that still need to be used by outstanding compilations.

My internal codebase is more complicated than this, but I think it comes down to the following project structure and compile requests[^1]:

  • Project A
  • Project B depends on A
  • Project C depends on B

My client issues 4 compile requests (with originId of the request in brackets):

  • [1] Compile C
  • [2] Compile A
  • <wait till compile response received for 2>
  • [3] Compile B
  • <wait till compile response received for 3>
  • [4] Compile C

The problem is that compile with originId 1 (the first Compile C) fails where classes from A are missing from the classpath. The basic problem is that the bloop-internal-classes directory for A has been deleted but is still being used by the compilation of C for originId 1.

There are lots of logs, but in my debugging I think the following are interesting (in order). I’ve included the originId that each message was sent to:

  1. [1] Create new counter classes-empty-A
  2. [2] Create new counter classes-empty-A
  3. [2] Decrementing counter for classes-empty-A to 1 (I added this debug log)
  4. [3] Create new counter bloop-internal-classes/A-classes-<hash>
  5. [3] Finished copying classes from bloop-internal-classes/A-classes-<hash> to stable location requested via clientClassesRootDir
  6. [3] Decrementing counter for bloop-internal-classes/A-classes-<hash> to 0 (I added this debug log)
  7. [3] Deleting contents of orphan dir bloop-internal-classes/A-classes-<hash>
  8. [1] The Scala compiler is invoked with… classpath including both classes-empty-A and bloop-internal-classes/A-classes-<hash> (this is compiling C)

(8) is the compile that fails. The main thing that puzzles me is:

Why are [1] and [2] gatekeeping on the default initial empty classes directory instead of the one where the output artifacts are actually used? My instinct is that the minimal bug fix is related to this.
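To make the sequence in the log concrete, here is a minimal in-memory sketch of how per-directory counters can let one request delete a directory that another request is about to read. It is only a model of the eight steps above: `DirCounters`, `acquire`, `release`, and `PrematureDeleteDemo` are hypothetical names, not Bloop's actual classes, and the "directories" are just strings in a set.

```scala
import scala.collection.mutable

// Hypothetical, simplified model of per-directory reference counting.
// None of these names come from Bloop; they only mirror the log lines above.
final class DirCounters {
  private val counts = mutable.Map.empty[String, Int]
  val existing = mutable.Set.empty[String]

  // A request registers interest in a directory it expects to read.
  // As a simplification, registering also marks the directory as present.
  def acquire(dir: String): Unit = {
    existing += dir
    counts(dir) = counts.getOrElse(dir, 0) + 1
  }

  // Releasing the last reference deletes the directory as an "orphan".
  def release(dir: String): Unit = {
    val remaining = counts.getOrElse(dir, 0) - 1
    if (remaining <= 0) {
      counts -= dir
      existing -= dir
    } else counts(dir) = remaining
  }
}

object PrematureDeleteDemo {
  val emptyA = "classes-empty-A"
  val internalA = "bloop-internal-classes/A-classes-<hash>"

  // Replays the eight log steps and returns the classpath entries that
  // are missing when request [1] finally invokes the compiler for C.
  def missingForRequest1(): List[String] = {
    val dirs = new DirCounters
    dirs.acquire(emptyA)     // 1. [1] guards only the empty directory
    dirs.acquire(emptyA)     // 2. [2] guards the same empty directory
    dirs.release(emptyA)     // 3. [2] done, counter drops to 1
    dirs.acquire(internalA)  // 4. [3] guards the real output of A
    // 5. [3] copies classes to the client directory (not modelled here)
    dirs.release(internalA)  // 6-7. [3] done, counter hits 0, dir deleted
    // 8. [1] compiles C against both directories
    List(emptyA, internalA).filterNot(dirs.existing.contains)
  }
}
```

In this model request [1] never holds a reference on the internal directory, so nothing stops request [3] from dropping its counter to zero and deleting it. That matches the symptom, although the real cause in Bloop may involve more than this.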

Another question: why is the compile of C for originId 1 using the internal classes directory for A rather than the stable location from clientClassesRootDir?

I'm hopeful this is enough information to clue the developers into what might be wrong or at least to how I can reproduce this in a simple example. I think timing is involved. Are there unit tests that are able to force certain timing/ordering that might be able to reproduce this sort of issue?

[^1]: Unfortunately I think the timing does matter a lot so I think the several other build units I have also being compiled are part of causing it to manifest.

jackkoenig avatar Nov 19 '25 22:11 jackkoenig

Why are [1] and [2] gatekeeping on the default initial empty classes directory instead of the one where the output artifacts are actually used? My instinct is that the minimal bug fix is related to this.

At first we don't have any previous results, which is why we have the empty directory set. I guess we should gatekeep both.

Another question: why is the compile of C for originId 1 using the internal classes directory for A rather than the stable location from clientClassesRootDir?

I think that's the default behaviour to be able to compile downstream projects right away and just copy files to the client directory concurrently.

I'm hopeful this is enough information to clue the developers into what might be wrong or at least to how I can reproduce this in a simple example. I think timing is involved. Are there unit tests that are able to force certain timing/ordering that might be able to reproduce this sort of issue?

BspCompileSpec has some similar tests. One issue that we fixed, which was also related to races, added the add-method-during-compilation test.

I can try and take a look at this, unless you have the time to work on the reproduction test?

PS: Sorry for not replying earlier, I had this in my TODO list for too long 😓

tgodzik avatar Nov 20 '25 16:11 tgodzik

Btw. does this problem happen only at the start if no other compilation was running?

tgodzik avatar Nov 20 '25 16:11 tgodzik

Why are [1] and [2] gatekeeping on the default initial empty classes directory instead of the one where the output artifacts are actually used? My instinct is that the minimal bug fix is related to this.

At first we don't have any previous results, which is why we have the empty directory set. I guess we should gatekeep both.

Yeah I think "both" makes more sense. Since I'm looking at a first compile I couldn't see any reason why you would care about the previous result (it being empty of course), but I assume on subsequent compiles you might actually use the previous result if the dependency doesn't need recompilation so it makes sense to gatekeep on it as well. But yeah I think not gatekeeping on the results directory that you plan to use in a downstream compile is probably the problem.

Another question: why is the compile of C for originId 1 using the internal classes directory for A rather than the stable location from clientClassesRootDir?

I think that's the default behaviour to be able to compile downstream projects right away and just copy files to the client directory concurrently.

Yeah makes sense, I suspect this is where the timing comes in. The compilation of the downstream project is delayed a bit in my code base due to other build units--if one tried to reproduce my description above it probably wouldn't work since we need that delay.

I'm hopeful this is enough information to clue the developers into what might be wrong or at least to how I can reproduce this in a simple example. I think timing is involved. Are there unit tests that are able to force certain timing/ordering that might be able to reproduce this sort of issue?

BspCompileSpec has some similar tests. One issue that we fixed, which was also related to races, added the add-method-during-compilation test.

I can try and take a look at this, unless you have the time to work on the reproduction test?

Thank you very much for the pointers! I will give reproducing it a try and report back.

PS: Sorry for not replying earlier, I had this in my TODO list for too long 😓

There's nothing to apologize for, thank you for responding 🙂

Btw. does this problem happen only at the start if no other compilation was running?

That is a good question and I'm not sure. I doubt other compilations matter but I think I can check this with my project structure.

Is there any other information I can gather that you think will help narrow it down? I've got detailed logs and am comfortable enough in the codebase to add debug prints.

I have a branch where I let GPT-5 try to fix it (before I understood the problem a bit better). It might actually be on to something (it did fix this test case) but the solution is very AI-y so I didn't want to try PR-ing it. Here in case it is useful to you: https://github.com/jackkoenig/bloop/tree/debug-race (check individual commits, GPT-5 undid some of its changes).

jackkoenig avatar Nov 20 '25 16:11 jackkoenig

I wonder if it wouldn't make sense to only clean up old directories if no compilation request is running. We could just gather a list of those directories. At the end of any compilation just check if no other is running. That requires a single counter instead of multiple and reduces the counter juggling that is happening now.

I tried another approach in https://github.com/scalacenter/bloop/compare/main...tgodzik:bloop:guard-also-output?expand=1 but it always risks race conditions.

So the alternative would be to do something akin to https://github.com/scalacenter/bloop/compare/main...tgodzik:bloop:alternative?expand=1
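As a rough illustration of the "single counter" idea described above, here is a minimal sketch. It is not the code from the linked branch: `DeferredCleanup`, `markObsolete`, and the other names are hypothetical, and the `delete` callback stands in for whatever actually removes a directory.

```scala
import scala.collection.mutable

// Hypothetical sketch: one global counter of running compilations plus a
// list of directories waiting to be deleted. Not taken from the Bloop branch.
final class DeferredCleanup(delete: String => Unit) {
  private var running = 0
  private val pending = mutable.ListBuffer.empty[String]

  def compilationStarted(): Unit = synchronized { running += 1 }

  // Old directories are only recorded here, never deleted immediately.
  def markObsolete(dir: String): Unit = synchronized { pending += dir }

  // Deletion happens only when the last running compilation finishes.
  def compilationFinished(): Unit = synchronized {
    running -= 1
    if (running == 0) {
      pending.foreach(delete)
      pending.clear()
    }
  }
}

object DeferredCleanupDemo {
  // Returns what had been deleted while [1] was still running, and
  // what had been deleted once every compilation had finished.
  def run(): (List[String], List[String]) = {
    val deleted = mutable.ListBuffer.empty[String]
    val cleanup = new DeferredCleanup(dir => { deleted += dir; () })
    cleanup.compilationStarted()            // [1] Compile C starts
    cleanup.compilationStarted()            // [3] Compile B starts
    cleanup.markObsolete("A-classes-old")   // placeholder directory name
    cleanup.compilationFinished()           // [3] finishes, [1] still running
    val whileRunning = deleted.toList
    cleanup.compilationFinished()           // [1] finishes, nothing running
    (whileRunning, deleted.toList)
  }
}
```

The trade-off in a design like this is that obsolete directories can linger for as long as any compilation is in flight, in exchange for never deleting one that an outstanding compilation might still read.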

tgodzik avatar Nov 21 '25 19:11 tgodzik

I wonder if it wouldn't make sense to only clean up old directories if no compilation request is running. We could just gather a list of those directories. At the end of any compilation just check if no other is running. That requires a single counter instead of multiple and reduces the counter juggling that is happening now.

That at least does fix the problems I'm seeing, namely https://github.com/scalacenter/bloop/compare/main...tgodzik:bloop:alternative?expand=1 seems to work for me although I'm sure more extensive testing would be warranted.

I can say that guard-also-output does not fix the problem so perhaps there's more to this than we've figured out so far.

I am trying to minimize the issue, see https://github.com/scalacenter/bloop/compare/main...jackkoenig:bloop:premature-classes-delete-test?expand=1 but it's not manifesting the issue yet. I am not sure what exactly is wrong but enabling debug logging shows that the internal classes directories are not deleted at all. There's clearly something else about my real use case that causes that to happen. Some ideas:

  1. Maybe client id matters? I'm unsure how to do these test compiles with different client ids.
  2. Maybe I need more build units? I already added "slow" (in addition to the A, B, and C I talk about above) to delay the compilation of C, but my expectation that the B compile will trigger deletion of the A internal classes directory still isn't happening.
  3. Maybe there's something else timing sensitive that matters. I'm a bit puzzled because this occurs 100% of the time in my real use case so the tolerances on the race condition are pretty forgiving, but I must be missing something.

jackkoenig avatar Nov 21 '25 22:11 jackkoenig

I can say that guard-also-output does not fix the problem so perhaps there's more to this than we've figured out so far.

I figured there might be more going on, but I would rather skip complicated logic than try to fix it. And even if we don't remove a single directory it's less of an issue than removing too early. I will try and go that direction, add some tests.

For the reproduction it sometimes helps to add macros which use Thread.sleep. That usually exacerbates the possible problem.

tgodzik avatar Nov 21 '25 23:11 tgodzik

I figured there might be more going on, but I would rather skip complicated logic than try to fix it. And even if we don't remove a single directory it's less of an issue than removing too early. I will try and go that direction, add some tests.

Indeed, and simpler logic is always easier to maintain in the future 🙂. Thank you for looking into this, I'm already able to make some forward progress on my end with your branch above.

For the reproduction it sometimes helps to add macros which use Thread.sleep. That usually exacerbates the possible problem.

Yeah I saw that, a very clever approach. I did use it in my test but just for some reason it's not deleting the internal class directories at all in my test--I'm sure it's a small thing I'm missing.

jackkoenig avatar Nov 21 '25 23:11 jackkoenig

If you're not able to create a test, I could proceed anyway with the fix from https://github.com/scalacenter/bloop/compare/main...tgodzik:bloop:alternative?expand=1

I would just create some tests to make sure old directories are removed.

tgodzik avatar Nov 25 '25 19:11 tgodzik

If you're not able to create a test, I could proceed anyway with the fix from https://github.com/scalacenter/bloop/compare/main...tgodzik:bloop:alternative?expand=1

I would just create some tests to make sure old directories are removed.

I'll keep poking to try to get my test to reproduce, but I would greatly appreciate the fix deployed 🙂

jackkoenig avatar Nov 25 '25 21:11 jackkoenig

I will take a look next week, since I have a short break coming up.

tgodzik avatar Nov 25 '25 22:11 tgodzik

Same, Happy Thanksgiving from the US 🙂 🦃

jackkoenig avatar Nov 25 '25 22:11 jackkoenig

Hey, I want to work on this issue.

ankitkumarrain avatar Dec 17 '25 14:12 ankitkumarrain

Thanks, but I will most likely finish it this week. I only plan to add an additional test to the simplified approach.

tgodzik avatar Dec 17 '25 16:12 tgodzik

Hey, I am new to contributing to Scala projects on GitHub. Is there a community or a specific place to discuss tasks or hurdles?

ankitkumarrain avatar Dec 18 '25 06:12 ankitkumarrain

If you are interested in contributing you can take a look at the issues in https://github.com/scalameta/metals or https://github.com/scalacenter/scalafix which could be a bit more friendly in terms of complexity. The issues in Bloop rely on intricate knowledge of the JVM. You can also join the Scala discord and ask around there. https://scala-lang.org/community/

As for the issue at hand, I am looking into testing but found some weirdness going on, so I am currently investigating it.

tgodzik avatar Dec 19 '25 12:12 tgodzik