Building solution with forced out-of-proc nodes fails

Open rainersigwald opened this issue 5 years ago • 9 comments

📝 I'm not sure we're going to be very motivated to fix this, but it cost me a lot of time debugging bootstrapped-build failures in #3365, so I'm filing it.

Steps to reproduce

Build a solution using the MSBuild task

<Project>
 <Target Name="BuildSln">
  <MSBuild Projects="some.sln" BuildInParallel="true" />
 </Target>
</Project>

With MSBUILDNOINPROCNODE set to 1, this results in an error:

S:\msbuild\some.sln(2,1): error MSB4025: The project file could not be loaded. Data at the root level is invalid. Line 2, position 1.

Why?

Solution metaprojects are built based on an in-memory instance which isn't transferable. See

https://github.com/Microsoft/msbuild/blob/2a012e653766eb261b09b29b3106f4eb57a7f61d/src/Build/BackEnd/Components/Scheduler/Scheduler.cs#L1254-L1260

So when forced to out-of-proc nodes, the solution build fails.

Environment data

msbuild /version output: 15.8.139-preview+g5951330944 for .NET Framework (also on Core)

rainersigwald avatar Jul 16 '18 15:07 rainersigwald

Late-joining this thread just to give an additional heads-up: getting the MSB4025 error on your headless build servers doesn't necessarily mean that you are having issues with

MSBUILDNOINPROCNODE=1

There's another, far simpler reason this exact error on the .sln might occur:

Your .sln file might indeed have become malformed due to a bad merge. See this answer for a description of the problem and a solution:

https://stackoverflow.com/a/52932050/863651
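One quick way to rule this out is to scan the .sln for leftover Git conflict markers. This is a minimal sketch that assumes the damage takes the form of conflict markers, which is only one way a merge can corrupt the file; `find_conflict_markers` and `check_solution` are hypothetical helper names, not part of any MSBuild tooling.

```python
from pathlib import Path

# Standard Git conflict markers. A healthy .sln should never start a line with these.
CONFLICT_PREFIXES = ("<<<<<<< ", "=======", ">>>>>>> ")


def find_conflict_markers(sln_text):
    """Return (line_number, line) pairs that look like leftover merge markers."""
    hits = []
    for number, line in enumerate(sln_text.splitlines(), start=1):
        if line.startswith(CONFLICT_PREFIXES):
            hits.append((number, line))
    return hits


def check_solution(path):
    """Hypothetical helper: read a .sln from disk and report suspicious lines."""
    # utf-8-sig tolerates the byte-order mark that some .sln files start with.
    text = Path(path).read_text(encoding="utf-8-sig", errors="replace")
    return find_conflict_markers(text)
```

An empty result does not prove the file is well formed (duplicated or truncated sections would not be caught), so it only narrows things down.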

Just my 2c.

dsidirop avatar Oct 22 '18 15:10 dsidirop

I just hit this, and it looks like you don't have to use the MSBuild task. I get it when I run msbuild /restore /nr:false in a folder with a .sln file and I have MSBUILDNOINPROCNODE set to 1.

dsplaisted avatar Apr 21 '20 23:04 dsplaisted

Setting MSBUILD_PROJECTINSTANCE_TRANSLATION_MODE to full works around this issue, as it makes sure that the ProjectInstance is correctly moved to the out-of-proc node. It also looks like a lead toward a reasonable fix:

  1. Make ProjectInstance.TranslateEntireState return true for meta projects.
  2. Remove the assert.
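For anyone who wants to try the workaround from a script, here is a rough sketch. It assumes `msbuild` is on PATH; `node_env` and `run_msbuild` are hypothetical helper names. The two environment variable names and the value `full` come from this thread.

```python
import os
import subprocess


def node_env(base=None, translation_mode=None):
    """Build an environment that forces out-of-proc nodes.

    translation_mode="full" additionally applies the workaround described above.
    """
    env = dict(os.environ if base is None else base)
    env["MSBUILDNOINPROCNODE"] = "1"
    if translation_mode is not None:
        env["MSBUILD_PROJECTINSTANCE_TRANSLATION_MODE"] = translation_mode
    return env


def run_msbuild(args, env):
    """Run msbuild with the given arguments and environment; return its exit code."""
    # Assumption: an msbuild executable is resolvable on PATH.
    return subprocess.run(["msbuild", *args], env=env).returncode
```

Calling `run_msbuild(["some.sln"], node_env())` and then `run_msbuild(["some.sln"], node_env(translation_mode="full"))` would compare the failing and worked-around configurations side by side.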

ladipro avatar Apr 22 '21 14:04 ladipro

@rokonec,

I tried setting MSBUILDNOINPROCNODE=1 and running build.cmd, and it failed with a recent MSBuild. @BenVillalobos confirmed he had the same result, so I think this is not fully resolved, unfortunately.

Forgind avatar Aug 26 '21 23:08 Forgind

@cdmihai says this reproduces in 16.11 but not 17.0: https://github.com/dotnet/msbuild/issues/6818#issuecomment-914542505

Do we know what fixed it?

KirillOsenkov avatar Sep 09 '21 01:09 KirillOsenkov

It gets weirder and weirder. When I set the translation mode env var to full, I get... an OOM from the node process??

16.11.0+0538acc04

KirillOsenkov avatar Sep 09 '21 01:09 KirillOsenkov

That wasn't my experience. It failed with 17.0.

I would guess that #6385 made this substantially better for most cases (or maybe just empty .slns? .slns with 0-1 projects?) but not all cases.

Forgind avatar Sep 09 '21 15:09 Forgind

It should be using image

Forgind avatar Sep 09 '21 15:09 Forgind

I can reproduce this issue on VS2019 16.11.3 (MSBuild 16.11.0+0538acc04) with a recent update to MSBuildProjectCreation as discovered during analysis of https://github.com/jeffkl/MSBuildProjectCreator/issues/128

japj avatar Sep 16 '21 14:09 japj

Hmm, I can still repro this on 17.5.1+f6fdcf537 for .NET Framework, by following the exact repro steps at the top.

I thought Roman's change should have fixed it? Is there another bug lurking here?

KirillOsenkov avatar Apr 23 '23 01:04 KirillOsenkov

From what I currently understand, the solution metaproject information is not properly transferred between nodes and possibly cannot be transferred between nodes. @mattaquinn commented in https://github.com/dotnet/msbuild/issues/8184#issuecomment-1339629311 that adding MSBUILDEMITSOLUTION=1 resolved the problem but also leaves extra solution metaproject files lying around, as should be expected with that escape hatch.

As I recall, when we try to load the solution metaproject file in a node, we first check whether it's in our in-memory cache, and if that fails, we go to disk. (Even metaprojects not written to disk have a pretend path.) My hunch is that if we have an in-proc node, with rokonec's change, we translate the full state and send it to all the other nodes. If not, there's no node to do that properly, since there's no central scheduler node. Writing it to disk via MSBUILDEMITSOLUTION means that although our in-memory cache check still fails, we actually find it on disk, so we can load it in each worker node and use it.
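To make that hunch concrete, here is a toy model of the lookup order as described above: in-memory cache first, then disk. `WorkerNodeModel` and its members are hypothetical and are not MSBuild's real classes; the sketch only mirrors the behavior as recalled here.

```python
from pathlib import Path


class WorkerNodeModel:
    """Toy model of one worker node's project lookup; not MSBuild's real code."""

    def __init__(self):
        # Maps a project path (possibly a pretend path) to its contents.
        self.in_memory = {}

    def load_project(self, path):
        # Step 1: check the in-memory cache, keyed by path.
        if path in self.in_memory:
            return self.in_memory[path]
        # Step 2: fall back to disk. A metaproject that was never written out
        # has only a pretend path, so this is the step that fails.
        file = Path(path)
        if file.is_file():
            return file.read_text(encoding="utf-8")
        raise FileNotFoundError(path)
```

In this model, a node that received the translated state has the entry in `in_memory`, a node running with MSBUILDEMITSOLUTION finds the file in step 2, and a node with neither fails to load the metaproject.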

The ideal solution from an engineering perspective might be to focus on the translation: if the solution metaproject has not been created, parse the solution file properly and send that information to the other worker nodes. The catch is that no node is the obvious one to take priority and do that.

I propose having MSBUILDNOINPROCNODE implicitly turn on MSBUILDEMITSOLUTION, but, at the end of the build, we then make sure to clean up, i.e., delete the emitted metaproject again.
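Roughly, the emit-then-clean-up behavior could look like the wrapper below, written as an external script rather than as a change inside MSBuild. `build_with_emitted_metaproj` and the injected `run_build` callable are hypothetical, and the `*.metaproj*` glob is an assumption about how the emitted files are named.

```python
import os
from pathlib import Path


def build_with_emitted_metaproj(solution_dir, run_build, environ=None):
    """Run run_build(env) with MSBUILDEMITSOLUTION=1, then delete newly emitted files.

    run_build is any callable that takes an environment dict and returns a result.
    """
    root = Path(solution_dir)
    # Remember what already existed so pre-existing files are never deleted.
    before = set(root.rglob("*.metaproj*"))
    env = dict(os.environ if environ is None else environ)
    env["MSBUILDEMITSOLUTION"] = "1"
    try:
        return run_build(env)
    finally:
        # Clean up even when the build raises.
        for path in set(root.rglob("*.metaproj*")) - before:
            if path.is_file():
                path.unlink()
```

Taking a snapshot before the build means a metaproject that someone emitted on purpose earlier is left alone; only files created during this build are removed.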

What do people think of that as a solution for this? Note that it's been open since 2018, so I'm not overly optimistic about a complex fix actually landing 😉

Forgind avatar May 02 '23 15:05 Forgind

If you set MSBUILDEMITSOLUTION implicitly, would it be feasible to emit the metaproject files to a temporary directory rather than the solution directory? That would prevent conflicts between parallel MSBuild invocations, and errors if the directory is not writable. OTOH, perhaps that would cause MSBuild to look for solution customisations in the wrong directory.

KalleOlaviNiemitalo avatar May 02 '23 17:05 KalleOlaviNiemitalo

I don't think we should have to worry about parallel MSBuild invocations, as this is for building .sln files. The sln itself isn't multitargeted, and if you try to build it multiple times in parallel, we may have intermediate outputs that overlap and lead to contention anyway. What are the solution customizations you're thinking of? Something related to the project cache?

Forgind avatar May 02 '23 22:05 Forgind

I meant building a solution with a build directory tree that is separate from the source directory tree, which could then be read-only. This would likely require the caller of MSBuild to set properties to specify the directory paths. Such builds done in parallel could involve different target platforms or other compiler options.

With "solution customizations", I meant files such as Directory.Solution.props and Directory.Solution.targets, which should still be read from the directory of the solution file even if the generated metaproj files are instead saved to a temporary directory.

KalleOlaviNiemitalo avatar May 03 '23 02:05 KalleOlaviNiemitalo

Hmm...that is an excellent point. MSBuild looks for quite a few things right next to it and uses relative paths for more, so moving it to another directory would be tricky, but I can see why writing an actual file to the source directory might fail.

We could theoretically try to copy everything into a temp directory, run the build, and copy everything back, but I don't like that solution at all. Doing a partial copy feels like it might be a bug farm to me. I'd like to emit the metaproj to a temp directory then tell MSBuild it's in the real directory, but that would defeat the point of really emitting it in the first place. It sounds like there are real, functional reasons to implement a full cross-node serialization scheme, but I'm not sure that's happening any time soon.

I'm tempted to go back to my original proposal and just accept that there will be some cases in which it will still fail, but it would be nice to know how common those cases are, and I truly have no idea on that point.

Forgind avatar May 03 '23 23:05 Forgind