llvm-symbolizer not present in base queue

Open kunalspathak opened this issue 2 years ago • 53 comments

Build

https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-77578-merge-965165820fec43e19e/JIT.Stress/1/console.f7c5d70b.log?helixlogtype=result

https://dev.azure.com/dnceng-public/public/_build/results?buildId=82793&view=ms.vss-test-web.build-test-results-tab&runId=1731386&resultId=102137&paneView=dotnet-dnceng.dnceng-build-release-tasks.helix-test-information-tab

Pull Request

https://github.com/dotnet/runtime/pull/77578

Action required for the engineering services team

Additional information about the issue reported

To triage this issue (First Responder / @dotnet/dnceng):

  • [ ] Open the failing build above and investigate
  • [ ] Add a comment explaining your findings

In https://github.com/dotnet/runtime/pull/77578, we are trying to generate the crash stack trace using llvm-symbolizer. While it is present in containers, the base Linux and macOS queues don't have it, and we see an error when trying to use it. See the logs I referenced in the issue. Can we get it and lldb installed on the base image?

CC: @hoyosjs @JulieLeeMSFT

Release Note Category

  • [x] Feature changes/additions
  • [ ] Bug fixes
  • [ ] Internal Infrastructure Improvements

Release Note Description

Add llvm and llvm-symbolizer to Ubuntu.1804.Amd64 and RedHat.7.Amd64

kunalspathak avatar Nov 14 '22 21:11 kunalspathak

Hi Kunal, we will get on this. @hoyosjs do you know if this just comes built in with llvm? lldb 3.9 is already being installed on the base ubuntu.1804 queues. Do you need a different version? This is the test queue, so I don't think it would be an issue to upgrade that to something newer, but I'd like to check before making any major changes.

michellemcdaniel avatar Nov 14 '22 21:11 michellemcdaniel

Do you know why 3.9? And llvm sounds good.

hoyosjs avatar Nov 14 '22 22:11 hoyosjs

I do not know why 3.9. Possibly historic reasons? @MattGal it looks like we set our lldb version to 3.9 back in 2020. Do you know why we're using that?

Edit: Oh, actually, we set this in 2019.

Edit: that is also a lie. I am still digging into how long ago we chose 3.9 and never updated it.

michellemcdaniel avatar Nov 14 '22 22:11 michellemcdaniel

Probably for diagnostics...

hoyosjs avatar Nov 14 '22 22:11 hoyosjs

Yeah. I think that's also what's on the docker images that y'all are using, and upgrading to something more modern is also breaking things. I worry that updating it will break y'all.

michellemcdaniel avatar Nov 14 '22 22:11 michellemcdaniel

@kunalspathak we support several different linux distros, not all of which may have a usable version of llvm-symbolizer. Would it be acceptable if this were only added to Ubuntu Helix machines, or do you need it everywhere? Odds are it's not going to work with some of our more unusual linuxes.

MattGal avatar Nov 14 '22 23:11 MattGal

@kunalspathak we support several different linux distros, not all of which may have a usable version of llvm-symbolizer. Would it be acceptable if this were only added to Ubuntu Helix machines, or do you need it everywhere? Odds are it's not going to work with some of our more unusual linuxes.

@hoyosjs - what do you think?

kunalspathak avatar Nov 14 '22 23:11 kunalspathak

Updating the queues the runtime uses directly would be the first priority:

  • Ubuntu.1804.Amd64.Open
  • RedHat.7.Amd64.Open
  • OSX.1200.ARM64

We'll have to evaluate the helix containers, but those are much easier to update and we've even built the toolset in some of the containers historically.

hoyosjs avatar Nov 14 '22 23:11 hoyosjs

@MattGal do you know where the symbolizer might not be available? cc: @jkoritzinsky since this might be interesting for your *SAN work

hoyosjs avatar Nov 14 '22 23:11 hoyosjs

@MattGal do you know where the symbolizer might not be available? cc: @jkoritzinsky since this might be interesting for your *SAN work

Offhand I'd venture it might not be available on old SLES or Mariner. It's one of those things we don't know until we try.

MattGal avatar Nov 15 '22 00:11 MattGal

Those don't tend to impact our priority scenario - the PR analysis checks

hoyosjs avatar Nov 15 '22 02:11 hoyosjs

PR to add them to the two linux based queues: https://dev.azure.com/dnceng/internal/_git/dotnet-helix-machines/pullrequest/27535

I think for OSX, we're going to have to get DDFun involved.

michellemcdaniel avatar Nov 16 '22 18:11 michellemcdaniel

Opened https://portal.microsofticm.com/imp/v3/incidents/details/349676322/home to get llvm added to the OSX queue.

michellemcdaniel avatar Nov 17 '22 17:11 michellemcdaniel

(Moved to tracking while we wait for DDFun to update the systems)

michellemcdaniel avatar Nov 17 '22 17:11 michellemcdaniel

(Moved to tracking while we wait for DDFun to update the systems)

@michellemcdaniel do we have a time estimate for when DDFun will update the systems?

JulieLeeMSFT avatar Nov 18 '22 20:11 JulieLeeMSFT

I do not. I know it's been assigned, but I haven't seen any movement on it. I will ping the ICM

michellemcdaniel avatar Nov 18 '22 21:11 michellemcdaniel

In general, it takes 1-2 weeks to get this many systems updated (100ish machines), and next week is Thanksgiving, so it's likely going to be at the longer end of that estimate.

michellemcdaniel avatar Nov 18 '22 21:11 michellemcdaniel

PR to add them to the two linux based queues: https://dev.azure.com/dnceng/internal/_git/dotnet-helix-machines/pullrequest/27535

Does this roll out llvm to our linux helix queues? I kicked off a run on #77578 that would consume it and still see a failure about llvm-symbolizer not being present. See https://dev.azure.com/dnceng-public/public/_build/results?buildId=94545&view=ms.vss-test-web.build-test-results-tab .

kunalspathak avatar Nov 28 '22 18:11 kunalspathak

We did not have a rollout last week due to the US holiday. The linux changes should roll out this week.

michellemcdaniel avatar Nov 28 '22 18:11 michellemcdaniel

Heads up: DDFun says the OSX queue has been updated to have llvm on it.

michellemcdaniel avatar Dec 02 '22 22:12 michellemcdaniel

I tried this out, but it seems there is still an issue.

Test Infrastructure Failure: System.ComponentModel.Win32Exception (2): An error occurred trying to start process 'llvm-symbolizer' with working directory '/private/tmp/helix/working/ADD7099B/w/A75E0909/e'. No such file or directory

kunalspathak avatar Dec 03 '22 00:12 kunalspathak

https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-77578-merge-7245de3e3bb44b4383/JIT.Stress/1/console.ba62542f.log?helixlogtype=result

kunalspathak avatar Dec 03 '22 00:12 kunalspathak
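
For illustration only: the Win32Exception above is what a process launch reports when the executable cannot be found on PATH. A minimal sketch of guarding against that, assuming a Python harness rather than the actual test infrastructure, could look like the following. The function names `find_symbolizer` and `symbolize` are hypothetical; `shutil.which` and `subprocess.run` are standard library calls, and `--obj=` is the llvm-symbolizer option for naming the binary to symbolize.

```python
import shutil
import subprocess
from typing import Optional


def find_symbolizer(name: str = "llvm-symbolizer") -> Optional[str]:
    """Return the full path of the tool if it is on PATH, else None."""
    return shutil.which(name)


def symbolize(binary: str, address: str,
              tool: str = "llvm-symbolizer") -> Optional[str]:
    """Symbolize one address, or return None if the tool is unavailable.

    Hypothetical helper: it degrades gracefully instead of raising when
    the queue's machine image does not ship the symbolizer.
    """
    path = find_symbolizer(tool)
    if path is None:
        # Tool is not installed on this machine; the caller can fall back
        # to printing the raw, unsymbolized stack trace.
        return None
    result = subprocess.run(
        [path, f"--obj={binary}", address],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout if result.returncode == 0 else None
```

With a check like this, a missing tool would show up as an unsymbolized trace rather than as a test infrastructure failure.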

@kunalspathak the job was executed in the queue osx.1200.amd64.open but the request was to install llvm in OSX.1200.ARM64 so it is expected for it to not be available in the amd64 queue. In which queue do you need it?

ulisesh avatar Dec 05 '22 22:12 ulisesh

was executed in the queue osx.1200.amd64.open but the request was to install llvm in OSX.1200.ARM64 so it is expected for it to not be available in the amd64 queue. In which queue do you need it?

I just noticed this from @hoyosjs . I think we also need it for OSX x64, right @hoyosjs ?

Updating the queues the runtime uses directly would be the first priority:

  • Ubuntu.1804.Amd64.Open
  • RedHat.7.Amd64.Open
  • OSX.1200.ARM64

kunalspathak avatar Dec 05 '22 23:12 kunalspathak

Yes, sorry - it would be needed on osx.*.*.open

hoyosjs avatar Dec 06 '22 00:12 hoyosjs
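
For illustration only: the pattern `osx.*.*.open` can be read as a glob over queue names. A small sketch, assuming case-insensitive matching (an assumption, since the thread spells queue names in mixed case) and a hypothetical helper named `needs_llvm`, shows which of the names mentioned in this thread it would cover.

```python
from fnmatch import fnmatchcase

# Pattern taken from the comment above; treating it as a glob is an assumption.
QUEUE_PATTERN = "osx.*.*.open"


def needs_llvm(queue: str, pattern: str = QUEUE_PATTERN) -> bool:
    """Return True if the queue name matches the glob, ignoring case."""
    return fnmatchcase(queue.lower(), pattern.lower())


# Queue names as they appear in this thread.
for queue in ("osx.1200.amd64.open", "OSX.1200.ARM64", "Ubuntu.1804.Amd64.Open"):
    print(queue, needs_llvm(queue))
```

Note that `OSX.1200.ARM64` has no `.open` suffix, so under this reading it would not match the pattern and would have to be listed separately.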

Results of investigation into creating a brewless LLVM artifact:

LLVM distributes a tarball of binaries for ARM64 macOS but not for amd64. The only idea I have is to produce our own tar.xz, or even our own pkg installer, of amd64 darwin binaries (either built from source or brew-installed locally). That would be a massive pain to keep up to date, though: I don't think the vendors have access to Mac hardware, and I don't know that it's reasonable to have an FTE with a Mac build and/or install llvm every three months.

cc/ @Chrisboh

jonfortescue avatar Dec 06 '22 22:12 jonfortescue

Let's add the install of this as part of the work DDFun has to do manually to set up a machine. @hoyosjs / @kunalspathak please understand that any time we need to change or update this, it will take a considerable amount of time. Do you think this is something that will need to change often?

Chrisboh avatar Dec 06 '22 23:12 Chrisboh

Barring format changes on Apple's behalf, I don't expect this to change often at all.

hoyosjs avatar Dec 07 '22 21:12 hoyosjs

Created https://portal.microsofticm.com/imp/v3/incidents/details/358905819/home to have DDFun do this for all mac open queues.

jonfortescue avatar Jan 05 '23 21:01 jonfortescue

@jonfortescue should this be closed and/or superseded by @ulisesh 's FR work?

MattGal avatar Feb 15 '23 19:02 MattGal