[jobmanager] Recover from job panics

Open xaionaro opened this issue 2 years ago • 2 comments

An arguable proposal (feel free to just reject it without any explanation).

The goal is to mitigate problems like this one:

```
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x2e4c403]

goroutine 17643 [running]:
osf/contest/plugins/reporters/purgatory.(*Reporter).getRackSerial(0x0?, {0xc000a72f40?, 0x0?, 0x0?})
	fbcode/osf/contest/plugins/reporters/purgatory/purgatory.go:272 +0x83
osf/contest/plugins/reporters/purgatory.(*Reporter).getFinalReport(0x312be36?, {0xc000a72f40?, 0x1, 0x1})
	fbcode/osf/contest/plugins/reporters/purgatory/purgatory.go:336 +0x19d
osf/contest/plugins/reporters/purgatory.(*Reporter).FinalReport(0xc000253088, {0xb25ae0, 0xc001168730}, {0x5c5b00?, 0xc000e19200?}, {0xc000a72f40?, 0xc00144b7e0?, 0xc00144b7e0?}, {0xb17cc0, 0xc0010c2960})
	fbcode/osf/contest/plugins/reporters/purgatory/purgatory.go:761 +0x18c
github.com/linuxboot/contest/pkg/runner.(*JobRunner).Run(0xc000270900, {0xb25ae0?, 0xc0011685a0}, 0xc000afdc20, 0x0)
	fbcode/third-party-source/go/github.com/linuxboot/contest/pkg/runner/job_runner.go:261 +0x1d83
github.com/linuxboot/contest/pkg/jobmanager.(*JobManager).runJob(0xc0001e5d90, {0xb25ae0, 0xc001629310}, 0xc000afdc20, 0xc0007dcf01?)
	fbcode/third-party-source/go/github.com/linuxboot/contest/pkg/jobmanager/start.go:110 +0x325
created by github.com/linuxboot/contest/pkg/jobmanager.(*JobManager).startJob
	fbcode/third-party-source/go/github.com/linuxboot/contest/pkg/jobmanager/start.go:85 +0x290
```

If a single job panics, there is no need for the whole instance to go down with it.
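For illustration only, here is a minimal sketch of the general idea: run each job in its own goroutine and turn a panic into an ordinary error with a deferred `recover()`. This is not the code from this change. `runJobSafely`, the `reporter` type, and the job closures are hypothetical names made up for the example, and the nil dereference only imitates the failure in the trace.

```go
package main

import (
	"fmt"
	"runtime/debug"
	"sync"
)

// reporter is a hypothetical stand-in for a plugin with a nil-pointer bug.
type reporter struct{ serial string }

// runJobSafely is a hypothetical helper (not part of ConTest). It runs fn and
// converts a panic raised in this goroutine into an error, keeping the stack
// trace so the failure can still be diagnosed.
func runJobSafely(jobID int, fn func() error) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("job %d panicked: %v\n%s", jobID, r, debug.Stack())
		}
	}()
	return fn()
}

func main() {
	jobs := map[int]func() error{
		1: func() error { return nil },
		2: func() error {
			var r *reporter
			fmt.Println(r.serial) // nil pointer dereference, imitating the trace
			return nil
		},
	}

	var wg sync.WaitGroup
	for id, fn := range jobs {
		wg.Add(1)
		go func(id int, fn func() error) {
			defer wg.Done()
			if err := runJobSafely(id, fn); err != nil {
				// Only this job is marked as failed; the process keeps running.
				fmt.Printf("job %d failed: %v\n", id, err)
				return
			}
			fmt.Printf("job %d succeeded\n", id)
		}(id, fn)
	}
	wg.Wait()
}
```

Note that `recover()` only catches panics raised in the goroutine where the deferred function runs. A panic in a goroutine spawned by a plugin would still take the whole process down unless that goroutine is wrapped as well.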

xaionaro · Jul 04 '23 15:07

Codecov Report

Patch coverage: 100.00% and project coverage change: +0.02 :tada:

Comparison is base (fa98f00) 61.77% compared to head (2ab1456) 61.80%.

:exclamation: Current head 2ab1456 differs from pull request most recent head dfc6f87. Consider uploading reports for the commit dfc6f87 to get more accurate results

Additional details and impacted files
```diff
@@             Coverage Diff             @@
##           develop     #169      +/-   ##
===========================================
+ Coverage    61.77%   61.80%   +0.02%     
===========================================
  Files          131      131              
  Lines         9228     9234       +6     
===========================================
+ Hits          5701     5707       +6     
  Misses        2855     2855              
  Partials       672      672              
```

| Flag | Coverage Δ |
|---|---|
| e2e | 49.71% <100.00%> (+0.04%) :arrow_up: |
| integration | 56.86% <100.00%> (+<0.01%) :arrow_up: |
| unittests | 46.03% <0.00%> (-0.15%) :arrow_down: |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ |
|---|---|
| pkg/jobmanager/jobmanager.go | 77.41% <100.00%> (+0.24%) :arrow_up: |
| pkg/jobmanager/start.go | 76.85% <100.00%> (+0.89%) :arrow_up: |

codecov-commenter · Jul 04 '23 15:07

What about the other possible failures? This now only handles the job start case, but the handling of other API events may fail too. Am I reading this wrong?
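To make the concern concrete, one possible way to cover every event type is to wrap all handlers with the same recovery logic at registration time. The sketch below is an assumption for discussion, not ConTest's actual API: the `handler` type, `withRecover`, and the event names are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// handler is a hypothetical stand-in for an API event handler.
type handler func(payload string) error

// withRecover is a hypothetical wrapper that returns a handler which reports
// a panic in h as an error instead of crashing the process.
func withRecover(name string, h handler) handler {
	return func(payload string) (err error) {
		defer func() {
			if r := recover(); r != nil {
				err = fmt.Errorf("handler %q panicked: %v", name, r)
			}
		}()
		return h(payload)
	}
}

func main() {
	// Hypothetical event names; each handler gets the same protection.
	handlers := map[string]handler{
		"start":  func(string) error { return nil },
		"stop":   func(string) error { panic("unexpected state") },
		"status": func(string) error { return errors.New("job not found") },
	}
	for name, h := range handlers {
		handlers[name] = withRecover(name, h)
	}

	for _, name := range []string{"start", "stop", "status"} {
		fmt.Printf("%s: %v\n", name, handlers[name]("job-1"))
	}
}
```

Wrapping at registration keeps the recovery logic in one place, so a newly added event handler cannot silently miss it.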

mimir-d · May 09 '24 20:05