openpilot failure after overtemp startup

Open jyoung8607 opened this issue 2 years ago • 2 comments

Describe the bug

Had openpilot freak out on me earlier today. Don't know if the overtemp factor is causative or merely correlated.

  • Started up in a hot environment, offroad danger zone, device stayed offroad as expected
  • After a short time driving, device went onroad but almost immediately into the onroad red zone (undesirable but expected)
  • After an additional short time of driving, the UI behaved unexpectedly:
    • Behavior like a UI crash, saw the comma logo alone for about a minute
    • UI started up, but very slowly/poor responsiveness?
    • UI reported the offroad danger temp alert, but I think it was a stale alert
    • UI ran for maybe 15-20 seconds but never went onroad
    • "crashed" again and repeated this cycle for the remainder of my drive, maybe three times total?

I actually could not terminate this easily without unplugging the C3. Shutting off the car didn't break the cycle, and I didn't quite manage to get the UI to scroll down to the reboot button before it "crashed" again.

Provide a route where the issue occurs

3cfdec54aa035f3f|2023-05-15--14-59-48

openpilot version

717bc04ddc330c43e794f28ee6ff3a287425112e

Additional info

Unmodified master as of a couple days ago. I haven't done much analysis beyond noting that no UI crash dumps were uploaded. It also looks like both forward cameras stopped encoding (fcam/ecam plus qcams) while the dcams kept going.

jyoung8607, May 15 '23 23:05

Unfortunately, your device is one of the few affected by https://github.com/commaai/openpilot/pull/25959. Blocking startup in this case was actually pretty successful in keeping the CPU/GPU <=90C. The PMIC crossed 100C within a few seconds, and I suspect that caused some of these issues, though this should be handled more gracefully.

adeebshihadeh, May 16 '23 00:05

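Although it isn't logged explicitly, we can infer from the kernel messages below that emergency thermal mitigation via CPU hotplug kicked in about 32 seconds into the drive: IRQs and processes were moved off CPU5, CPU6, and CPU7 in quick succession. Losing a substantial fraction of our compute does explain the behavior.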

(openpilot-py3.8) jyoung@DESKTOP-6JPRDTA:~/openpilot/selfdrive/debug$ ./filter_log_message.py "3cfdec54aa035f3f|2023-05-15--14-59-48" | grep affine
[153674.796061] MAIN 0 kernel - IRQ237 no longer affine to CPU5
[153674.796111] MAIN 0 kernel - IRQ238 no longer affine to CPU5
[153674.796161] MAIN 0 kernel - IRQ239 no longer affine to CPU5
[153674.796400] MAIN 0 kernel - IRQ240 no longer affine to CPU5
[153674.799377] MAIN 0 kernel - IRQ241 no longer affine to CPU5
[153674.799485] MAIN 0 kernel - IRQ242 no longer affine to CPU5
[153674.799538] MAIN 0 kernel - IRQ243 no longer affine to CPU5
[153674.799587] MAIN 0 kernel - IRQ244 no longer affine to CPU5
[153674.799635] MAIN 0 kernel - IRQ245 no longer affine to CPU5
[153674.799701] MAIN 0 kernel - IRQ565 no longer affine to CPU5
[153674.799763] MAIN 0 kernel - process 229086 (selfdrive.contr) no longer affine to cpu5
[153674.810485] MAIN 0 kernel - process 229087 (selfdrive.contr) no longer affine to cpu5
[153676.025024] MAIN 0 kernel - process 229031 (camerad) no longer affine to cpu6
[153676.025519] MAIN 0 kernel - process 229109 (RoadCamera) no longer affine to cpu6
[153676.025839] MAIN 0 kernel - process 229039 (camerad) no longer affine to cpu6
[153676.025985] MAIN 0 kernel - process 229110 (WideRoadCamera) no longer affine to cpu6
[153676.026148] MAIN 0 kernel - process 229108 (DriverCamera) no longer affine to cpu6
[153676.026302] MAIN 0 kernel - process 229107 (camerad) no longer affine to cpu6
[153676.096124] MAIN 0 kernel - process 229036 (ZMQbg/IO/0) no longer affine to cpu6
[153676.637263] MAIN 0 kernel - process 229047 (_modeld) no longer affine to cpu7
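
To make the inference above easier to repeat on other routes, here is a minimal sketch that groups the "no longer affine" kernel messages by CPU. It assumes the exact line format printed by filter_log_message.py above. The function name `summarize_hotplug` and the `SAMPLE` list are hypothetical and not part of openpilot. These messages only suggest that a core went offline; they are not an explicit hotplug log entry.

```python
import re

# Matches kernel lines in the format shown above, for example:
#   [153674.796061] MAIN 0 kernel - IRQ237 no longer affine to CPU5
#   [153676.637263] MAIN 0 kernel - process 229047 (_modeld) no longer affine to cpu7
AFFINE_RE = re.compile(
    r"^\[(?P<ts>\d+\.\d+)\].*?-\s+"
    r"(?:IRQ\d+|process \d+ \((?P<name>[^)]*)\))"
    r" no longer affine to cpu(?P<cpu>\d+)",
    re.IGNORECASE,  # the kernel prints both "CPU5" and "cpu5"
)


def summarize_hotplug(lines):
    """Group 'no longer affine' messages by CPU number.

    Returns {cpu: {"first_ts": float, "irqs": int, "procs": [str, ...]}}.
    Lines that do not match the assumed format are ignored.
    """
    summary = {}
    for line in lines:
        match = AFFINE_RE.search(line.strip())
        if match is None:
            continue
        cpu = int(match["cpu"])
        ts = float(match["ts"])
        entry = summary.setdefault(cpu, {"first_ts": ts, "irqs": 0, "procs": []})
        entry["first_ts"] = min(entry["first_ts"], ts)
        if match["name"] is None:
            entry["irqs"] += 1  # an IRQ was migrated off this CPU
        else:
            entry["procs"].append(match["name"])  # a process lost its affinity
    return summary


# A few lines copied from the output above, used only as a demonstration.
SAMPLE = [
    "[153674.796061] MAIN 0 kernel - IRQ237 no longer affine to CPU5",
    "[153674.799763] MAIN 0 kernel - process 229086 (selfdrive.contr) no longer affine to cpu5",
    "[153676.025024] MAIN 0 kernel - process 229031 (camerad) no longer affine to cpu6",
    "[153676.637263] MAIN 0 kernel - process 229047 (_modeld) no longer affine to cpu7",
]

summary = summarize_hotplug(SAMPLE)
for cpu in sorted(summary):
    info = summary[cpu]
    print(f"cpu{cpu}: first seen at {info['first_ts']}, "
          f"{info['irqs']} IRQs, processes: {info['procs']}")
```

To run it on a whole route, the same function could be fed the lines of the grep output instead of `SAMPLE`, for example by saving that output to a file first.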

> Unfortunately, your device is one of the few affected by https://github.com/commaai/openpilot/pull/25959.

How many is a few? Rather than spend time on software, does it make sense for me to send this device in for rework?

jyoung8607, May 16 '23 14:05

@adeebshihadeh do you still consider this an open issue? All the misbehaviors probably trace back to the CPU hotplug thermal mitigation, and several recent updates to thermald make it less likely we'll reach that point.

jyoung8607, Jun 19 '23 13:06

Yes, I’d still like to handle this specific scenario better.

adeebshihadeh, Jun 19 '23 14:06

I thought about it more and don't think a specific check makes sense. We already check the things that matter (processes lagging, crashing, etc.).

adeebshihadeh, Jan 09 '24 01:01