bmcweb
QEMU:Romulus: 500 returned on /redfish/v1/Systems/system/ during CI testing
Is this the right place to submit this?
- [X] This is not a security vulnerability or a crashing bug
- [X] This is not a question about how to use OpenBMC
- [X] This is not a bug in an OpenBMC fork or a bug in code still under code review.
- [X] This is not a request for a new feature.
Bug Description
With the latest master we are seeing fairly consistent failures on two of our Jenkins nodes (builder_google_c0 and builder_google_c1) when running the QEMU Romulus portion of CI testing (https://jenkins.openbmc.org/job/CI-MISC/job/run-ci-in-qemu/).
The issue we are hitting seems to be timing related, so it is not really specific to the software levels on the Jenkins nodes, just to how things boot up during testing.
The test team recently added some additional tests to CI, which seemed to coincide with when we started hitting this: https://gerrit.openbmc.org/c/openbmc/openbmc-test-automation/+/67377
Here's the failure seen in CI:
#(UTC) 2023/11/07 17:28:35.342271 - 0.583137 - Executing: get('/redfish/v1/Managers/bmc')
#(UTC) 2023/11/07 17:29:00.528295 - 25.186024 - Executing: get('/redfish/v1/Chassis/chassis')
#(UTC) 2023/11/07 17:29:00.665010 - 0.136715 - Executing: get('/redfish/v1/Systems/system/')
state:
[redfish]: 0
[ping]: 1
[uptime]: 155.82
#(UTC) 2023/11/07 17:29:01.260796 - 0.595786 - **ERROR** redfish commands to the BMC are failing.
#(UTC) 2023/11/07 17:29:01.264644 - 0.003848 - Collecting FFDC.
...
GET Redfish Resources With Login :: Login to BMCweb and GET valid ... | FAIL |
ValueError: The HTTP status code was not valid:
status: 500
valid_status_codes:
[0]: 200
Here are the relevant logs from bmcweb:
Nov 07 17:28:01 romulus bmcweb[209]: [INFO http_connection.hpp:239] Request: 0x2617160 HTTP/1.1 PATCH /redfish/v1/Managers/bmc/NetworkProtocol ::ffff:172.17.0.3
Nov 07 17:28:01 romulus phosphor-time-manager[366]: Time mode has been changed to xyz.openbmc_project.Time.Synchronization.Method.Manual
Nov 07 17:28:01 romulus bmcweb[209]: [INFO complete_response_fields.hpp:23] Response: /redfish/v1/Managers/bmc/NetworkProtocol 204
Nov 07 17:28:01 romulus bmcweb[209]: [INFO http_connection.hpp:239] Request: 0x2617160 HTTP/1.1 GET /redfish/v1/Managers/bmc ::ffff:172.17.0.3
Nov 07 17:28:01 romulus systemd-timedated[373]: systemd-timesyncd.service: Disabling unit.
Nov 07 17:28:01 romulus bmcweb[209]: [INFO complete_response_fields.hpp:23] Response: /redfish/v1/Managers/bmc 200
Nov 07 17:28:01 romulus bmcweb[209]: [INFO http_connection.hpp:239] Request: 0x2617160 HTTP/1.1 PATCH /redfish/v1/Managers/bmc/NetworkProtocol ::ffff:172.17.0.3
Nov 07 17:28:02 romulus phosphor-network-manager[259]: Wrote networkd file: /etc/systemd/network/00-bmc-eth0.network
Nov 07 17:28:02 romulus bmcweb[209]: [INFO complete_response_fields.hpp:23] Response: /redfish/v1/Managers/bmc/NetworkProtocol 204
Nov 07 17:28:02 romulus bmcweb[209]: [INFO http_connection.hpp:239] Request: 0x2617160 HTTP/1.1 PATCH /redfish/v1/Managers/bmc/NetworkProtocol ::ffff:172.17.0.3
Nov 07 17:28:02 romulus bmcweb[209]: [INFO complete_response_fields.hpp:23] Response: /redfish/v1/Managers/bmc/NetworkProtocol 204
Nov 07 17:28:02 romulus bmcweb[209]: [INFO http_connection.hpp:239] Request: 0x2617160 HTTP/1.1 PATCH /redfish/v1/Managers/bmc/NetworkProtocol ::ffff:172.17.0.3
Nov 07 17:28:02 romulus bmcweb[209]: [INFO complete_response_fields.hpp:23] Response: /redfish/v1/Managers/bmc/NetworkProtocol 204
Nov 07 17:28:02 romulus bmcweb[209]: [INFO http_connection.hpp:239] Request: 0x2617160 HTTP/1.1 DELETE /redfish/v1/SessionService/Sessions/nGCcc1bscD ::ffff:172.17.0.3
Nov 07 17:28:02 romulus bmcweb[209]: [INFO complete_response_fields.hpp:23] Response: /redfish/v1/SessionService/Sessions/nGCcc1bscD 200
Nov 07 17:28:02 romulus bmcweb[209]: [ERROR http_connection.hpp:429] 0x2617160 Error while reading: stream truncated
Nov 07 17:28:02 romulus bmcweb[209]: [INFO http_connection.hpp:239] Request: 0x2633810 HTTP/1.1 GET /redfish/v1/ ::ffff:172.17.0.3
Nov 07 17:28:02 romulus bmcweb[209]: [INFO complete_response_fields.hpp:23] Response: /redfish/v1/ 200
Nov 07 17:28:02 romulus bmcweb[209]: [INFO http_connection.hpp:239] Request: 0x2633810 HTTP/1.1 POST /redfish/v1/SessionService/Sessions ::ffff:172.17.0.3
Nov 07 17:28:03 romulus bmcweb[209]: [INFO complete_response_fields.hpp:23] Response: /redfish/v1/SessionService/Sessions 201
Nov 07 17:28:03 romulus bmcweb[209]: [INFO http_connection.hpp:239] Request: 0x2633810 HTTP/1.1 DELETE /redfish/v1/SessionService/Sessions/tLJWNhSgy3 ::ffff:172.17.0.3
Nov 07 17:28:03 romulus bmcweb[209]: [INFO complete_response_fields.hpp:23] Response: /redfish/v1/SessionService/Sessions/tLJWNhSgy3 200
Nov 07 17:28:03 romulus bmcweb[209]: [INFO http_connection.hpp:239] Request: 0x2653e50 HTTP/1.1 GET /redfish/v1/SessionService ::ffff:172.17.0.3
Nov 07 17:28:03 romulus bmcweb[209]: [WARNING http_connection.hpp:260] Authentication failed
Nov 07 17:28:03 romulus bmcweb[209]: [INFO complete_response_fields.hpp:23] Response: /redfish/v1/SessionService 401
Nov 07 17:28:03 romulus bmcweb[209]: [INFO http_connection.hpp:239] Request: 0x2653e50 HTTP/1.1 POST /redfish/v1/SessionService/Sessions ::ffff:172.17.0.3
Nov 07 17:28:04 romulus bmcweb[209]: [INFO complete_response_fields.hpp:23] Response: /redfish/v1/SessionService/Sessions 201
Nov 07 17:28:04 romulus bmcweb[209]: [INFO http_connection.hpp:239] Request: 0x2653e50 HTTP/1.1 GET /redfish/v1/SessionService ::ffff:172.17.0.3
Nov 07 17:28:04 romulus bmcweb[209]: [INFO complete_response_fields.hpp:23] Response: /redfish/v1/SessionService 200
Nov 07 17:28:04 romulus bmcweb[209]: [INFO http_connection.hpp:239] Request: 0x2653e50 HTTP/1.1 GET /redfish/v1/AccountService ::ffff:172.17.0.3
Nov 07 17:28:04 romulus systemd[1]: Reloading requested from client PID 373 ('systemd-timedat') (unit systemd-timedated.service)...
Nov 07 17:28:04 romulus systemd[1]: Reloading...
Nov 07 17:28:04 romulus bmcweb[209]: [INFO complete_response_fields.hpp:23] Response: /redfish/v1/AccountService 200
Nov 07 17:28:04 romulus bmcweb[209]: [INFO http_connection.hpp:239] Request: 0x2653e50 HTTP/1.1 GET /redfish/v1/Systems/system ::ffff:172.17.0.3
Nov 07 17:28:04 romulus systemd[1]: Watchdog: setting pretimeout_governor to 'panic' via '/sys/dev/char/247:0/pretimeout_governor'
Nov 07 17:28:04 romulus systemd[1]: Failed to set pretimeout_governor to 'panic': No such file or directory
Nov 07 17:28:04 romulus systemd[1]: Failed to set watchdog pretimeout governor to 'panic', ignoring: No such file or directory
Nov 07 17:28:05 romulus phosphor-network-manager[259]: Reloaded systemd-networkd
Nov 07 17:28:05 romulus systemd-networkd[157]: eth0: found matching network '/etc/systemd/network/00-bmc-eth0.network', based on potentially unpredictable interface name.
Nov 07 17:28:05 romulus systemd-networkd[157]: eth0: Reconfiguring with /etc/systemd/network/00-bmc-eth0.network.
Nov 07 17:28:05 romulus systemd-networkd[157]: eth0: DHCP lease lost
Nov 07 17:28:05 romulus avahi-daemon[208]: Withdrawing address record for 10.0.2.15 on eth0.
Nov 07 17:28:05 romulus systemd-networkd[157]: eth0: DHCPv6 lease lost
Nov 07 17:28:05 romulus avahi-daemon[208]: Leaving mDNS multicast group on interface eth0.IPv4 with address 10.0.2.15.
Nov 07 17:28:05 romulus avahi-daemon[208]: Interface eth0.IPv4 no longer relevant for mDNS.
Nov 07 17:28:05 romulus avahi-daemon[208]: Withdrawing address record for fec0::58b7:8bff:fedf:7311 on eth0.
Nov 07 17:28:05 romulus avahi-daemon[208]: Leaving mDNS multicast group on interface eth0.IPv6 with address fec0::58b7:8bff:fedf:7311.
Nov 07 17:28:05 romulus avahi-daemon[208]: Joining mDNS multicast group on interface eth0.IPv6 with address fe80::58b7:8bff:fedf:7311.
Nov 07 17:28:05 romulus avahi-daemon[208]: Registering new address record for fe80::58b7:8bff:fedf:7311 on eth0.*.
Nov 07 17:28:06 romulus nscd[218]: 218 ignored inotify event for `/etc/resolv.conf` (file exists)
Nov 07 17:28:06 romulus nscd[218]: 218 ignored inotify event for `/etc/resolv.conf` (file exists)
Nov 07 17:28:06 romulus systemd-networkd[157]: eth0: DHCPv4 address 10.0.2.15/24, gateway 10.0.2.2 acquired from 10.0.2.2
Nov 07 17:28:06 romulus avahi-daemon[208]: Joining mDNS multicast group on interface eth0.IPv4 with address 10.0.2.15.
Nov 07 17:28:06 romulus avahi-daemon[208]: New relevant interface eth0.IPv4 for mDNS.
Nov 07 17:28:06 romulus avahi-daemon[208]: Registering new address record for 10.0.2.15 on eth0.IPv4.
Nov 07 17:28:07 romulus avahi-daemon[208]: Leaving mDNS multicast group on interface eth0.IPv6 with address fe80::58b7:8bff:fedf:7311.
Nov 07 17:28:07 romulus avahi-daemon[208]: Joining mDNS multicast group on interface eth0.IPv6 with address fec0::58b7:8bff:fedf:7311.
Nov 07 17:28:07 romulus avahi-daemon[208]: Registering new address record for fec0::58b7:8bff:fedf:7311 on eth0.*.
Nov 07 17:28:07 romulus avahi-daemon[208]: Withdrawing address record for fe80::58b7:8bff:fedf:7311 on eth0.
Nov 07 17:28:23 romulus systemd[1]: /usr/lib/systemd/system/org.open_power.OCC.Control.service:7: Failed to add dependency on xyz.openbmc_project.PLDM, ignoring: Invalid argument
Nov 07 17:28:23 romulus systemd[1]: /usr/lib/systemd/system/org.open_power.OCC.Control.service:8: Failed to add dependency on xyz.openbmc_project.PLDM, ignoring: Invalid argument
Nov 07 17:28:24 romulus systemd[1]: /usr/lib/systemd/system/[email protected]:6: Invalid interface name, ignoring: sys-subsystem-net-devices-%i.device
Nov 07 17:28:26 romulus phosphor-time-manager[366]: Failed to update NTP setting: sd_bus_call noreply: org.freedesktop.DBus.Error.Timeout: Connection timed out
Nov 07 17:28:26 romulus phosphor-time-manager[366]: Time mode has been changed to xyz.openbmc_project.Time.Synchronization.Method.NTP
Nov 07 17:28:29 romulus systemd-timedated[373]: systemd-timesyncd.service: Stopping unit: Invalid argument
Nov 07 17:28:29 romulus systemd[1]: Reloading finished in 25292 ms.
Nov 07 17:28:32 romulus bmcweb[209]: [ERROR redfish_util.hpp:113] Connection timed out [generic:110]
Nov 07 17:28:32 romulus bmcweb[209]: [ERROR systems.hpp:3170] DBUS response error Connection timed out [generic:110]
Nov 07 17:28:32 romulus bmcweb[209]: [CRITICAL error_messages.cpp:284] Internal Error /usr/src/debug/bmcweb/1.0+git-r0/redfish-core/lib/systems.hpp(3171:32) `void redfish::afterPortRequest(const std::shared_ptr<bmcweb::AsyncResp>&, const boost::system::error_code&, const
The issue appears to be that the test case changes the BMC time mode from NTP to manual, which results in systemd-timedated doing a full systemctl daemon-reload. During that reload the test case queries a Redfish endpoint that requires a systemd D-Bus API. That API times out (due to the ongoing reload) and an InternalError goes back to the test case, causing the failure.
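For what it's worth, the window can probably be reproduced by hand with two raw Redfish requests back to back. Here is a minimal sketch; the host, port, and credentials are assumptions for a local QEMU setup, not values from the CI config:

```python
#!/usr/bin/env python3
"""Sketch: race a GET against the daemon-reload triggered by the NTP
change. Host, port, and credentials below are assumptions."""
import requests
import urllib3

urllib3.disable_warnings()  # the BMC uses a self-signed certificate

BMC = "https://localhost:2443"

session = requests.Session()
session.verify = False
session.auth = ("root", "0penBmc")

# Flip the time mode. On the failing nodes this causes systemd-timedated
# to kick off a systemctl daemon-reload that takes ~25 seconds.
resp = session.patch(
    f"{BMC}/redfish/v1/Managers/bmc/NetworkProtocol",
    json={"NTP": {"ProtocolEnabled": False}},
)
print("PATCH NetworkProtocol:", resp.status_code)

# Immediately query an endpoint that needs a systemd D-Bus API. While
# the reload is still in flight, this is where the 500 comes back.
resp = session.get(f"{BMC}/redfish/v1/Systems/system")
print("GET Systems/system:", resp.status_code)
```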
Version
https://jenkins.openbmc.org/job/CI-MISC/job/run-ci-in-qemu/15320/console is an example run of this. It was running against the most recent master at the time (https://jenkins.openbmc.org/job/latest-master/label=docker-builder,target=romulus/lastSuccessfulBuild/artifact/openbmc/build/tmp/deploy/images/romulus/obmc-phosphor-image-romulus.static.mtd)
Additional Information
A full journal can be found at https://jenkins.openbmc.org/job/CI-MISC/job/run-ci-in-qemu/15320/artifact/logs/20231107172901269852_RedfishIpmiRedfishServiceRootTestServiceRoot/20231107172901269852_GETRedfishResourcesWithLogin/20231107172901269852_BMC_journalctl_nopager.txt (until it ages out of Jenkins). A new one can be generated if we lose it.
We can ask the test team to put a delay or retry in, but it does seem like a valid bug. I'm just not sure what the fix would be. Retry on systemd timeouts? Query systemd status first? It would be nice if, instead of an InternalError, we returned a "retry in 10s" type response on this kind of error. If systemd is really down, then the system is going to have bigger issues anyway.
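In HTTP terms, a "retry in 10s" response would be a 503 with a Retry-After header rather than today's 500. As a sketch of what honoring that could look like from the client side (hypothetical behavior; bmcweb does not currently return 503 for D-Bus timeouts):

```python
import time

import requests


def get_with_retry(session: requests.Session, url: str, attempts: int = 5):
    """Sketch of a client honoring a hypothetical 503 + Retry-After
    response, if bmcweb returned one instead of the current 500."""
    resp = None
    for _ in range(attempts):
        resp = session.get(url)
        if resp.status_code != 503:
            return resp
        # Retry-After is the standard HTTP hint for "try again later";
        # fall back to 10 seconds if the server omits it.
        time.sleep(int(resp.headers.get("Retry-After", "10")))
    return resp
```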
> That API times out (due to the ongoing reload) and an InternalError goes back to the test case, causing the failure.
So this is some form of systemd issue. I'm not sure that bmcweb is the right place to track this bug, but I'm happy to move it over if that's what we decide.
> We can ask the test team to put a delay or retry in,
This doesn't sound right to me. User code won't have delays or retries, so adding them in test code seems like we're just making the tests pass at the expense of users. The tests found a real failure (unfortunately an architecture-level failure) and we need to fix it.
> It would be nice if, instead of an InternalError, we returned a "retry in 10s" type response
bmcweb does that in some cases, for example where a firmware update is already in progress, but those are internal bmcweb limitations. As far as I know, D-Bus doesn't have a "try again later" error that's distinct from "internal failure".
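For reference, the generic:110 in the bmcweb logs above is plain ETIMEDOUT, so the errno carried in the error code is about the only retry signal available today; a trivial sketch:

```python
import errno


def looks_retryable(err_no: int) -> bool:
    """Sketch: with no distinct D-Bus "busy, try later" error, the errno
    in the error_code is the only hint a caller can key off."""
    return err_no in (errno.ETIMEDOUT, errno.EAGAIN)


# On Linux, ETIMEDOUT is 110 -- the generic:110 seen in redfish_util.hpp:113.
assert looks_retryable(110)
```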
IMO, we need to get to the bottom of why systemd is returning an error and solve that issue. I'm assuming we can get all the nodes, or a real BMC, to reproduce this failure by simply tweaking the timeout down and watching it fail. Doing a full systemctl daemon-reload seems like overkill when making a minor NTP change. I really suspect someone needs to profile what we're doing there (a rough timing sketch follows) and determine:
A. whether there's a less computationally expensive way to enable NTP (a D-Bus property API?), and
B. whether we're doing something weird that's causing daemon-reload to be much slower.
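One way to start profiling would be to time the org.freedesktop.timedate1 SetNTP call on a live BMC. SetNTP(b use_ntp, b interactive) is the standard systemd-timedated method; that it is the exact path phosphor-time-manager drives here is an assumption from the journal:

```python
import subprocess
import time


def time_set_ntp(enable: bool) -> float:
    """Time the org.freedesktop.timedate1 SetNTP call via busctl; this
    is what kicks off the unit enable/disable and daemon-reload."""
    start = time.monotonic()
    subprocess.run(
        ["busctl", "call",
         "org.freedesktop.timedate1", "/org/freedesktop/timedate1",
         "org.freedesktop.timedate1", "SetNTP",
         "bb", str(enable).lower(), "false"],
        check=True,
    )
    return time.monotonic() - start


print(f"SetNTP took {time_set_ntp(False):.1f}s")
```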
I was kind of hoping that with the recent subtree updates and test case refactoring we might get lucky and see this go away, but no: it still consistently recreates on our Google servers. The latest journal is at https://jenkins.openbmc.org/job/CI-MISC/job/run-ci-in-qemu/15486/artifact/logs/20231128165512979301_RedfishIpmiRedfishServiceRootTestServiceRoot/20231128165512979301_GETRedfishResourcesWithLogin/20231128165512979301_BMC_journalctl_nopager.txt
Nov 28 16:55:07 romulus systemd[1]: Reloading finished in 25706 ms.
Nov 28 16:55:10 romulus bmcweb[209]: [ERROR managers.hpp:2056] Error while getting progress
Nov 28 16:55:10 romulus bmcweb[209]: [CRITICAL error_messages.cpp:284] Internal Error /usr/src/debug/bmcweb/1.0+git/redfish-core/lib/managers.hpp(2057:40) `redfish::requestRoutesManager(App&)::<lambda(const crow::Request&, const std::shared_ptr<bmcweb::AsyncResp>&)>::<lambda(const boost::system::error_code&, double)>`:
Nov 28 16:55:10 romulus bmcweb[209]: [INFO complete_response_fields.hpp:23] Response: /redfish/v1/Managers/bmc 500
Looking at what systemd is doing, here are the key events:
Nov 28 16:54:14 romulus systemd[1]: Startup finished in 6.534s (kernel) + 1min 54.759s (userspace) = 2min 1.294s.
...
Nov 28 16:54:35 romulus systemd[1]: Starting Time & Date Service...
...
Nov 28 16:54:38 romulus systemd[1]: Started Time & Date Service.
...
Nov 28 16:54:38 romulus systemd-timedated[531]: systemd-timesyncd.service: Disabling unit.
Nov 28 16:54:39 romulus phosphor-network-manager[262]: Reloaded systemd-networkd
...
Nov 28 16:54:41 romulus systemd[1]: Reloading requested from client PID 531 ('systemd-timedat') (unit systemd-timedated.service)...
Nov 28 16:54:41 romulus systemd[1]: Reloading..
...
# This looks to be what was hanging up the reload. Or is it a sign that the systemd D-Bus connection is hung?
Nov 28 16:55:00 romulus phosphor-time-manager[367]: Failed to update NTP setting: sd_bus_call noreply: org.freedesktop.DBus.Error.Timeout: Connection timed out
Nov 28 16:55:00 romulus phosphor-time-manager[367]: Time mode has been changed to xyz.openbmc_project.Time.Synchronization.Method.NTP
...
Nov 28 16:55:07 romulus systemd[1]: Reloading finished in 25706 ms.
Nov 28 16:55:10 romulus bmcweb[209]: [ERROR managers.hpp:2056] Error while getting progress
...
Unfortunately it appears we've still got this one. It happens only on the Google-provided CI machines for some reason, and only intermittently. We're not getting debug data on the failures (I have a ping out to the test team on that), so I only have the console, but the symptoms are similar, except now it's on the actual PATCH to NetworkProtocol.
https://jenkins.openbmc.org/job/CI-MISC/job/run-ci-in-qemu/17314/console is our most recent failure.
#(UTC) 2024/04/19 20:14:45.744433 - 0.012757 - Executing: patch('/redfish/v1/Managers/bmc/NetworkProtocol', body="{'NTP':{'ProtocolEnabled': False}}", valid_...
Verify Redfish BMC Time :: Verify that date/time obtained via redf...
Verify Redfish BMC Time :: Verify that date/time obtained via redf... | FAIL |
Parent suite setup failed:
ValueError: The HTTP status code was not valid:
status: 500
valid_status_codes:
[0]: 200
[1]: 204