Pixie agent causes OS (Anolis OS 8.10) to reboot

Open zgcmaradona opened this issue 3 months ago • 9 comments

Describe the bug: We are trying to deploy Pixie in our test environment, but the host reboots as soon as we deploy the Pixie PEM and Vizier. FYI, if we deploy only Pixie Cloud, there is no reboot issue.

PIXIE: 0.19
k8s cluster: Client Version: v1.23.0, Server Version: v1.23.0

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 16

OS: NAME="Anolis OS" VERSION="8.10"

Additional context: We have verified the same deployment on Anolis OS 8.8 and it worked fine, so is there a compatibility issue between the Pixie agent and Anolis OS 8.10? The environment where it worked:

PIXIE: 0.19
k8s cluster: Client Version: v1.23.0, Server Version: v1.23.0

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 16

OS: NAME="Anolis OS" VERSION="8.8"

zgcmaradona avatar Sep 11 '25 06:09 zgcmaradona

Image

fqsuncn avatar Sep 11 '25 06:09 fqsuncn

We aren't aware of any issues with Anolis OS, but this is the first I'm hearing of it. There also aren't any known issues that cause other OSes to reboot.

Can you provide the dmesg logs from a rebooted instance? In addition to that, if you can capture the logs of the pods in the pl namespace that would be helpful. Normally I'd suggest using px collect-logs, but that will probably be difficult given the instance is shutting down.
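One way to capture the pod logs without px collect-logs is to pull them with kubectl from a machine other than the node that is rebooting. The following is a minimal sketch, not an official Pixie tool: the helper names and the pod name vizier-pem-abcde are hypothetical, and it assumes kubectl is installed and already configured for the cluster. The --previous flag asks for the logs of the prior container instance, which may or may not still be available after a node reboot.

```python
import subprocess
from typing import List


def kubectl_logs_cmd(pod: str, namespace: str = "pl", previous: bool = False) -> List[str]:
    """Build the kubectl argv that dumps the logs of one pod."""
    cmd = ["kubectl", "logs", "--namespace", namespace, pod, "--timestamps"]
    if previous:
        # Logs of the previous container instance, if the kubelet still has them.
        cmd.append("--previous")
    return cmd


def save_pod_logs(pod: str, out_path: str, previous: bool = False) -> int:
    """Run kubectl and write stdout and stderr to out_path; returns the exit code."""
    with open(out_path, "w") as out:
        result = subprocess.run(
            kubectl_logs_cmd(pod, previous=previous),
            stdout=out,
            stderr=subprocess.STDOUT,
            check=False,
        )
    return result.returncode


# Hypothetical pod name; list the real ones with: kubectl get pods --namespace pl
print(" ".join(kubectl_logs_cmd("vizier-pem-abcde", previous=True)))
```

Calling save_pod_logs("vizier-pem-abcde", "pem.log", previous=True) for each pod in the pl namespace would leave one file per pod that can be attached to the issue.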

ddelnano avatar Sep 11 '25 16:09 ddelnano

dmesg.txt

fqsuncn avatar Sep 12 '25 07:09 fqsuncn

dmesg2.txt

fqsuncn avatar Sep 12 '25 09:09 fqsuncn

journal.zip

fqsuncn avatar Sep 15 '25 07:09 fqsuncn

In the system log, I can see the errors below after each reboot:

Sep 14 06:35:07 localhost.localdomain kernel: pci 0000:00:18.7: bridge window [io 0x1000-0x0fff] to [bus 22] add_size 1000
Sep 14 06:35:07 localhost.localdomain kernel: pci 0000:00:15.0: BAR 15: assigned [mem 0xc0000000-0xc01fffff 64bit pref]
Sep 14 06:35:07 localhost.localdomain kernel: pci 0000:00:16.0: BAR 15: assigned [mem 0xc0200000-0xc03fffff 64bit pref]
Sep 14 06:35:07 localhost.localdomain kernel: pci 0000:00:0f.0: BAR 6: assigned [mem 0xc0400000-0xc0407fff pref]
Sep 14 06:35:07 localhost.localdomain kernel: pci 0000:00:15.3: BAR 13: no space for [io size 0x1000]
Sep 14 06:35:07 localhost.localdomain kernel: pci 0000:00:15.3: BAR 13: failed to assign [io size 0x1000]
Sep 14 06:35:07 localhost.localdomain kernel: pci 0000:00:15.4: BAR 13: no space for [io size 0x1000]
Sep 14 06:35:07 localhost.localdomain kernel: pci 0000:00:15.4: BAR 13: failed to assign [io size 0x1000]
Sep 14 06:35:07 localhost.localdomain kernel: pci 0000:00:15.5: BAR 13: no space for [io size 0x1000]
Sep 14 06:35:07 localhost.localdomain kernel: pci 0000:00:15.5: BAR 13: failed to assign [io size 0x1000]
Sep 14 06:35:07 localhost.localdomain kernel: pci 0000:00:15.6: BAR 13: no space for [io size 0x1000]
Sep 14 06:35:07 localhost.localdomain kernel: pci 0000:00:15.6: BAR 13: failed to assign [io size 0x1000]
Sep 14 06:35:07 localhost.localdomain kernel: pci 0000:00:15.7: BAR 13: no space for [io size 0x1000]
Sep 14 06:35:07 localhost.localdomain kernel: pci 0000:00:15.7: BAR 13: failed to assign [io size 0x1000]
Sep 14 06:35:07 localhost.localdomain kernel: pci 0000:00:16.3: BAR 13: no space for [io size 0x1000]
Sep 14 06:35:07 localhost.localdomain kernel: pci 0000:00:16.3: BAR 13: failed to assign [io size 0x1000]
Sep 14 06:35:07 localhost.localdomain kernel: pci 0000:00:16.4: BAR 13: no space for [io size 0x1000]
Sep 14 06:35:07 localhost.localdomain kernel: pci 0000:00:16.4: BAR 13: failed to assign [io size 0x1000]
Sep 14 06:35:07 localhost.localdomain kernel: pci 0000:00:16.5: BAR 13: no space for [io size 0x1000]
Sep 14 06:35:07 localhost.localdomain kernel: pci 0000:00:16.5: BAR 13: failed to assign [io size 0x1000]
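The repeated BAR messages in the excerpt above can be condensed into one entry per PCI device, which makes it easier to compare the log across reboots. The following is a minimal sketch: the function name failed_bars and the regular expression are illustrative assumptions based only on the line format shown here. Messages of this kind are typically logged while the kernel assigns PCI resources during boot, so the summary describes the system after it came back up.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List

# Matches lines such as:
#   "... kernel: pci 0000:00:15.3: BAR 13: failed to assign [io size 0x1000]"
LINE_RE = re.compile(
    r"kernel: pci (?P<dev>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"BAR (?P<bar>\d+): (?P<msg>.+)$"
)


def failed_bars(lines: Iterable[str]) -> Dict[str, List[int]]:
    """Map each PCI device to the sorted BAR numbers the kernel failed to assign."""
    failures = defaultdict(set)
    for line in lines:
        match = LINE_RE.search(line.strip())
        if match and match.group("msg").startswith("failed to assign"):
            failures[match.group("dev")].add(int(match.group("bar")))
    return {dev: sorted(bars) for dev, bars in failures.items()}


# Three lines copied from the excerpt above.
sample = [
    "Sep 14 06:35:07 localhost.localdomain kernel: pci 0000:00:15.0: BAR 15: assigned [mem 0xc0000000-0xc01fffff 64bit pref]",
    "Sep 14 06:35:07 localhost.localdomain kernel: pci 0000:00:15.3: BAR 13: no space for [io size 0x1000]",
    "Sep 14 06:35:07 localhost.localdomain kernel: pci 0000:00:15.3: BAR 13: failed to assign [io size 0x1000]",
]
print(failed_bars(sample))  # {'0000:00:15.3': [13]}
```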

fqsuncn avatar Sep 15 '25 07:09 fqsuncn

I set the environment variable PL_STIRLING_SOURCES=kSocket, which should enable only the connector below:
case SourceConnectorGroup::kTracers: return { SocketTraceConnector::kName };

The OS still crashed during the night. So I suspect that socket_trace_connector.cc may be the root cause.

fqsuncn avatar Sep 18 '25 01:09 fqsuncn

2.log

fqsuncn avatar Sep 18 '25 01:09 fqsuncn

This is the vizier-pem log from before the OS crash.

fqsuncn avatar Sep 18 '25 01:09 fqsuncn