update: key too big for map: argument list too long: unknown

Open 113xiaoji opened this issue 1 year ago • 0 comments

After running for a while, the containerd logs are continuously reporting an error:

level=error msg="RunPodSandbox for &PodSandboxMetadata{Name:kube-scheduler-master1,Uid:4bba31f5bbd08c1ecb43f3eeca03effb,Namespace:kube-system,Attempt:221,} failed, error" error="failed to create containerd task: failed to create init process: failed to insert taskinfo for init process(id=5585c9eb3702e459fb2c73b0314e2d77670df6af8b23b0662c4032e7e328af1a, namespace=k8s.io): update: key too big for map: argument list too long: unknown"

It appears that the error is occurring during the update of an eBPF map. The following Go code seems to be involved in the issue:

// traceInitProcess checks init process is alive and starts to trace it's exit
// event by exitsnoop bpf tracepoint.
func (m *monitor) traceInitProcess(init *initProcess) (retErr error) {
	m.Lock()
	defer m.Unlock()

	fd, err := pidfd.Open(uint32(init.Pid()), 0)
	if err != nil {
		return fmt.Errorf("failed to open pidfd for %s: %w", init, err)
	}
	defer func() {
		if retErr != nil {
			unix.Close(int(fd))
		}
	}()

	// NOTE: The pid might be reused before pidfd.Open(like oom-killer or
	// manually kill), so that we need to check the runc-init's exec.fifo
	// file descriptor which is the "identity" of runc-init. :)
	//
	// Why we don't use runc-state commandline?
	//
	// The runc-state command only checks /proc/$pid/status's starttime,
	// which is not reliable. And then it only checks exec.fifo exist in
	// disk, but the runc-init has been killed. So we can't just use it.
	if err := checkRuncInitAlive(init); err != nil {
		return err
	}

	nsInfo, err := getPidnsInfo(uint32(init.Pid()))
	if err != nil {
		return fmt.Errorf("failed to get pidns info: %w", err)
	}

	if err := m.initStore.Trace(uint32(init.Pid()), &exitsnoop.TaskInfo{
		TraceID:   init.traceEventID,
		PidnsInfo: nsInfo,
	}); err != nil {
		return fmt.Errorf("failed to insert taskinfo for %s: %w", init, err)
	}
	defer func() {
		if retErr != nil {
			m.initStore.DeleteTracingTask(uint32(init.Pid()))
			m.initStore.DeleteExitedEvent(init.traceEventID)
		}
	}()

	// Before trace it, the init-process might be killed and the exitsnoop
	// tracepoint will not work, we need to check it alive again by pidfd.
	if err := fd.SendSignal(0, 0); err != nil {
		return err
	}

	if err := m.pidPoller.Add(fd, func() error {
		// TODO(fuweid): do we need to check the pid value in event?
		status, err := m.initStore.GetExitedEvent(init.traceEventID)
		if err != nil {
			init.SetExited(unexpectedExitCode)
			return fmt.Errorf("failed to get exited status: %w", err)
		}

		init.SetExited(int(status.ExitCode))
		return nil
	}); err != nil {
		return err
	}
	return nil
}

It seems that the key is not being validated properly. The key 5585c9eb3702e459fb2c73b0314e2d77670df6af8b23b0662c4032e7e328af1a is just an example, and there are other keys that also fail, such as 1ea7f8369914d19bda8da29673e4f4e037c1b39e185f6f4da0dc167539754ca2, 578193dfea54c854054abdea0a7bea11ab99e35a8d89c6469ed28084d5ab5080.

Jan 22 '24 02:01 113xiaoji