tools/tcpdrop.py: I was running this tool but got a PID of 0. Is it a bug?

Open linrl3 opened this issue 3 years ago • 13 comments

image

linrl3 avatar Jun 01 '21 02:06 linrl3

Hi @linrl3

I see the secondary_startup function in the stack. This may be the swapper process, whose PID is 0.

chenyuezhou avatar Jun 01 '21 03:06 chenyuezhou

Yes, I added a comm column to the output, and it shows that the process with PID 0 is named swapper.

So it's not a bug. But I wonder why swapper interacts with the network stack.

chenhengqi avatar Jun 01 '21 15:06 chenhengqi

This is indeed a strange case. Typically, the networking softirq is processed by the kernel's ksoftirqd kthread, which has a non-zero PID. But in this case, I think it happens because of the following code:

static inline void invoke_softirq(void)
{
        if (ksoftirqd_running(local_softirq_pending()))
                return;

        if (!force_irqthreads || !__this_cpu_read(ksoftirqd)) {
#ifdef CONFIG_HAVE_IRQ_EXIT_ON_IRQ_STACK
                /*
                 * We can safely execute softirq on the current stack if
                 * it is the irq stack, because it should be near empty
                 * at this stage.
                 */
                __do_softirq();
#else
                /*
                 * Otherwise, irq_exit() is called on the task stack that can
                 * be potentially deep already. So call softirq in its own stack
                 * to prevent from any overrun.
                 */
                do_softirq_own_stack();
#endif
        } else {
                wakeup_softirqd();
        }
}

It looks like the following kernel option can cause softirqs to be handled immediately after a hardware interrupt, without waking ksoftirqd.

        threadirqs      [KNL]
                        Force threading of all interrupt handlers except those
                        marked explicitly IRQF_NO_THREAD.

Another possibility is that ksoftirqd is not running on a particular CPU but a softirq is somehow triggered, so it has to be run in place... but this is less likely.

yonghong-song avatar Jun 02 '21 05:06 yonghong-song

@yonghong-song I also found the PID 0 issue in other tools such as tcprtt, after printing the PID obtained from bpf_get_current_pid_tgid.

linrl3 avatar Jun 02 '21 06:06 linrl3

The kernel handles softirqs through two paths: one is irq_exit() -> invoke_softirq, the other is ksoftirqd -> invoke_softirq. PID 0 is a normal situation: it happens when a hardware IRQ interrupts the system while the CPU is idle. In that case bpf_get_current_pid_tgid returns 0.

netedwardwu avatar Jun 02 '21 06:06 netedwardwu
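For readers decoding the value mentioned above: bpf_get_current_pid_tgid returns a single 64-bit number, with the thread group ID (what user space calls the PID) in the upper 32 bits and the kernel-level pid (the thread ID) in the lower 32 bits. The idle task has both set to 0, so the whole value is 0. The following is a minimal sketch of the decoding in plain Python; the function name split_pid_tgid is made up for illustration and is not part of bcc.

```python
from typing import Tuple


def split_pid_tgid(value: int) -> Tuple[int, int]:
    """Split a 64-bit bpf_get_current_pid_tgid() result into (tgid, pid).

    split_pid_tgid is a hypothetical helper name used only for this example.
    """
    tgid = (value >> 32) & 0xFFFFFFFF  # upper 32 bits: thread group ID (user-space PID)
    pid = value & 0xFFFFFFFF           # lower 32 bits: kernel pid (user-space thread ID)
    return tgid, pid


# A thread with TID 1240 inside the process with PID 1234.
print(split_pid_tgid((1234 << 32) | 1240))  # (1234, 1240)

# The idle (swapper) task: both halves are 0.
print(split_pid_tgid(0))  # (0, 0)
```

So a 0 in the PID column means the probe fired while the idle task was current, not that the event belongs to no process at all.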

@netedwardwu You are right. It is a normal path. My previous statement about the impact of the threadirqs kernel option is incorrect. That option sets force_irqthreads to true. But force_irqthreads is false by default, so by default __do_softirq is called after a hardware interrupt if no ksoftirqd is already running (and being interrupted).

yonghong-song avatar Jun 02 '21 06:06 yonghong-song
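The branch logic of the invoke_softirq snippet quoted earlier in the thread can be restated as a small decision function. This is only a toy model in Python of that one quoted function, not kernel code; the function name softirq_path and its parameter names are invented for the example, and the returned strings simply name which call the quoted C code would make.

```python
def softirq_path(ksoftirqd_running: bool,
                 force_irqthreads: bool,
                 has_ksoftirqd: bool,
                 irq_exit_on_irq_stack: bool = True) -> str:
    """Toy model of the quoted invoke_softirq(): which call is made?

    All names here are hypothetical; they mirror the conditions in the
    quoted C snippet one to one.
    """
    if ksoftirqd_running:
        # Early return: the already-running ksoftirqd will handle it.
        return "none (ksoftirqd already running)"
    if not force_irqthreads or not has_ksoftirqd:
        # Softirq runs inline, on top of whatever task was interrupted.
        if irq_exit_on_irq_stack:
            return "__do_softirq"
        return "do_softirq_own_stack"
    # Only with forced IRQ threading is the work handed to ksoftirqd.
    return "wakeup_softirqd"


# Default configuration: force_irqthreads is false, so the softirq runs inline.
print(softirq_path(ksoftirqd_running=False, force_irqthreads=False, has_ksoftirqd=True))

# With threadirqs set, ksoftirqd is woken instead.
print(softirq_path(ksoftirqd_running=False, force_irqthreads=True, has_ksoftirqd=True))
```

In the inline cases the softirq borrows the context of the interrupted task, which is why the current task seen by a BPF program can be the idle task or any other thread.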

@linrl3 This should be the same case as tcpdrop.py: tcp_rcv_established is called from softirq context (the packet receive path).

yonghong-song avatar Jun 02 '21 06:06 yonghong-song

@yonghong-song In this case, tcp_rcv_established is called by softirq. If the captured PID does not belong to the idle process, is it the PID of the process that created the TCP connection?

linrl3 avatar Jun 02 '21 09:06 linrl3

@linrl3 No! The PID captured in softirq context is worthless. You can test this with one process that only consumes CPU and does no network work, and another process that does the network work: you will find that the PIDs captured from softirq belong to the CPU-consuming process.

netedwardwu avatar Jun 02 '21 10:06 netedwardwu

@netedwardwu Yes, I ran a CPU-bound process alongside a bcc program that probes tcp_rcv_established. Most of the PIDs captured were the PID of the CPU-bound process.

linrl3 avatar Jun 02 '21 12:06 linrl3
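The experiment above can be illustrated with a toy simulation. It assumes a single CPU and models only one idea from this thread: a softirq handler sees whichever task happened to be current when the packet arrived, regardless of who owns the socket. Everything in it (the simulate function, the busy_fraction parameter, the PID 4242) is hypothetical and chosen for illustration; it does not measure a real system.

```python
import random
from collections import Counter

IDLE_PID = 0  # the idle (swapper) task


def simulate(runnable_pids, busy_fraction, packets, seed=0):
    """Count which PID a softirq-context probe would report per packet.

    runnable_pids: PIDs of tasks competing for the CPU (hypothetical).
    busy_fraction: fraction of time the CPU is running one of them.
    """
    rng = random.Random(seed)
    seen = Counter()
    for _ in range(packets):
        if runnable_pids and rng.random() < busy_fraction:
            current = rng.choice(runnable_pids)  # packet interrupts a running task
        else:
            current = IDLE_PID                   # packet interrupts the idle loop
        seen[current] += 1
    return seen


# A CPU hog (PID 4242) keeps the CPU fully busy: every packet is blamed on it.
print(simulate([4242], busy_fraction=1.0, packets=1000))

# Nothing is runnable: every packet is attributed to PID 0.
print(simulate([], busy_fraction=0.0, packets=1000))
```

The process that actually owns the socket never appears as an input to the model, which is the point: in softirq context the reported PID depends on scheduling, not on socket ownership.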

tcpretransmit exhibits the same behavior with both kprobe and tracepoint flavors. In most cases, the PID is 0, in some cases the PID is from a completely unrelated process.

Unrelated process example: I'm running telnet to 10.10.10.10:23 (nothing there). Most tcpretransmit events show PID 0, but some show a PID for firefox (firefox is not attempting to connect to 10.10.10.10, nor to anything on port 23).

michaelgugino avatar Jul 08 '21 15:07 michaelgugino

@michaelgugino That is true. A lot of retransmit requests are handled by the kernel's tcp_write_timer, which is executed in softirq context. The underlying process could be the idle thread (PID 0) or any other thread.

yonghong-song avatar Jul 11 '21 23:07 yonghong-song

This also happened when I used tcp_set_state. I wanted to get the user-space PID, so I used current->pid, but I got a wrong PID that is completely unrelated. So, is there a way to get the correct user-space PID and comm?

mioyui88 avatar Aug 22 '22 14:08 mioyui88
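One workaround that is not proposed anywhere in this thread, but is a common pattern, is to record the PID and comm while still in process context (for example in a probe on a function reached from the connect system call, such as tcp_v4_connect), keyed by the socket, and then look that record up from the softirq-context probe instead of trusting the current task. The sketch below models the bookkeeping in plain Python; the class and method names are hypothetical, and in a real BPF program the dictionary would be a hash map keyed by the struct sock pointer.

```python
class SockOwnerTracker:
    """Toy model: remember who opened a socket, ignore 'current' in softirq.

    All names are hypothetical and exist only for this illustration.
    """

    def __init__(self):
        self._owner_by_sock = {}  # socket key -> (pid, comm)

    def on_connect(self, sock_key, current_pid, comm):
        # Runs in process context: current really is the connecting task.
        self._owner_by_sock[sock_key] = (current_pid, comm)

    def on_softirq_event(self, sock_key, current_pid):
        # Runs in softirq context: current_pid is unreliable, so it is ignored.
        return self._owner_by_sock.get(sock_key)

    def on_close(self, sock_key):
        # Drop the entry so the map does not grow without bound.
        self._owner_by_sock.pop(sock_key, None)


tracker = SockOwnerTracker()
tracker.on_connect(sock_key=0xABC, current_pid=3210, comm="telnet")

# A retransmit fires while firefox (PID 999) happens to be on the CPU.
print(tracker.on_softirq_event(sock_key=0xABC, current_pid=999))  # (3210, 'telnet')

# A socket that was opened before tracing started has no recorded owner.
print(tracker.on_softirq_event(sock_key=0xDEF, current_pid=0))    # None
```

This is best effort only: sockets opened before tracing started are not in the map, and a socket can be inherited by or passed to another process, in which case the recorded PID is the one that opened the connection rather than the one currently using it.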