Linux NCSI kernel warnings

dev-4.7 7b7b6e87d3d94dd7813f5ef288d2be9c890dce6d on a Palmetto

[19919.800000] ftgmac100 1e660000.ethernet eth0: NCSI interface down
[19919.800000] ------------[ cut here ]------------
[19919.800000] WARNING: CPU: 0 PID: 125 at net/ncsi/ncsi-manage.c:221 ncsi_start_channel_monitor+0x48/0x94
[19919.800000] Modules linked in:
[19919.800000] CPU: 0 PID: 125 Comm: kworker/0:1 Not tainted 4.7.2 #2
[19919.800000] Hardware name: ASpeed SoC
[19919.800000] Workqueue: events ncsi_dev_work
[19919.800000] [<c01077b4>] (unwind_backtrace) from [<c010539c>] (show_stack+0x10/0x14)
[19919.800000] [<c010539c>] (show_stack) from [<c010f5f4>] (__warn+0xdc/0xf8)
[19919.800000] [<c010f5f4>] (__warn) from [<c010f704>] (warn_slowpath_null+0x1c/0x24)
[19919.800000] [<c010f704>] (warn_slowpath_null) from [<c03b6254>] (ncsi_start_channel_monitor+0x48/0x94)
[19919.800000] [<c03b6254>] (ncsi_start_channel_monitor) from [<c03b7234>] (ncsi_configure_channel+0x274/0x2b4)
[19919.800000] [<c03b7234>] (ncsi_configure_channel) from [<c03b7990>] (ncsi_dev_work+0x3a8/0x3d8)
[19919.800000] [<c03b7990>] (ncsi_dev_work) from [<c0122bc0>] (process_one_work+0x228/0x404)
[19919.800000] [<c0122bc0>] (process_one_work) from [<c01239cc>] (worker_thread+0x290/0x3e0)
[19919.800000] [<c01239cc>] (worker_thread) from [<c0127ea4>] (kthread+0xd0/0xe4)
[19919.800000] [<c0127ea4>] (kthread) from [<c01024b0>] (ret_from_fork+0x14/0x24)
[19919.800000] ---[ end trace 86be865441a71a89 ]---
[19919.800000] ftgmac100 1e660000.ethernet eth0: NCSI interface up

Sep 06 '16 10:09 shenki

unreferenced object 0xdd6a2300 (size 192):
  comm "softirq", pid 0, jiffies 36737 (age 732.540s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 70 5e de 00 00 00 00 00 00 00 00  .....p^.........
  backtrace:
    [<c02f92dc>] __build_skb+0x28/0x8c
    [<c02f9440>] __netdev_alloc_skb+0x9c/0xfc
    [<c02cc53c>] ftgmac100_poll+0x19c/0x5c0
    [<c0304d4c>] net_rx_action+0xec/0x2d0
    [<c0112658>] __do_softirq+0xc0/0x1f0
    [<c01129e0>] irq_exit+0x84/0xe8
    [<c013e7a0>] __handle_domain_irq+0x84/0xa0
    [<c010148c>] avic_handle_irq+0x5c/0x64
    [<c0105eb0>] __irq_svc+0x50/0x64
    [<c03c5a24>] _raw_spin_unlock_irqrestore+0x28/0x2c
    [<c03c5a24>] _raw_spin_unlock_irqrestore+0x28/0x2c
    [<c01af318>] scan_gray_list+0x9c/0x148
    [<c01af800>] kmemleak_scan+0x21c/0x424
    [<c01afdb0>] kmemleak_scan_thread+0x64/0xb8
    [<c0128510>] kthread+0xd0/0xe8
    [<c01024b0>] ret_from_fork+0x14/0x24

Sep 06 '16 11:09 shenki

Joel, could you provide more information on how to reproduce the issue? This issue (the unexpected backtrace) sometimes happens when the network interface is brought down and then brought back up. It's a known issue, and I'm introducing a new NCSI API (ncsi_stop_dev()) to fix it.
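A minimal model of how a down/up cycle could trip the warning. It assumes, without confirmation from the source, that the warning at ncsi-manage.c:221 fires when the channel monitor is started while a previous instance is still marked active. `struct chan_monitor`, `monitor_start()` and `monitor_stop()` are hypothetical names. `monitor_stop()` only stands in for the kind of teardown a stop hook such as the proposed ncsi_stop_dev() would need to perform; it is not that API's implementation.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical per-channel monitor state; not the real struct ncsi_channel. */
struct chan_monitor {
	bool enabled;      /* monitor timer currently armed */
	unsigned warnings; /* counts hits of the modeled WARN_ON() */
};

/* Arm the monitor. Starting an already-armed monitor is treated as a bug,
 * mirroring (by assumption) the check in ncsi_start_channel_monitor(). */
static void monitor_start(struct chan_monitor *m)
{
	if (m->enabled) {
		m->warnings++;
		fprintf(stderr, "WARNING: channel monitor already enabled\n");
	}
	m->enabled = true;
}

/* Disarm the monitor on interface down (hypothetical stop hook). */
static void monitor_stop(struct chan_monitor *m)
{
	m->enabled = false;
}
```

In this model, calling monitor_start() again after an ifdown/ifup cycle with no monitor_stop() in between increments `warnings`; calling monitor_stop() on the down path keeps it at zero.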

Sep 07 '16 04:09 gwshan

I saw the first issue twice in a row while performing a pflash operation. This is a long-running userspace task that does not involve the kernel all that much, aside from reading 32MB from a tmpfs.

Sep 07 '16 04:09 shenki

Thanks, Joel. There is a monitoring timer (one per NCSI channel) running periodically in the background to monitor the link status. If the timer is not serviced in time, the NCSI link is considered down and you see the message below. The timer is then restarted, and the link is expected to come back soon. It seems the timer wasn't serviced in time, so the link was considered down even though it was up in hardware.

ftgmac100 1e660000.ethernet eth0: NCSI interface down
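A simplified model of the behaviour described above, under stated assumptions: the monitor handler is expected to run at least every `MONITOR_TIMEOUT_MS` (an assumed value), and a late run is treated as a link loss. `struct link_monitor` and `monitor_run()` are hypothetical; the real driver's timing and state handling may differ.

```c
#include <stdbool.h>
#include <stdio.h>

/* Assumed maximum gap between two runs of the monitor handler. */
#define MONITOR_TIMEOUT_MS 1000UL

/* Hypothetical model of one channel's link monitor. */
struct link_monitor {
	unsigned long last_ms; /* when the handler last ran */
	bool link_up;          /* link state reported to the netdev */
};

/* Run the handler at time now_ms; hw_up is the state the hardware reports.
 * A late run declares the link down regardless of hw_up. Recording now_ms
 * restarts the period, so the next on-time run can bring the link back.
 * Returns the link state reported after this run. */
static bool monitor_run(struct link_monitor *m, unsigned long now_ms, bool hw_up)
{
	bool late = now_ms - m->last_ms > MONITOR_TIMEOUT_MS;
	bool new_up = late ? false : hw_up;

	if (m->link_up && !new_up)
		printf("NCSI interface down\n");
	else if (!m->link_up && new_up)
		printf("NCSI interface up\n");

	m->link_up = new_up;
	m->last_ms = now_ms; /* restart the monitor period */
	return new_up;
}
```

Starting from `{ .last_ms = 0, .link_up = true }`, runs at 500 ms, 3000 ms and 3500 ms (all with `hw_up` true) print the same down/up pair seen in the log: the 2500 ms stall, such as a busy system might cause, is enough to report the link down even though the hardware link never dropped.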

First, I need to reproduce it. There are two separate issues: (A) avoid the unexpected backtrace; (B) improve the reliability of the monitoring timer so that the reported link status is as close as possible to the actual state in hardware.
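For (B), one possible direction, sketched here purely as an assumption and not as the planned fix, is to tolerate a few consecutive late monitor runs before reporting the link down. `MONITOR_MAX_MISSES`, `struct debounced_monitor` and `debounced_run()` are hypothetical.

```c
#include <stdbool.h>

#define MONITOR_TIMEOUT_MS 1000UL /* assumed allowed gap between runs */
#define MONITOR_MAX_MISSES 3U     /* assumed late runs tolerated in a row */

/* Hypothetical monitor that debounces late runs. */
struct debounced_monitor {
	unsigned long last_ms; /* when the handler last ran */
	unsigned misses;       /* consecutive late runs */
	bool link_up;          /* link state reported to the netdev */
};

static bool debounced_run(struct debounced_monitor *m, unsigned long now_ms, bool hw_up)
{
	if (now_ms - m->last_ms > MONITOR_TIMEOUT_MS) {
		/* Late run: keep the previous state until enough late runs pile up. */
		if (++m->misses >= MONITOR_MAX_MISSES)
			m->link_up = false;
	} else {
		/* On-time run: trust the hardware state and clear the miss count. */
		m->misses = 0;
		m->link_up = hw_up;
	}
	m->last_ms = now_ms; /* restart the monitor period */
	return m->link_up;
}
```

The cost of this choice is that, while the handler keeps running late, a genuine link loss is only reported after `MONITOR_MAX_MISSES` consecutive late runs; on-time runs still reflect the hardware state immediately.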

Sep 08 '16 06:09 gwshan