zephyr icon indicating copy to clipboard operation
zephyr copied to clipboard

Issue with SMP commands sent over the UART

Open nordicjm opened this issue 3 years ago • 20 comments

Describe the bug We are using an nRF52840-based module in our design and a firmware based on the nRF connect SDK versions 1.5 to 1.7, in this we are seeing what appears to be an issue in zephyr 2.6 and 2.7 which was not present in 2.5. We have a shell configured, with logging also, on the UART (UARTE peripheral operating at 115200 baud) which is connected to a PC, we have the mcumgr shell transport added and have an automated test system which sends commands and checks responses which runs for about 30 minutes. When our project used nRF connect SDK 1.5 and zephyr 2.5, this was flawless and I've re-loaded that firmware and tested it and it seems perfect, however in our nRF connect SDK 1.6 version (and I've tried with the latest nRF connect SDK 1.7 which uses almost top of the tree of zephyr) I have a PC program connected to the unit which fires off an SMP command, waits for a response, then repeats, and it does this continuously - on the 2.5 version of the application, this runs for a long time with no issue, on the 2.6/2.7 version it works for a number of cycles and then stops, it stops because it does not get a response from the module. At this point, the test can be stopped and an enter send over the UART to which the expected response will then come back from the module, it can then be restarted and will then fail again within seconds. I've tried moving logging to RTT to see if that had any impact, it didn't. I've tried removing logging entirely, no difference. Tried moving from the SMP over shell interface to SMP over UART directly, seems to make it take longer to happen but it does still happen. I've tried applying https://github.com/zephyrproject-rtos/zephyr/commit/01a01de5947d3a72b7c3602e8564f8e6c88440d6#diff-ec8a0c586e8f69d0e76e1ddf8dc6b7f1ecc5c9208a4eba0b232f3332ca7d9980 to nRF connect SDK 1.7 (because it's not included) and it does not fix it.

Now in all times I have tested this, it seems to receive the command but does not give me a response, however one time I tested it seemed to just completely break the whole system, it would then no longer understand messages I sent in and replied with an rc = EINVAL value, I was only able to get out of this by rebooting the module.

So I've added debugging and what seems to happen is smp_shell_rx_bytes gets called in subsys/mgmt/mcumgr/smp_shell.c many times (one per character received), it gets into the "byte == '\n'" part after it gets the full command - at this point we have the difference in operation of when it works vs when it stops working, when this bug hits, that seems to be it, none of the SMP functions seem to be called, even though the full SMP command has been received. Prior to the bug happening, smp_process_request_packet is called in the external mcumgr library in modules/lib/mcumgr/smp/src/smp.c which then calls zephyr_smp_tx_rsp in subsys/mgmt/mcumgr/smp.c which then calls smp_shell_tx_pkt and then smp_shell_tx_raw.

To Reproduce Unfortunately we are testing this as a complete application so do not have a minimal example for reproduction

Expected behavior For the command to be successfully passed to mcumgr each time

Impact This depends, because I have seen this manifest in 2 ways, with one then respond only with an EINVALUE even to valid commands, this makes us pretty nervous and think that some kind of buffer overflow might be happening, if that's the cause then this is a showstopper because that would be very undesired behaviour, if that isn't the case then this is an annoyance

Logs and console output I added some debugging to the build but it's probably not going to look very good here so I apologise for that in advance:

  • AA = smp_shell_rx_bytes at "/* Newline in payload means complete frame */"
  • BB = smp_shell_rx_bytes at "byte == '\n') {"
  • 1 = smp_process_request_packet at "rc = mgmt_streamer_init_reader(&streamer->mgmt_stmr, req);"
  • 2 = smp_process_request_packet at "rc = smp_read_hdr(streamer, &req_hdr);" and the values are req_hdr.nh_len and rc
  • 3 = smp_process_request_packet at "rsp = mgmt_streamer_alloc_rsp(&streamer->mgmt_stmr, req);"
  • 4 = smp_process_request_packet at "rc = mgmt_streamer_init_writer(&streamer->mgmt_stmr, rsp);"
  • 5 = smp_process_request_packet at "rc = smp_handle_single_req(streamer, &req_hdr, &handler_found);"
  • 6 = smp_process_request_packet at "rc = streamer->tx_rsp_cb(streamer, rsp, streamer->mgmt_stmr.cb_arg);"
  • 7 = smp_process_request_packet at "mgmt_streamer_trim_front(&streamer->mgmt_stmr, req,"
  • 8 = smp_process_request_packet after "cmd_done_arg.err = MGMT_ERR_EOK;"
  • 9 = smp_process_request_packet after the mgmt_evt in "if (rc != 0 && valid_hdr)"
  • 99 = zephyr_smp_tx_rsp
  • QQ = smp_shell_tx_pkt at "mcumgr_serial_tx_pkt"
  • DD = smp_shell_tx_raw at "return 0"

This is one where it is working:

00> [00:00:21.524,230] <err> smp_shell: AA
00> [00:00:21.524,291] <err> smp_shell: AA
00> [00:00:21.524,383] <err> smp_shell: AA
00> [00:00:21.524,475] <err> smp_shell: AA
00> [00:00:21.524,566] <err> smp_shell: AA
00> [00:00:21.524,627] <err> smp_shell: AA
00> [00:00:21.524,719] <err> smp_shell: AA
00> [00:00:21.524,810] <err> smp_shell: AA
00> [00:00:21.524,902] <err> smp_shell: AA
00> [00:00:21.524,993] <err> smp_shell: AA
00> [00:00:21.525,085] <err> smp_shell: AA
00> [00:00:21.525,177] <err> smp_shell: AA
00> [00:00:21.525,238] <err> smp_shell: AA
00> [00:00:21.525,329] <err> smp_shell: AA
00> [00:00:21.525,421] <err> smp_shell: AA
00> [00:00:21.525,512] <err> smp_shell: AA
00> [00:00:21.525,604] <err> smp_shell: AA
00> [00:00:21.525,695] <err> smp_shell: AA
00> [00:00:21.525,848] <err> smp_shell: AA
00> [00:00:21.525,909] <err> smp_shell: AA
00> [00:00:21.525,939] <err> smp_shell: AA
00> [00:00:21.526,031] <err> smp_shell: AA
00> [00:00:21.526,123] <err> smp_shell: AA
00> [00:00:21.526,214] <err> smp_shell: AA
00> [00:00:21.526,275] <err> smp_shell: AA
00> [00:00:21.526,367] <err> smp_shell: AA
00> [00:00:21.526,458] <err> smp_shell: AA
00> [00:00:21.526,550] <err> smp_shell: AA
00> [00:00:21.526,641] <err> smp_shell: AA
00> [00:00:21.526,733] <err> smp_shell: AA
00> [00:00:21.526,794] <err> smp_shell: AA
00> [00:00:21.526,885] <err> smp_shell: AA
00> [00:00:21.526,977] <err> smp_shell: AA
00> [00:00:21.527,099] <err> smp_shell: AA
00> [00:00:21.527,160] <err> smp_shell: AA
00> [00:00:21.527,160] <err> smp_shell: BB
00> [00:00:21.551,208] <err> mcusmp: 1
00> [00:00:21.551,269] <err> mcusmp: 2 2816 0
00> [00:00:21.551,269] <err> mcusmp: 3
00> [00:00:21.551,330] <err> mcusmp: 4
00> [00:00:21.551,330] <err> mcusmp: 5
00> [00:00:21.552,093] <dbg> attr: Show: [061] configVersion                 223
00> [00:00:21.552,612] <err> mcusmp: 6
00> [00:00:21.552,642] <err> mcuoth: 99
00> [00:00:21.552,673] <err> smp_shell: QQ
00> [00:00:21.552,825] <err> smp_shell: DD
00> [00:00:21.553,192] <err> smp_shell: DD
00> [00:00:21.553,527] <err> smp_shell: DD
00> [00:00:21.553,863] <err> smp_shell: DD
00> [00:00:21.554,229] <err> smp_shell: DD
00> [00:00:21.554,565] <err> smp_shell: DD
00> [00:00:21.554,901] <err> smp_shell: DD
00> [00:00:21.555,236] <err> smp_shell: DD
00> [00:00:21.555,603] <err> smp_shell: DD
00> [00:00:21.555,938] <err> smp_shell: DD
00> [00:00:21.556,030] <err> smp_shell: DD
00> [00:00:21.556,060] <err> mcusmp: 7
00> [00:00:21.556,091] <err> mcusmp: 8
00> [00:00:21.556,121] <err> mcusmp: 1
00> [00:00:21.556,121] <err> mcusmp: 2 11 3

And this is the final part of the last test ran I did, where you can see it deals with one packet and then doesn't seem to handle the following one:

00> [00:00:22.379,608] <inf> Advertisement: Advertising Coded start (0)
00> [00:00:22.383,453] <err> smp_shell: AA
00> [00:00:22.383,514] <err> smp_shell: AA
00> [00:00:22.383,605] <err> smp_shell: AA
00> [00:00:22.383,697] <err> smp_shell: AA
00> [00:00:22.383,758] <err> smp_shell: AA
00> [00:00:22.383,850] <err> smp_shell: AA
00> [00:00:22.383,941] <err> smp_shell: AA
00> [00:00:22.384,033] <err> smp_shell: AA
00> [00:00:22.384,124] <err> smp_shell: AA
00> [00:00:22.384,216] <err> smp_shell: AA
00> [00:00:22.384,277] <err> smp_shell: AA
00> [00:00:22.384,368] <err> smp_shell: AA
00> [00:00:22.384,460] <err> smp_shell: AA
00> [00:00:22.384,552] <err> smp_shell: AA
00> [00:00:22.384,643] <err> smp_shell: AA
00> [00:00:22.384,735] <err> smp_shell: AA
00> [00:00:22.384,796] <err> smp_shell: AA
00> [00:00:22.384,887] <err> smp_shell: AA
00> [00:00:22.384,979] <err> smp_shell: AA
00> [00:00:22.385,070] <err> smp_shell: AA
00> [00:00:22.385,162] <err> smp_shell: AA
00> [00:00:22.385,253] <err> smp_shell: AA
00> [00:00:22.385,314] <err> smp_shell: AA
00> [00:00:22.385,406] <err> smp_shell: AA
00> [00:00:22.385,498] <err> smp_shell: AA
00> [00:00:22.385,589] <err> smp_shell: AA
00> [00:00:22.385,681] <err> smp_shell: AA
00> [00:00:22.385,772] <err> smp_shell: AA
00> [00:00:22.385,833] <err> smp_shell: AA
00> [00:00:22.385,925] <err> smp_shell: AA
00> [00:00:22.386,016] <err> smp_shell: AA
00> [00:00:22.386,108] <err> smp_shell: AA
00> [00:00:22.386,199] <err> smp_shell: AA
00> [00:00:22.386,291] <err> smp_shell: AA
00> [00:00:22.386,352] <err> smp_shell: AA
00> [00:00:22.386,383] <err> smp_shell: BB
00> [00:00:22.441,192] <err> mcusmp: 1
00> [00:00:22.441,253] <err> mcusmp: 2 2816 0
00> [00:00:22.441,253] <err> mcusmp: 3
00> [00:00:22.441,314] <err> mcusmp: 4
00> [00:00:22.441,345] <err> mcusmp: 5
00> [00:00:22.442,077] <dbg> attr: Show: [061] configVersion                 245
00> [00:00:22.442,626] <err> mcusmp: 6
00> [00:00:22.442,626] <err> mcuoth: 99
00> [00:00:22.442,657] <err> smp_shell: QQ
00> [00:00:22.442,840] <err> smp_shell: DD
00> [00:00:22.443,176] <err> smp_shell: DD
00> [00:00:22.443,511] <err> smp_shell: DD
00> [00:00:22.443,847] <err> smp_shell: DD
00> [00:00:22.444,213] <err> smp_shell: DD
00> [00:00:22.444,549] <err> smp_shell: DD
00> [00:00:22.444,885] <err> smp_shell: DD
00> [00:00:22.445,251] <err> smp_shell: DD
00> [00:00:22.445,587] <err> smp_shell: DD
00> [00:00:22.445,922] <err> smp_shell: DD
00> [00:00:22.446,014] <err> smp_shell: DD
00> [00:00:22.446,044] <err> mcusmp: 7
00> [00:00:22.446,075] <err> mcusmp: 8
00> [00:00:22.446,105] <err> mcusmp: 1
00> [00:00:22.446,105] <err> mcusmp: 2 11 3
00> [00:00:22.447,265] <inf> Advertisement: Advertising Coded end (0)
00> [00:00:22.447,998] <inf> Advertisement: update advertising data (0)
00> [00:00:22.448,516] <err> smp_shell: AA
00> [00:00:22.448,608] <err> smp_shell: AA
00> [00:00:22.448,669] <err> smp_shell: AA
00> [00:00:22.448,760] <err> smp_shell: AA
00> [00:00:22.448,852] <err> smp_shell: AA
00> [00:00:22.448,944] <err> smp_shell: AA
00> [00:00:22.449,035] <err> smp_shell: AA
00> [00:00:22.449,127] <err> smp_shell: AA
00> [00:00:22.449,218] <err> smp_shell: AA
00> [00:00:22.449,310] <inf> Advertisement: Advertising Coded start (0)
00> [00:00:22.449,279] <err> smp_shell: AA
00> [00:00:22.449,401] <err> smp_shell: AA
00> [00:00:22.449,493] <err> smp_shell: AA
00> [00:00:22.449,554] <err> smp_shell: AA
00> [00:00:22.449,676] <err> smp_shell: AA
00> [00:00:22.449,737] <err> smp_shell: AA
00> [00:00:22.449,798] <err> smp_shell: AA
00> [00:00:22.449,890] <err> smp_shell: AA
00> [00:00:22.449,981] <err> smp_shell: AA
00> [00:00:22.450,073] <err> smp_shell: AA
00> [00:00:22.450,164] <err> smp_shell: AA
00> [00:00:22.450,256] <err> smp_shell: AA
00> [00:00:22.450,317] <err> smp_shell: AA
00> [00:00:22.450,408] <err> smp_shell: AA
00> [00:00:22.450,500] <err> smp_shell: AA
00> [00:00:22.450,592] <err> smp_shell: AA
00> [00:00:22.450,683] <err> smp_shell: AA
00> [00:00:22.450,775] <err> smp_shell: AA
00> [00:00:22.450,866] <err> smp_shell: AA
00> [00:00:22.450,927] <err> smp_shell: AA
00> [00:00:22.451,019] <err> smp_shell: AA
00> [00:00:22.451,110] <err> smp_shell: AA
00> [00:00:22.451,202] <err> smp_shell: AA
00> [00:00:22.451,293] <err> smp_shell: AA
00> [00:00:22.451,385] <err> smp_shell: AA
00> [00:00:22.451,446] <err> smp_shell: AA
00> [00:00:22.451,477] <err> smp_shell: BB
00> [00:00:33.950,134] <dbg> lcz_param_file: lcz_param_file_validate_file: Found 89 pairs

After this, I stopped the test, and sent a single \n over the UART which then gave this:

00> [00:15:10.528,350] <err> mcusmp: 1
00> [00:15:10.528,411] <err> mcusmp: 2 2816 0
00> [00:15:10.528,442] <err> mcusmp: 3
00> [00:15:10.528,472] <err> mcusmp: 4
00> [00:15:10.528,503] <err> mcusmp: 5
00> [00:15:10.529,266] <dbg> attr: Show: [061] configVersion                 246
00> [00:15:10.529,785] <err> mcusmp: 6
00> [00:15:10.529,815] <err> mcuoth: 99
00> [00:15:10.529,846] <err> smp_shell: QQ
00> [00:15:10.529,998] <err> smp_shell: DD
00> [00:15:10.530,334] <err> smp_shell: DD
00> [00:15:10.530,700] <err> smp_shell: DD
00> [00:15:10.531,036] <err> smp_shell: DD
00> [00:15:10.531,372] <err> smp_shell: DD
00> [00:15:10.531,738] <err> smp_shell: DD
00> [00:15:10.532,073] <err> smp_shell: DD
00> [00:15:10.532,409] <err> smp_shell: DD
00> [00:15:10.532,745] <err> smp_shell: DD
00> [00:15:10.533,111] <err> smp_shell: DD
00> [00:15:10.533,203] <err> smp_shell: DD
00> [00:15:10.533,233] <err> mcusmp: 7
00> [00:15:10.533,264] <err> mcusmp: 8
00> [00:15:10.533,264] <err> mcusmp: 1
00> [00:15:10.533,294] <err> mcusmp: 2 11 3
00> [00:15:10.534,454] <inf> Advertisement: Advertising Coded end (0)

Environment (please complete the following information):

  • OS: Linux
  • Toolchain: Zephyr SDK 0.13.1 (for nRF connect SDK 1.6-1.7, the 1.5 build was either ARM GCC or SES, it was from quite some time ago)
  • Commit SHA or Version used: nRF connect SDK 1.6-1.7 which is zephyr 2.6-current head which has the bug and nRF connect SDK 1.5 which is zephyr 2.5 which does not have the bug

nordicjm avatar Sep 28 '21 17:09 nordicjm

Note that this is SMP but is not DFU, we are using SMP for a custom system which sets variables

nordicjm avatar Sep 28 '21 17:09 nordicjm

Further debugging has shown that smp_shell_process usually gets called 2 or 3 times per received message, normally it hits the if (!buf) { on the first loop (second loop also if it's a 3 loop one) and then hits if (nb != NULL) { on the final loop however on times where it gets stuck with this bug, all 3 times it hits the if (!buf) { part. Managed to get one with 11 loops on that just now also where it failed.

nordicjm avatar Sep 28 '21 17:09 nordicjm

@lairdjm do you plan to investigate this further? Assigning to you for now, since there's not enough here for others to investigate.

carlescufi avatar Nov 09 '21 10:11 carlescufi

This issue has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this issue will automatically be closed in 14 days. Note, that you can always re-open a closed issue at any time.

github-actions[bot] avatar Jan 10 '22 00:01 github-actions[bot]

@jakub-uC @carlescufi I've got as far back as subsys/shell/shell.c in this issue in the shell_thread function, I have a workaround for it which resolves the issue but probably isn't a good fix but might help in tracking down where the issue is, if I change the k_poll to include the SHELL_SIGNAL_TXDONE event, I am no longer able to reproduce this issue, so from that it seems more likely to be a shell bug, not a mcumgr bug (all the mcumgr handlers are being fired, but they are subject to running when the shell tells them to run), and something related to the timings of polled events in the shell or perhaps timing issues between the UART driver and shell events?

nordicjm avatar Jan 20 '22 10:01 nordicjm

I've tried creating a minimal reproduction sample in the latest zephyr (d517947de4e1a1dcce9f5e7fe313e8f297320639) by adding random sensor and text output over the logger in the smp_svr sample, I cannot re-create the same issue as we observe in the full application but I have a program that runs through these steps:

send \06\09AAwAAAACAAABAr//aPQ=\0a
receive \06
receive \09
//receive \04\14
receive =
receive \0a

And loops continiously, without any log output, it runs flawlessly, with the log output, it will sometimes fail to receive the \06 and the output from the program shows:

\06	AIkBAAB/AAABAr9ldGFza3O/Yi0xv2RwcmlvGP9jdGlkAGVzdGF0ZRiA/2IxNL9kcHJpbw5jdGlkAWVzdGF0ZRiA/2IxNL9kcHJpbw5jdGlkAmVzdGF0ZQL/YjE1
\04\14v2RwcmlvD2N0aWQDZXN0YXRlAP9hML9kcHJpbwBjdGlkBGVzdGF0ZRD//\06	AAwAAAACAAABAr//aPQ=
/+uEA==
\06	AIkBAAB/AAABAr9ldGFza3O/Yi0xv2RwcmlvGP9jdGlkAGVzdGF0ZRiA/2IxNL9kcHJpbw5jdGlkAWVzdGF0ZRiA/2IxNL9kcHJpbw5jdGlkAmVzdGF0ZQL/YjE1
\04\14v2RwcmlvD2N0aWQDZXN0YXRlAP9hML9kcHJpbwBjdGlkBGVzdGF0ZRD//\06	AAwAAAACAAABAr//aPQ=
/+uEA==
\06	AIkBAAB/AAABAr9ldGFza3O/Yi0xv2RwcmlvGP9jdGlkAGVzdGF0ZRiA/2IxNL9kcHJpbw5jdGlkAWVzdGF0ZRiA/2IxNL9kcHJpbw5jdGlkAmVzdGF0ZQL/YjE1
\04\14v2RwcmlvD2N0aWQDZXN0YXRlAP9hML9kcHJpbwBjdGlkBGVzdGF0ZRD\06	AAwAAAACAAABAr//aPQ=
///+uEA==
\06	AIkBAAB/AAABAr9ldGFza3O/Yi0xv2RwcmlvGP9jdGlkAGVzdGF0ZRiA/2IxNL9kcHJpbw5jdGlkAWVzdGF0ZRiA/2IxNL9kcHJpbw5jdGlkAmVzdGF0ZQL/YjE1
\04\14v2RwcmlvD2N0aWQDZXN0YXRlAP9hML9kcHJpbwBjdGlkBGVzdGF0ZRD///+uEA==\06	AAwAAAACAAABAr//aPQ=

\06	AIkBAAB/AAABAr9ldGFza3O/Yi0xv2RwcmlvGP9jdGlkAGVzdGF0ZRiA/2IxNL9kcHJpbw5jdGlkAWVzdGF0ZRiA/2IxNL9kcHJpbw5jdGlkAmVzdGF0ZQL/YjE1
\04\14v2RwcmlvD2N0aWQDZXN0YXRlAP9hML9kcHJpbwBjdGlkBGVzdGF0ZRD//\06	AAwAAAACAAABAr//aPQ=
/+uEA==
[00:04:19.125,091] \1b[1;31m<err> test: Testing!!\1b[0m
[00:04:19.641,937] \1b[0m<inf> test: some here\1b[0m
[00:04:19.641,967] \1b[0m<inf> test: and here\1b[0m
[00:04:19.641,967] \1b[0m<inf> test: and this one is a tad bit longer\1b[0m
\06	AIkBAAB/AAABAr9ldGFza3O/Yi0xv2RwcmlvGP9jdGlkAGVzdGF0ZRiA/2IxNL9kcHJpbw5jdGlkAWVzdGF0ZRiA/2IxNL9kcHJpbw5jdGlkAmVzdGF0ZQL/YjE1
\04\14v2RwcmlvD2N0aWQDZXN0YXRlAP9hML9kcHJpbwBjdGlkBGVzdGF0ZRD///+\06	AAwAAAACAAABAr//aPQ=
uEA==
\1b[8D\1b[J[00:04:17.415,130] \1b[0m<inf> test: some here\1b[0m
\1b[1;32muart:~$ \1b	mAIkBAAB/AAABAr9ldGFza3O/Yi0xv2RwcmlvGP9jdGlkAGVzdGF0ZRiA/2IxNL9kcHJpbw5jdGlkAWVzdGF0ZRiA/2IxNL9kcHJpbw5jdGlkAmVzdGF0ZQL/YjE1
\04\14v2RwcmlvD2N0aWQDZXN0YXRlAP9hML9kcHJpbwBjdGlkBGVzdGF0ZRD///+uEA==
#275 @ 260414 ms: x -0.153216 , y 0.306432 , z -9.346176
#276 @ 261363 ms: x -0.153216 , y 0.306432 , z -9.652608

In the output above, the final mAIkBAAB segment seems to be missing the \06, I assume there is some contention or overwriting when SMP over shell and logging are being used? @nordic-krch any thoughts on this?

nordicjm avatar Feb 15 '22 09:02 nordicjm

I've tried every way I can think of to investigate this with various mutex locks and debugging, the current method I have is to add a mutex to shell_uart for all output functions and to also lock it from smp_shell before transmitting, I also lock the shell write mutex, this has added a corruption bug to the output sometimes but ignoring that for a moment, I am still occasionally seeing this in the output: \1b[1;32muart:~$ \1b[m\1b[8D\1b[J\1b[1;32muar\06t\12$\1b[mABgDAAAOAAABAL9hcmlibGFoIGJsYWj/pVs= Somehow, the 't' in uart is getting output after the '\06' from the smp handler, which given smp is taking and holding both mutexes should not be possible, so is there something wrong with the nordic UARTE driver in that it falsely believes the transmit buffer is empty when it is not?

nordicjm avatar Mar 17 '22 10:03 nordicjm

Upon further investigation this seems to be caused by multiple different bugs, the ones I've identified so far:

  1. SMP shell uses the uart_poll_out function directly, however shell_uart will only use that if interrupt driven UART is not enabled, therefore it should call the shell_uart write function which will either use uart_poll_out directly or it's own internal ring buffer
  2. Logging, like SMP, uses the uart_poll_out function directly, this is the cause of overwriting data in the UART output as it will trumple the shell's ring buffer
  3. None of the above are protected with mutexes either, so they're all free to start outputting during the middle of another one outputting
  4. ~~The issue with data being sent and not 'processed' until the UART outputs a byte is a strange issue that I'm not yet able to track down, with logging disabled, I'm finding it incredibly hard to reproduce also~~ With SMP and logging updated to use the UART shell write function, I've not been able to reproduce this issue

nordicjm avatar Mar 17 '22 11:03 nordicjm

@dcpleung What would your thoughts be on adding a lock/unlock mutex to the UART API which can be used by components that use polling to output data, to prevent one system from interfering with another that is currently using it? Flow would be:

  • Lock UART
  • Call poll function as many times as needed in a loop
  • Unlock UART

Everything would then need to make use of this, e.g. shell, logging, smp so that only one outputs at a time

nordicjm avatar Mar 18 '22 09:03 nordicjm

Don't think a mutex would work during kernel panic where messages are still being printed.

dcpleung avatar Mar 18 '22 21:03 dcpleung

Based on the exposition of information and events given by @lairdjm I am assigning this to @nordic-krch as the Nordic UART driver maintainer. CC @anangl

carlescufi avatar Mar 21 '22 14:03 carlescufi

https://github.com/lairdjm/zephyr/tree/smpfixtest is where I stored the test code I used to find where the issue is if it's of any use

nordicjm avatar Mar 21 '22 14:03 nordicjm

@lairdjm i briefly took a look. It seems to me that smp_shell.c should not use UART directly but write to shell. Otherwise they compete for UART. poll_out on nordic devices works in a way that it's ok to compete for it and all data will be printed but it may be interleaved.

Logging, like SMP, uses the uart_poll_out function directly

That's not true. When shell is enabled on UART then raw uart logging backend is disabled and it goes through shell to ensure that data is not interleaved. I will take a look what would it take to make smp_shell print to shell.

However, i cannot explain why it worked in some older builds.

nordic-krch avatar Apr 11 '22 10:04 nordic-krch

SMP cannot write to shell as the commands are not part of the shell and need to be output on their own, without any additions or changes from the shell component e.g. adding timestamps, module names, colours, etc.

nordicjm avatar Apr 11 '22 10:04 nordicjm

Can you try to use shell mutex around printing command, something like: https://gist.github.com/nordic-krch/c6d655b1e2a67b05645f46606b5094a8

If that help we can try to extend shell API to allow dumping user data to shell.

nordic-krch avatar Apr 11 '22 11:04 nordic-krch

@nordic-krch Unfortunately it does not seem to resolve the issue

nordicjm avatar Jun 07 '22 07:06 nordicjm

Actually it might work but be one of 2 issues, I've added test code that disables logging when an SMP command is received and that works much more reliably but it does still fail and require me to stop the test and manually send a newline to get the response, so there is a UART input error which is part of this issue also.

nordicjm avatar Jun 07 '22 07:06 nordicjm

This issue has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this issue will automatically be closed in 14 days. Note, that you can always re-open a closed issue at any time.

github-actions[bot] avatar Aug 07 '22 00:08 github-actions[bot]

Created a sample application which shows this issue: https://github.com/nordicjm/zephyr/tree/uart_mcu_issue

Build zephyr/samples/subsys/mgmt/mcumgr/smp_svr for a board with overlay-shell-mgmt.conf and overlay-shell.conf overlays then flash. I was using https://github.com/LairdCP/UwTerminalX to test this issue but anything that sends/receives data in a loop and checks it can be used. For UwTerminalX, open it to the serial port, uncheck the local echo checkbox, right click the background and select scripting, paste this in:

>\06\09AFgCAABOAAABAL9hZHhIc2hvcnQgbWVzc2FnZSB0aGF0IGhvcGVmdWxseSBhbiBucmY1MSBjYW4gaGFuZGxlIHByb3Blcmx5Li4uIG1heWJlIGEgYml0/85i\n
<\06\09AFgDAABOAAABAL9hcnhIc2hvcnQgbWVzc2FnZSB0aGF0IGhvcGVmdWxseSBhbiBucmY1MSBjYW4gaGFuZGxlIHByb3Blcmx5Li4uIG1heWJlIGEgYml0/4GL\n

Change repeats to -1 and click play, it will send a few times then stop because the receive is not complete. Output in the main window will show where it was interrupted:

\06	AFgDAABOAAABAL9hcnhIc2hvcnQgbWVzc2FnZSB0aGF0IGhvcGVmdWxseSBhbiBucmY1MSBjYW4gaGFuZGxlIHByb3Blcmx5Li4uIG1heWJlIGEgYml0/4GL
\06	AFgDAABOAAABAL9hcnhIc2hvcnQgbWVzc2FnZSB0aGF0IGhvcGVmdWxseSBhbiBucmY1MSBjYW4gaGFuZGxlIHByb3Blcmx5Li4uIG1heWJlIGEgYml0/4GL
test output b\06	AFgDAABOAAABAL9hcnhIc2hvcnQgbWVzc2FnZSB0aGF0IGhvcGVmdWxseSBhbiBucmY1MSBjYW4gaGFuZGxlIHByb3Blcmx5Li4uIG1heWJlIGEgYml0/4GL
\1b[8D\1b[J[00:00:01.530,029] \1b[1;31m<err> cto: test output a\1b[0m
\1b[1;32muart:~$ \1b[m\1b[8D\1b[J[00:00:01.596,130] \1b[1;31m<err> cto: test output a\1b[0m
\1b[1;32muart:~$ \1b[m\1b[8D\1b[J[00:00:01.662,200] \1b[1;31m<err> cto: test output a\1b[0m
\1b[1;32muart:~$ \1b[m\1b[8D\1b[J[00:00:01.728,271] \1b[1;31m<err> cto: test output a\1b[0m
\1b[1;32muart:~$ \1b[m\1b[8D\1b[J[00:00:01.794,372] \1b[1;31m<err> cto: test output a\1b[0m
\1b[1;32muart:~$ \1b[m\1b[8D\1b[J[00:00:01.860,443] \1b[1;31m<err> cto: test output a\1b[0m
\1b[1;32muart:~$ \1b[m\1b[8D\1b[J[00:00:01.926,513] \1b[1;31m<err> cto: test output a\1b[0m
\1b[1;32muart:~$\06 \1b[	AFgDAABOAAABAL9hcnhIc2hvcnQgbWVzc2FnZSB0aGF0IGhvcGVmdWxseSBhbiBucmY1MSBjYW4gaGFuZGxlIHByb3Blcmx5Li4uIG1heWJlIGEgYml0/4GL
test output btest output btest output b\1b[8D\1b[J[00:00:01.992,584] \1b[1;31m<err> cto: test output a\1b[0m
\1b[1;32muart:~$ \1b[m\1b[8D\1b[J[00:00:02.058,654] \1b[1;31m<err> cto: test output a\1b[0m

Can see the start header, \06, part the way through a shell output

nordicjm avatar Sep 15 '22 08:09 nordicjm

Enabling CONFIG_LOG_PRINTK resolves this issue, have noted in the linked thread about a similar bug that making this the default when logging is enabled would resolve some bugs

nordicjm avatar Sep 21 '22 12:09 nordicjm

This issue has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this issue will automatically be closed in 14 days. Note, that you can always re-open a closed issue at any time.

github-actions[bot] avatar Nov 21 '22 00:11 github-actions[bot]