USB not recognized after intermittent overflow
The USB connection fails intermittently (more often than not) while trying to load coefficients from the LibreCAL. I've narrowed the issue down to the LibreCAL embedded software, although I initially discovered it using the LibreVNA app v1.6.2 during calibration on a Windows 11 machine with the factory-provided USB cable. My LibreCAL was delivered with firmware v0.2.1 and calibration completed successfully several times through the LibreVNA app. I noticed that the calibration seemed a bit off, so I updated the firmware to v0.2.3 and updated my factory calibration files per https://github.com/jankae/LibreCAL/issues/23. This all worked fine, but afterwards I was unable to run calibration through the LibreVNA app, which led me down this troubleshooting path.
To isolate the issue, I tried using the LibreCAL app v0.2.3 to load the FACTORY coefficients from the device (the LibreCAL embedded software was also at v0.2.3). The coefficient loading process began, started to lag, and eventually reached 100%, but only some of the coefficients were loaded. Upon failure the LibreCAL app loses connection to the LibreCAL and Windows reports that the USB device is not recognized. This failure was also reproduced on an additional Windows 10 machine and with a known-good USB cable (the USB-C to USB-C cable used to run the LibreVNA).
To isolate it further, I used the virtual COM port interface with Tera Term and the SCPI command ":COEFFicient:GET? FACTORY P1_OPEN". This command works, but intermittently fails partway through the response after a few repeated requests. Commands with shorter responses never fail (e.g. ":COEFFicient:NUMber? FACTORY P1_OPEN" or ":FIRMWARE?"), even when using a Tera Term macro to send the requests repeatedly (every 10-100 ms) for more than 10 minutes.
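For reference, here is a minimal sketch of the same repeated-query stress test as a small C program instead of a Tera Term macro. It assumes a POSIX host and that the LibreCAL enumerates as /dev/ttyACM0; the device path, line termination, and timing are assumptions, not taken from my actual Windows setup.

/* Sketch of a repeated SCPI query stress test over the virtual COM port.
 * Assumptions: POSIX host, LibreCAL enumerates as /dev/ttyACM0, commands
 * terminated with CR/LF. Only the first chunk of each response is read,
 * which is enough to detect the device going silent. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <termios.h>
#include <unistd.h>

int main(void) {
    int fd = open("/dev/ttyACM0", O_RDWR | O_NOCTTY);
    if (fd < 0) { perror("open"); return 1; }

    struct termios tio;
    tcgetattr(fd, &tio);
    cfmakeraw(&tio);           /* raw mode: no echo, no line editing */
    tio.c_cc[VMIN] = 0;
    tio.c_cc[VTIME] = 10;      /* read() times out after 1 second */
    tcsetattr(fd, TCSANOW, &tio);

    const char *cmd = ":COEFFicient:GET? FACTORY P1_OPEN\r\n";
    char buf[512];
    for (int i = 0; i < 1000; i++) {
        write(fd, cmd, strlen(cmd));
        ssize_t n = read(fd, buf, sizeof(buf) - 1);
        if (n <= 0) {            /* timeout: device stopped responding */
            printf("no response on iteration %d\n", i);
            break;
        }
        printf("iteration %d: got %zd bytes\n", i, n);
        usleep(10 * 1000);       /* ~10 ms between requests */
    }
    close(fd);
    return 0;
}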
I suspected some sort of buffer overflow during USB tx, so I tried to isolate the issue in code by adding some debug output to the Touchstone::PrintFile() function, but the issue persisted and I don't see a clear connection to the USB buffer status.
// Helper added for debugging: write a message directly to the CDC interface
static void debug_log(const char* msg) {
    tud_cdc_write_str(msg);
    tud_cdc_write_flush();
}

// ...inside the read/transmit loop of Touchstone::PrintFile():
if(br > 0) {
    char logbuf[64];
    // Log how many bytes f_read returned and how much space the USB CDC
    // TX FIFO reports before handing the chunk to tx_func()
    snprintf(logbuf, sizeof(logbuf), "[f_read br=%u]\r\n", (unsigned) br);
    debug_log(logbuf);
    snprintf(logbuf, sizeof(logbuf), "[usb available=%u]\r\n",
             (unsigned) tud_cdc_write_available());
    debug_log(logbuf);
    debug_log("[tx_func start]\r\n");
    tx_func(buffer, br, interface);
    debug_log("[tx_func end]\r\n");
    vTaskDelay(1);
}
if(br < sizeof(buffer)) {
    // Short read: end of file reached, stop transmitting
    break;
}
I made several captures in Wireshark and saved the log files, but apparently can't share them on GitHub. Here is a screenshot of one of the captures taken while I was using Tera Term to troubleshoot.
I'm happy to continue to troubleshoot, but am sort of at a loss as to what to try next. I thought I would share my findings and hope someone has some ideas.
Thank you for all the work and for narrowing it down. You mentioned that you updated to firmware 0.2.3, but have you tried the latest development version as well? (download here: https://github.com/jankae/LibreCAL/actions/runs/13782279765).
I do remember seeing occasional USB issues which stopped after I updated the Pico SDK in 4e90efacc2aa6cb49272e5c4a0247e08cd54e76b. With a bit of luck this already solves your problem and if that is the case it is really time for a 0.2.4 release.
Any luck with the development version of the LibreCAL?
There are some issues in the toolchain which I cannot quite figure out yet. Switching to SDK version 2.1.1 made it much more stable for me, but then GitHub deprecated the Ubuntu 20.04 runner (which was building the embedded firmware). The 22.04 and 24.04 runners bump GCC from version 9.3 to 10.3.
And for some reason, building the exact same code with 10.3 results in a firmware that crashes/throws errors when reading and/or writing calibration coefficients. Newer GCC versions exhibit problems as well, although the exact symptom varies. I am a bit clueless right now because I do not really believe in a compiler bug, but I have no better explanation.
For now I am manually installing and using GCC 9.3 in the CI/CD which seems to be a feasible workaround. Could you please try the firmware from this build? https://github.com/jankae/LibreCAL/actions/runs/15323882295
@jankae Have you tested toolchainVersion 14_2_Rel1 (it is based on arm-none-eabi GCC 14.2.1), which is used by Pico SDK 2.1.1 with the Raspberry Pi Pico Visual Studio Code extension in VS Code?
Yes I did and the problems changed but did not disappear.
But I believe I finally found something (turns out that attaching a debugger and adding UART output actually helps). So far I have identified two different problems:
Problem 1:
Two different tasks call functions from the TinyUSB library: defaultTask and TinyUSB. TinyUSB can be configured with the option OPT_OS_FREERTOS, which uses a FreeRTOS mutex for access control to the USB endpoint streams. But apparently that option is hardcoded to OPT_OS_PICO in the SDK/TinyUSB:
Some links with explanations
And the mutex implementation with that option has no concept of task priorities. If the lower-priority task (which is TinyUSB) grabs the mutex first and the higher-priority task then tries to lock it, the higher-priority task prevents the lower-priority task from ever running again to unlock it, and we have a deadlock.
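To illustrate the failure mode, here is a toy example with a hand-rolled busy-wait lock on a single-core preemptive scheduler. The task names and the lock are placeholders for illustration only; this is not the SDK mutex code or the LibreCAL code.

/* Toy illustration of a priority-unaware lock deadlocking under a
 * preemptive single-core FreeRTOS scheduler. The two tasks only stand in
 * for the real defaultTask/TinyUSB tasks. */
#include "FreeRTOS.h"
#include "task.h"
#include <stdatomic.h>

static atomic_flag usb_lock = ATOMIC_FLAG_INIT;

static void low_prio_task(void *param) {        /* stands in for TinyUSB */
    (void)param;
    for (;;) {
        while (atomic_flag_test_and_set(&usb_lock)) { /* spin */ }
        for (volatile int i = 0; i < 10000; i++);     /* widen the race window */
        /* If the high-priority task starts spinning on usb_lock right now,
         * this task is never scheduled again and the clear below never runs. */
        atomic_flag_clear(&usb_lock);
        vTaskDelay(1);
    }
}

static void high_prio_task(void *param) {       /* stands in for defaultTask */
    (void)param;
    for (;;) {
        while (atomic_flag_test_and_set(&usb_lock)) {
            /* Spinning at higher priority starves low_prio_task forever,
             * so the lock is never released -> deadlock. */
        }
        atomic_flag_clear(&usb_lock);
        vTaskDelay(1);
    }
}

void start_deadlock_demo(void) {
    xTaskCreate(low_prio_task,  "lowprio",  256, NULL, tskIDLE_PRIORITY + 1, NULL);
    xTaskCreate(high_prio_task, "highprio", 256, NULL, tskIDLE_PRIORITY + 2, NULL);
    vTaskStartScheduler();
}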
The only solution (apart from changing to OPT_OS_FREERTOS in the SDK/TinyUSB, which I apparently can not do from my code if I want to keep using the standard toolchain) is to give all tasks the same priority (with preemption turned on in FreeRTOS).
With that change I no longer observe any deadlocks, but they were reasonably infrequent before, so I do not fully trust the solution yet.
Problem 2:
With newer compiler versions, writing to the flash sometimes fails (which then escalates to a FatFs error and ultimately an error response on the SCPI interface). The problem is the timing of the CS pin: the CS release between the Write Enable Latch command and the following erase or write operation is very short (8-10 ns), while the flash chip requires a longer time. Adding a sleep_us(1) there seems to solve the issue.
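Roughly what that fix looks like (the CS GPIO, SPI instance, and command bytes below are generic placeholders using plain Pico SDK calls, not the actual LibreCAL flash driver):

/* Sketch of the CS-timing fix: keep CS deasserted for at least ~1 us
 * between the Write Enable command and the following erase command.
 * FLASH_CS_PIN and spi1 are assumptions for illustration only. */
#include "pico/stdlib.h"
#include "hardware/spi.h"

#define FLASH_CS_PIN 9                       /* assumed CS GPIO */

static void flash_erase_sector(uint32_t addr) {
    const uint8_t write_enable = 0x06;       /* Write Enable Latch */
    const uint8_t erase_cmd[4] = {
        0x20,                                /* 4 KiB sector erase */
        (uint8_t)(addr >> 16), (uint8_t)(addr >> 8), (uint8_t)addr
    };

    gpio_put(FLASH_CS_PIN, 0);
    spi_write_blocking(spi1, &write_enable, 1);
    gpio_put(FLASH_CS_PIN, 1);

    sleep_us(1);   /* without this the CS-high gap was only ~8-10 ns,
                      shorter than the flash chip accepts */

    gpio_put(FLASH_CS_PIN, 0);
    spi_write_blocking(spi1, erase_cmd, sizeof(erase_cmd));
    gpio_put(FLASH_CS_PIN, 1);
}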
I guess both problems are ultimately down to timing and that can obviously change if newer compiler versions are able to optimize more.
I am cautiously optimistic that I have now identified the root causes and can use newer toolchain versions. Commits of these changes will follow.
Code changes are added, please test again with v0.3.0
Jan...I've tested v0.3.0 and can confirm it has fixed all of the issues that I was experiencing. Thanks for taking the time to figure it out.
Awesome, thank you for testing!