usb: Fix USB device library transfer thread safety
The USB pending transfer flag was cleared before calling the completion callback, so that the callback code could call submit_xfer() again.
Unfortunately this isn't safe in a multi-threaded environment: another thread may see that the endpoint is available before the callback has finished executing, and submit a new transfer.
Rather than adding extra locking, specifically treat the transfer as still pending if it is checked from another thread while the callback is executing.
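For illustration, a minimal sketch of the approach (simplified relative to the actual core.py code; names like _cb_thread and _callback here are illustrative):

    import _thread

    class Interface:
        def __init__(self):
            self._pending = False    # a transfer is submitted and not yet completed
            self._cb_thread = None   # thread currently running the completion callback
            self._callback = None    # completion callback, set when the transfer is submitted

        def xfer_pending(self):
            # Busy if a transfer is in flight, or if the completion callback
            # is still running on a *different* thread than the caller's.
            return self._pending or (
                self._cb_thread is not None and self._cb_thread != _thread.get_ident()
            )

        def _xfer_cb(self, result):
            # Record the callback thread *before* clearing the pending flag, so
            # there is no window where another thread sees the endpoint as free.
            self._cb_thread = _thread.get_ident()
            self._pending = False
            try:
                self._callback(result)  # may call submit_xfer() to set _pending again
            finally:
                self._cb_thread = None

This way the callback's own thread can re-submit immediately, while any other thread keeps seeing the endpoint as busy until the callback returns.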
Together with https://github.com/micropython/micropython/pull/15264 this closes #874.
Tested using the sample code from the linked issue.
This work was funded through GitHub Sponsors.
A previous version of this PR had a commit which added locking in the write path, but that doesn't actually solve the problem of writing to a Buffer from multiple threads concurrently. Added a note about this in the README instead.
(@mischif this part isn't related to the bug you submitted, as you're only writing from one producer. Just something I noticed in testing.)
@mischif are you able to test this fix? If you get a recent MicroPython nightly that includes the other fix, then you should be able to install the usb-device package directly from this branch with:
$ mpremote connect PORT mip install --index https://projectgus.github.io/micropython-lib/mip/bugfix/usb_pending_xfer_threads usb-device
(As per https://github.com/micropython/micropython-lib/?tab=readme-ov-file#installing-packages-from-forks )
(For the record, while the code in the issue replicates the bug, it's not exactly my use case; I do want to write from both threads, but I write to ttyACM1 on core 0 and ttyACM0 on core 1.)
I've updated my Pico to run 1.24.0 preview 103, the above usb-device package, and the usb-device-cdc code from #886, and can confirm I no longer see either of the errors I included in micropython/micropython-lib#874, most importantly the AssertionError that was very common.
I have seen this error a couple of times during testing, but it doesn't seem to affect my code:
Traceback (most recent call last):
  File "usb/device/core.py", line 340, in _xfer_cb
  File "/lib/usb/device/cdc.py", line 346, in _rd_cb
  File "/lib/usb/device/cdc.py", line 340, in _rd_xfer
  File "usb/device/core.py", line 601, in submit_xfer
  File "usb/device/core.py", line 313, in _submit_xfer
RuntimeError: maximum recursion depth exceeded
@projectgus OK, the initial bug that led me to create the issue seems pretty much solved, but now I have what I think is a related issue. My code looks something like this:
async def api_hdlr():
    [...]
    elif SCRATCH[2] == CFG_READ:
        with open(CONFIG.path, "rb") as f:
            SCRATCH[:256] = f.read(256)
        print(SCRATCH_DATA[:256], file=SERIAL_TWO, end="\r\n")
    [...]
The function runs in a loop on a secondary thread; I want to read the whole file, but the issue shows up with just the first 256 bytes. The print statement seems to trigger an exception at various lines, but the result is the same:
Traceback (most recent call last):
  File "usb/device/core.py", line 340, in _xfer_cb
  File "/lib/usb/device/cdc.py", line 328, in _wr_cb
  File "/lib/usb/device/cdc.py", line 321, in _wr_xfer
  File "usb/device/core.py", line 562, in xfer_pending
RuntimeError: maximum recursion depth exceeded
Let me know if you want me to split this into a new issue, but it may not be as easy to recreate this bug standalone.
> The function runs in a loop on a secondary thread; I want to read the whole file, but the issue shows up with just the first 256 bytes. The print statement seems to trigger an exception at various lines, but the result is the same:
This one is hopefully simple: the stack size of the secondary thread probably isn't big enough. The default stack size for the secondary thread on rp2 is 4096 bytes. You can increase it by calling _thread.stack_size(8192) (or similar) before you create the new thread. See the CPython docs for more info: https://docs.python.org/3.5/library/_thread.html#_thread.stack_size (although note that the exact sizes documented there don't apply to MicroPython).
The transfer callback pushes frames onto the stack, and async coroutines may push more data onto it, so when a callback is triggered while a coroutine is running, total stack usage peaks.
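For example (a minimal sketch; api_hdlr_loop is a stand-in for whatever function you pass to start_new_thread):

    import _thread

    # Must be called before the thread is created; only affects threads
    # started after this call. 8192 is double the rp2 default of 4096 bytes.
    _thread.stack_size(8192)
    _thread.start_new_thread(api_hdlr_loop, ())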
I haven't seen a RuntimeError since increasing the stack size, so I think that was the issue; now I have another issue that may or may not be related:
async def api_hdlr():
    [...]
    elif SCRATCH[2] == CFG_READ:
        chunk_size = 256
        with open(CONFIG.path, "rb") as f:
            cfg_left = f.seek(0, 2)
            f.seek(0)
            while cfg_left:
                SCRATCH[:chunk_size] = f.read(chunk_size)
                SERIAL_TWO.write(SCRATCH[:chunk_size])
                cfg_left -= chunk_size
                if cfg_left < chunk_size:
                    chunk_size = cfg_left
                await sleep_ms(20)
    [...]
This only writes ~70% of the file out over serial: I get 55034 bytes of a 77341-byte file. Reducing the chunk size to as low as 64, or increasing the sleep time to 750ms so the buffer has more time to flush, doesn't change the amount written, and calling SERIAL_TWO.flush() raises ETIMEDOUT because I initialized the interface with a timeout of 0.
Is there some other way to flush the serial buffer, or figure out what's going on?
@mischif Not sure about this one, unfortunately. One thing I'd note is that if the timeout is 0 then you need to check the result of write() for the number of bytes that were actually written, as it may be less than you asked for. If that clue doesn't help then please open a separate issue for it.
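If short writes turn out to be the cause, a loop along these lines would retry them (a sketch reusing SCRATCH, SERIAL_TWO, chunk_size and sleep_ms from your snippets; it assumes write() returns the number of bytes accepted, treating None as 0):

    # Keep writing until the whole chunk has been accepted by the TX buffer.
    mv = memoryview(SCRATCH)[:chunk_size]
    while len(mv):
        n = SERIAL_TWO.write(mv) or 0  # bytes accepted this call; 0 if the buffer is full
        mv = mv[n:]
        if not n:
            await sleep_ms(5)  # yield so the USB stack can drain the buffer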
@projectgus Turns out the incomplete writes were due to an interaction with my testing code; after fixing that, everything works great, so I think this issue is good to close.