recvmsg_multishot + IOU_PBUF_RING_INC causes spurious BADADDR errors
As the title says, combining recvmsg_multishot with IOU_PBUF_RING_INC causes spurious BADADDR errors. Is this an unsupported combo or just an edge case that I've discovered? My use case is that I want to recv data along with kernel & hardware timestamps for each message. Another constraint is that all the bytes read previously & the new bytes read have to be contiguous in userspace, to simplify the websocket protocol layer. To achieve this, I split my 2MB circular buffer into two 1MB parts & create a buffer ring with both these parts & IOU_PBUF_RING_INC flag set. Then I do a recv_multishot & read the number of bytes (from the CQE result) from the buffer. After reading, I move my userspace tail pointer ahead by the number of bytes processed (some bytes might be left due to partial websocket frames). When the tail crosses the 1MB or 2MB mark, I return the 1st or the 2nd buffer to the kernel respectively. This works perfectly with recv_multishot, but not with recvmsg_multishot (I know I've to shift the bytes in my buffer due to the msghdr).
Say the msghdr requires 24 bytes. I think the problem is that io-uring will use the last buffer it used from the buffer ring & sometimes that buffer will have less than 24 bytes left. Then the networking stack at some point calls put_cmsg, but there aren't enough bytes to do so & we get BADADDR. I've a very rough rust example demonstrating this here. You can increase the msg_controllen & see the kernel return the error. This is because the buffer size is 32 + 18 = 50 & the first msg requires 16 (msghdr) + 1 (controllen) + 15 (payload) = 32 bytes. For the second msg you require at least 17 bytes for the msghdr & here we're left with 50 - 32 = 18 bytes, so you get 1 byte from the payload. When you increase controllen to say 2, you need 33 bytes for the first msg & at least 18 bytes for the second msg, but we're only left with 50 - 33 = 17 bytes & we get an error.
One solution to this is to just accept that you'll get such errors every once in a while & just reset the buffer ring & continue. Another solution would be to explicitly check if there are enough bytes left in the buffer & skipping it if necessary.
Instead, I think a good solution would be to have separate buffers for msghdr. This would be perfect for my particular use case, as I won't have to shift bytes in my buffer to get rid of msghdr in the middle. But I think it will also simplify the API because you won't need io_uring_recvmsg_out & other methods anymore as the msghdr buffer will have the same structure as a normal msghdr struct (I think). One way to do that would be to add a 64-byte or a 128-byte CQE & use the extra bytes there for msghdr. This would constrain the msghdr size to 128 - 16 bytes. I've only ever needed 48 bytes for the control msg (so 64 bytes for msghdr total), so I don't know if this is a good solution or not. Another way would be to allow passing a separate buffer group for msghdr & post the buffer id for msghdr in the CQE separately. This is the most flexible solution, although the API would get a bit complicated.
CC @DylanZA as he wrote the original multishot recvmsg code.
I'll ponder this a bit as well. As an initial fix, I think checking if there are enough bytes left and skipping makes sense. But it's also worth pondering a potentially better solution, like having a separate buffer group just for the msghdr. The NOTIF part of CQEs could be used for this.
One way to do that would be to add a 64-byte or a 128-byte CQE & use the extra bytes there for msghdr. This would constrain the msghdr size to 128 - 16 bytes. I've only ever needed 48 bytes for the control msg (so 64 bytes for msghdr total), so I don't know if this is a good solution or not.
I think thats unlikely to be a robust solution - but cannot say for sure. I would expect there are some sockets where you routinely have much bigger headers. It seems like SCM_RIGHTS on unix sockets for example can be much larger.
Instead, I think a good solution would be to have separate buffers for msghdr. This would be perfect for my particular use case, as I won't have to shift bytes in my buffer to get rid of msghdr in the middle. But I think it will also simplify the API because you won't need io_uring_recvmsg_out & other methods anymore as the msghdr buffer will have the same structure as a normal msghdr struct (I think).
I agree that might make this easier. But am not sure it's worth it due to extra complication. Like it would be difficult to communicate which of the two buffers are overflowing for example.
I don't have a great solution here other than doing a memmove before returning data to the user. I doubt it will be massively impactful - but its not fantastic.
As an initial fix, I think checking if there are enough bytes left and skipping makes sense.
Is there any way to tell io-uring it should stop using a buffer and start using the next one in a ring?
As an initial fix, I think checking if there are enough bytes left and skipping makes sense.
Is there any way to tell io-uring it should stop using a buffer and start using the next one in a ring?
Even if there was a way, what if more data arrives before your SQE is processed.
Good point. The only thing I can think of is to tell it to switch well before you get close to the end.
Good point. The only thing I can think of is to tell it to switch well before you get close to the end.
I think at that point I would just not use RING_INC at all. The reason I use RING_INC is that I often get full TLS payloads (16k is the cap) & there will be partial websocket frames at the end of them. In that case, I want to avoid having the partial websocket frame in 2 different buffers. That is why I'm using incremental mode so that I don't have to care about it. Using recvmsg_multishot already makes this worse because I've to shift the buffer by 40 bytes (that is the msghdr size for me), which is why I suggested writing the msghdrs to a separate buffer. Currently I'm using recv_multishot to avoid that problem as well.