Implement MSG_ZEROCOPY for blight messages
@Eeems Here's a prototype of a zero-copy variant of `send_blocking` (`send_blocking_zerocopy`) that uses MSG_ZEROCOPY:
```cpp
#include <sys/socket.h>
#include <linux/errqueue.h>
#include <netinet/in.h>
#include <cerrno>
#include <cstring>
#include <mutex>
#include <unordered_map>

// Helper struct to track zero-copy buffers
struct zerocopy_buffer {
    blight_data_t data;
    size_t size;
    int refcount;
    // Could add a callback for when the buffer is fully released
};

// Global buffer tracking (in a real implementation, this would be more sophisticated)
static std::unordered_map<blight_data_t, zerocopy_buffer*> zerocopy_buffers;
static std::mutex zerocopy_mutex;

void release_zerocopy_buffer(zerocopy_buffer* buffer);

// Register a buffer for zero-copy transmission
zerocopy_buffer* register_zerocopy_buffer(blight_data_t data, size_t size) {
    std::lock_guard<std::mutex> lock(zerocopy_mutex);
    auto buffer = new zerocopy_buffer{data, size, 1};
    zerocopy_buffers[data] = buffer;
    return buffer;
}

// Function signature changed to return the zerocopy_buffer for the caller to track.
// Note: the socket must have SO_ZEROCOPY enabled via setsockopt() beforehand,
// otherwise send() with MSG_ZEROCOPY fails with EINVAL.
zerocopy_buffer* send_blocking_zerocopy(int fd, const blight_data_t data, ssize_t size) {
    // Register the buffer for zero-copy transmission
    zerocopy_buffer* buffer = register_zerocopy_buffer(const_cast<blight_data_t>(data), size);
    ssize_t sent = 0;
    while (sent < size) {
        if (!wait_for_send(fd)) {
            if (errno == EAGAIN || errno == EINTR) {
                short_pause();
                continue;
            }
            // Clean up on failure
            release_zerocopy_buffer(buffer);
            return nullptr;
        }
        // Use MSG_ZEROCOPY flag for zero-copy transmission
        ssize_t res = ::send(fd, &data[sent], size - sent, MSG_NOSIGNAL | MSG_ZEROCOPY);
        if (res > 0) {
            sent += res;
            continue;
        }
        if (res == 0) {
            // Connection closed
            errno = ECONNRESET;
            release_zerocopy_buffer(buffer);
            return nullptr;
        }
        // ENOBUFS means the per-socket optmem limit was hit; treat it as transient
        if (errno != EAGAIN && errno != EINTR && errno != ENOBUFS) {
            // Unexpected error
            release_zerocopy_buffer(buffer);
            return nullptr;
        }
        // Temporary error, retry
        short_pause();
    }
    // The caller must keep the buffer alive until a completion notification
    // arrives on the socket's error queue (see process_zerocopy_completions)
    return buffer;
}

// Process zero-copy completion notifications from the socket's error queue
bool process_zerocopy_completions(int fd) {
    struct sock_extended_err err;
    struct msghdr msg = {};
    char control[CMSG_SPACE(sizeof(struct sock_extended_err))];
    msg.msg_control = control;
    msg.msg_controllen = sizeof(control);
    // MSG_ERRQUEUE reads never block; only the control data matters here
    int res = recvmsg(fd, &msg, MSG_ERRQUEUE);
    if (res == -1)
        return false;
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    if (!cmsg || cmsg->cmsg_level != SOL_IP || cmsg->cmsg_type != IP_RECVERR)
        return false;
    memcpy(&err, CMSG_DATA(cmsg), sizeof(err));
    if (err.ee_errno != 0 || err.ee_origin != SO_EE_ORIGIN_ZEROCOPY)
        return false;
    // err.ee_info and err.ee_data bound the range of completed send() calls
    std::lock_guard<std::mutex> lock(zerocopy_mutex);
    // In a real implementation, we'd map the [ee_info, ee_data] range to
    // specific buffers and decrement their reference counts here
    return true;
}

// Release a zero-copy buffer (decrease refcount, free if refcount reaches 0)
void release_zerocopy_buffer(zerocopy_buffer* buffer) {
    if (!buffer)
        return;
    std::lock_guard<std::mutex> lock(zerocopy_mutex);
    if (--buffer->refcount <= 0) {
        zerocopy_buffers.erase(buffer->data);
        delete[] buffer->data; // Only free the data when refcount reaches 0
        delete buffer;
    }
}
```
This implementation:
- Creates a tracking system for zero-copy buffers
- Uses MSG_ZEROCOPY flag with send()
- Provides a mechanism to track when it's safe to free the buffer
- Includes a function to process completion notifications
Note that in a real implementation, you'd want to integrate this with your event loop to process the MSG_ERRQUEUE messages that signal when buffers can be released.
🧠 Learnings used
Learnt from: Eeems
PR: Eeems-Org/oxide#397
File: shared/libblight_protocol/socket.cpp:74-99
Timestamp: 2025-03-14T05:57:20.573Z
Learning: In shared/libblight_protocol/socket.cpp, the `recv_blocking` method is intentionally designed to block indefinitely, as it's only meant to be used in situations where data is guaranteed to be available due to a previous header being received. The non-blocking alternative with timeouts and retry limits is implemented in the `recv` method in the same file.
Originally posted by @coderabbitai[bot] in https://github.com/Eeems-Org/oxide/pull/397#discussion_r1996601190
_💡 Verification agent_
🧩 Analysis chain
Consider implementing the MSG_ZEROCOPY TODO as a separate function.
Based on previous review discussions, implementing MSG_ZEROCOPY would require a different API that manages buffer ownership. Consider implementing this as a separate function rather than modifying the existing one, so clients can opt into the more complex but potentially more efficient approach when needed.
🌐 Web query:
What are the performance benefits of MSG_ZEROCOPY in Linux socket operations?
💡 Result:
The MSG_ZEROCOPY feature in Linux socket operations provides significant performance benefits by eliminating data copies between user and kernel space, particularly for high-throughput applications. Here's a detailed breakdown of its advantages and considerations:
Throughput Improvements
- Large packet performance: MSG_ZEROCOPY improves throughput to 39–68 million packets per second (Mpps) in benchmark tests, from 15–25 Mpps without zero-copy[1][7]. For TCP, this translates to ~2x higher throughput in optimized scenarios[1].
- Efficient bulk transfers: It achieves 92% reduction in process CPU cycles for microbenchmarks like netperf and 5–8% improvement in production workloads (e.g., CDN servers)[7][12].
CPU Efficiency
- Reduced copy overhead: Avoids per-byte copy costs by pinning user buffers directly to kernel network stacks[4][11].
- Lower context switching: Replaces syscall-heavy operations with asynchronous notifications via error queues[2][4].
- Scalability: At 10 Gbps line rate, MSG_ZEROCOPY reduces CPU utilization from 80% to 20% compared to traditional methods[15].
Payload Size Considerations
| Payload Size | Effectiveness | Notes |
|---|---|---|
| <10 KB | Limited | Page-pinning overhead outweighs benefits[2][4] |
| ≥10 KB | Optimal | Benchmarks show 30–50% latency reduction[4][13] |
| 8 KB (SPDK) | Beneficial | SPDK reports gains even below 10 KB for NVMe/TCP[13] |
Implementation Requirements
- Socket setup: `setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &enable, sizeof(enable));`
- Transmission: `send(fd, buffer, len, MSG_ZEROCOPY);`
- Notification handling (via error queue): `recvmsg(fd, &msg, MSG_ERRQUEUE);`
Caveats and Limitations
- No loopback optimization: Local socket transfers still use copies[4][7].
- Notification latency: Requires explicit polling of error queues[2][4].
- Protocol limitations: Works best with TCP/UDP; headers are still copied for small packets[1][4].
- Memory management: Pinned pages increase memory pressure for long-lived connections[4][12].
Real-World Impact
- Content Delivery Networks (CDNs): 5–8% overall performance boost[7].
- NVMe/TCP storage: Enables line-rate 100 GbE performance with 8K blocks[13].
- Machine learning pipelines: Reduces data prep overhead in distributed training[7].
MSG_ZEROCOPY is particularly effective for applications sending large payloads over high-speed networks, though proper buffer management and notification handling are critical. For sub-10KB transfers or local communication, traditional methods may remain preferable[1][4][11].
Citations:
- 1: https://blogs.oracle.com/linux/post/zero-copy-networking-in-uek6
- 2: https://patchwork.ozlabs.org/patch/808430/
- 3: https://www.sobyte.net/post/2022-03/golang-zero-copy/
- 4: https://www.kernel.org/doc/html/latest/networking/msg_zerocopy.html
- 5: https://gist.github.com/congwang/c1dafe8be4a756fc1a22b53a3d42c716
- 6: https://developer.ibm.com/articles/j-zerocopy/
- 7: https://netdevconf.info/2.1/papers/debruijn-msgzerocopy-talk.pdf
- 8: https://github.com/torvalds/linux/blob/master/tools/testing/selftests/net/msg_zerocopy.c
- 9: https://dl.acm.org/doi/pdf/10.1109/SCW63240.2024.00111
- 10: https://www.youtube.com/watch?v=9F8I9vYQydU
- 11: https://fasterdata.es.net/host-tuning/linux/app-programming/
- 12: https://netdevconf.info/0x18/docs/netdev-0x18-paper14-talk-paper.pdf
- 13: https://ci.spdk.io/download/2020-virtual-forum-us/02_SPDK_NVMe_over_TCP_Ben_Seth.pdf
- 14: https://blogoverflow.hashnode.dev/unleashing-the-power-of-zero-copy-networking-revolutionizing-data-transfer-efficiency
- 15: https://lwn.net/Articles/730010/
Action Required: Refactor MSG_ZEROCOPY Implementation
The current TODO comment hints at the need for proper buffer ownership when exploring MSG_ZEROCOPY. Given that MSG_ZEROCOPY can offer significant throughput improvements and lower CPU usage in high-load scenarios—but only when buffer management is correctly handled—the added complexity should not be mixed into the existing API.
- Separate Function: Implement the MSG_ZEROCOPY logic as a discrete function. This isolation will allow clients to opt in to the more complex, high-performance path without disrupting the standard behavior.
- Buffer Ownership & Notification Handling: Ensure that this new API carefully manages buffer lifetimes and handles error notifications (i.e., using the kernel error queue) to prevent premature buffer deletion.
Originally posted by @coderabbitai[bot] in https://github.com/Eeems-Org/oxide/pull/397#discussion_r1996603928