
Implement MSG_ZEROCOPY for blight messages

Open · Eeems opened this issue 9 months ago · 1 comment

@Eeems Here's a prototype implementation of `send_blocking` that uses MSG_ZEROCOPY:
// Helper struct to track zero-copy buffers
struct zerocopy_buffer {
    blight_data_t data;
    size_t size;
    int refcount;
    // Could add a callback for when the buffer is fully released
};

// Global buffer tracking (in a real implementation, this would be more sophisticated)
#include <cerrno>
#include <cstring>
#include <mutex>
#include <unordered_map>
#include <sys/socket.h>
#include <netinet/in.h>      // SOL_IP, IP_RECVERR
#include <linux/errqueue.h>  // struct sock_extended_err, SO_EE_ORIGIN_ZEROCOPY
static std::unordered_map<blight_data_t, zerocopy_buffer*> zerocopy_buffers;
static std::mutex zerocopy_mutex;

// Forward declaration; defined below, but used in send_blocking_zerocopy
void release_zerocopy_buffer(zerocopy_buffer* buffer);

// Register a buffer for zero-copy transmission
zerocopy_buffer* register_zerocopy_buffer(blight_data_t data, size_t size) {
    std::lock_guard<std::mutex> lock(zerocopy_mutex);
    auto buffer = new zerocopy_buffer{data, size, 1};
    zerocopy_buffers[data] = buffer;
    return buffer;
}

// Function signature changed to return the zerocopy_buffer for the caller to track.
// Note: the socket must have SO_ZEROCOPY enabled via setsockopt() beforehand,
// otherwise send() with MSG_ZEROCOPY fails with ENOBUFS.
zerocopy_buffer* send_blocking_zerocopy(int fd, const blight_data_t data, ssize_t size) {
    // Register the buffer for zero-copy transmission
    zerocopy_buffer* buffer = register_zerocopy_buffer(const_cast<blight_data_t>(data), size);
    
    ssize_t sent = 0;
    ssize_t res = 0;
    
    while (sent < size) {
        if (!wait_for_send(fd)) {
            if (errno == EAGAIN || errno == EINTR) {
                short_pause();
                continue;
            }
            // Clean up on failure
            release_zerocopy_buffer(buffer);
            return nullptr;
        }
        
        // Use MSG_ZEROCOPY flag for zero-copy transmission
        res = ::send(fd, &data[sent], size - sent, MSG_NOSIGNAL | MSG_ZEROCOPY);
        
        if (res > 0) {
            sent += res;
            continue;
        }
        
        if (res == 0) {
            // Connection closed
            errno = ECONNRESET;
            release_zerocopy_buffer(buffer);
            return nullptr;
        }
        
        if (errno != EAGAIN && errno != EINTR) {
            // Unexpected error
            release_zerocopy_buffer(buffer);
            return nullptr;
        }
        
        // Temporary error, retry
        short_pause();
    }
    
    // Setup poll to monitor for completion events if needed
    return buffer;
}

// Process zero-copy completion notifications
bool process_zerocopy_completions(int fd) {
    struct sock_extended_err err;
    struct msghdr msg = {};
    struct iovec iov = {};  // zero-initialized: no payload is expected on the error queue
    
    char control[CMSG_SPACE(sizeof(struct sock_extended_err))];
    msg.msg_control = control;
    msg.msg_controllen = sizeof(control);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    
    ssize_t res = recvmsg(fd, &msg, MSG_ERRQUEUE);
    if (res == -1)
        return false;
    
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    // SOL_IP/IP_RECVERR is the IPv4 case; IPv6 delivers SOL_IPV6/IPV6_RECVERR
    if (!cmsg || cmsg->cmsg_level != SOL_IP || cmsg->cmsg_type != IP_RECVERR)
        return false;
    
    memcpy(&err, CMSG_DATA(cmsg), sizeof(err));
    if (err.ee_errno != 0 || err.ee_origin != SO_EE_ORIGIN_ZEROCOPY)
        return false;
    
    // Process completion notification.
    // err.ee_info and err.ee_data hold the inclusive range of completed
    // zero-copy sends, numbered by a per-socket counter that starts at 0.
    std::lock_guard<std::mutex> lock(zerocopy_mutex);
    // Update reference counts for completed buffers.
    // In a real implementation, we'd map each counter value in
    // [err.ee_info, err.ee_data] back to its buffer.
    
    return true;
}

// Release a zero-copy buffer (decrease refcount, free if refcount reaches 0)
void release_zerocopy_buffer(zerocopy_buffer* buffer) {
    if (!buffer)
        return;
        
    std::lock_guard<std::mutex> lock(zerocopy_mutex);
    if (--buffer->refcount <= 0) {
        zerocopy_buffers.erase(buffer->data);
        delete[] buffer->data;  // Only free when refcount reaches 0 (assumes the data was allocated with new[])
        delete buffer;
    }
}

This implementation:

  1. Creates a tracking system for zero-copy buffers
  2. Uses MSG_ZEROCOPY flag with send()
  3. Provides a mechanism to track when it's safe to free the buffer
  4. Includes a function to process completion notifications

Note that in a real implementation, you'd want to integrate this with your event loop to process the MSG_ERRQUEUE messages that signal when buffers can be released.


🧠 Learnings used
Learnt from: Eeems
PR: Eeems-Org/oxide#397
File: shared/libblight_protocol/socket.cpp:74-99
Timestamp: 2025-03-14T05:57:20.573Z
Learning: In shared/libblight_protocol/socket.cpp, the `recv_blocking` method is intentionally designed to block indefinitely, as it's only meant to be used in situations where data is guaranteed to be available due to a previous header being received. The non-blocking alternative with timeouts and retry limits is implemented in the `recv` method in the same file.

Originally posted by @coderabbitai[bot] in https://github.com/Eeems-Org/oxide/pull/397#discussion_r1996601190

Eeems · Mar 15 '25 05:03

_💡 Verification agent_
🧩 Analysis chain

Consider implementing the MSG_ZEROCOPY TODO as a separate function.

Based on previous review discussions, implementing MSG_ZEROCOPY would require a different API that manages buffer ownership. Consider implementing this as a separate function rather than modifying the existing one, so clients can opt into the more complex but potentially more efficient approach when needed.


🌐 Web query:

What are the performance benefits of MSG_ZEROCOPY in Linux socket operations?

💡 Result:

The MSG_ZEROCOPY feature in Linux socket operations provides significant performance benefits by eliminating data copies between user and kernel space, particularly for high-throughput applications. Here's a detailed breakdown of its advantages and considerations:


Throughput Improvements

  • Large packet performance: MSG_ZEROCOPY raises throughput to 39–68 million packets per second (Mpps) in benchmark tests, compared to 15–25 Mpps without zero-copy[1][7]. For TCP, this translates to ~2x higher throughput in optimized scenarios[1].
  • Efficient bulk transfers: It achieves a 92% reduction in process CPU cycles for microbenchmarks like netperf and a 5–8% improvement in production workloads (e.g., CDN servers)[7][12].

CPU Efficiency

  • Reduced copy overhead: Avoids per-byte copy costs by pinning user buffers directly to kernel network stacks[4][11].
  • Lower context switching: Replaces syscall-heavy operations with asynchronous notifications via error queues[2][4].
  • Scalability: At 10 Gbps line rate, MSG_ZEROCOPY reduces CPU utilization from 80% to 20% compared to traditional methods[15].

Payload Size Considerations

Payload Size   Effectiveness   Notes
<10 KB         Limited         Page-pinning overhead outweighs benefits[2][4]
≥10 KB         Optimal         Benchmarks show 30–50% latency reduction[4][13]
8 KB (SPDK)    Beneficial      SPDK reports gains even below 10 KB for NVMe/TCP[13]

Implementation Requirements

  1. Socket setup:
    setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &enable, sizeof(enable));
    
  2. Transmission:
    send(fd, buffer, len, MSG_ZEROCOPY);
    
  3. Notification handling (via error queue):
    recvmsg(fd, &msg, MSG_ERRQUEUE);
    

Caveats and Limitations

  • No loopback optimization: Local socket transfers still use copies[4][7].
  • Notification latency: Requires explicit polling of error queues[2][4].
  • Protocol limitations: Works best with TCP/UDP; headers are still copied for small packets[1][4].
  • Memory management: Pinned pages increase memory pressure for long-lived connections[4][12].

Real-World Impact

  • Content Delivery Networks (CDNs): 5–8% overall performance boost[7].
  • NVMe/TCP storage: Enables line-rate 100 GbE performance with 8K blocks[13].
  • Machine learning pipelines: Reduces data prep overhead in distributed training[7].

MSG_ZEROCOPY is particularly effective for applications sending large payloads over high-speed networks, though proper buffer management and notification handling are critical. For sub-10KB transfers or local communication, traditional methods may remain preferable[1][4][11].

Citations:

  • 1: https://blogs.oracle.com/linux/post/zero-copy-networking-in-uek6
  • 2: https://patchwork.ozlabs.org/patch/808430/
  • 3: https://www.sobyte.net/post/2022-03/golang-zero-copy/
  • 4: https://www.kernel.org/doc/html/latest/networking/msg_zerocopy.html
  • 5: https://gist.github.com/congwang/c1dafe8be4a756fc1a22b53a3d42c716
  • 6: https://developer.ibm.com/articles/j-zerocopy/
  • 7: https://netdevconf.info/2.1/papers/debruijn-msgzerocopy-talk.pdf
  • 8: https://github.com/torvalds/linux/blob/master/tools/testing/selftests/net/msg_zerocopy.c
  • 9: https://dl.acm.org/doi/pdf/10.1109/SCW63240.2024.00111
  • 10: https://www.youtube.com/watch?v=9F8I9vYQydU
  • 11: https://fasterdata.es.net/host-tuning/linux/app-programming/
  • 12: https://netdevconf.info/0x18/docs/netdev-0x18-paper14-talk-paper.pdf
  • 13: https://ci.spdk.io/download/2020-virtual-forum-us/02_SPDK_NVMe_over_TCP_Ben_Seth.pdf
  • 14: https://blogoverflow.hashnode.dev/unleashing-the-power-of-zero-copy-networking-revolutionizing-data-transfer-efficiency
  • 15: https://lwn.net/Articles/730010/

Action Required: Refactor MSG_ZEROCOPY Implementation

The current TODO comment hints at the need for proper buffer ownership when exploring MSG_ZEROCOPY. Given that MSG_ZEROCOPY can offer significant throughput improvements and lower CPU usage in high-load scenarios—but only when buffer management is correctly handled—the added complexity should not be mixed into the existing API.

  • Separate Function: Implement the MSG_ZEROCOPY logic as a discrete function. This isolation will allow clients to opt in to the more complex, high-performance path without disrupting the standard behavior.
  • Buffer Ownership & Notification Handling: Ensure that this new API carefully manages buffer lifetimes and handles error notifications (i.e., using the kernel error queue) to prevent premature buffer deletion.

Originally posted by @coderabbitai[bot] in https://github.com/Eeems-Org/oxide/pull/397#discussion_r1996603928

Eeems · Mar 15 '25 05:03