
[netkvm] Continuous transmission of small TCP packets can produce an excessive number of scatter-gather fragments per packet, ultimately leading to packet loss in the network card.

Open zjmletang opened this issue 11 months ago • 14 comments

Describe the bug: When sending small TCP packets, Windows enables the Nagle algorithm by default, so the protocol stack coalesces the small packets into larger ones before handing them to the driver. By the time such a packet reaches the driver layer, the scatter-gather list obtained through NdisMAllocateNetBufferSGList for a single NET_BUFFER contains many fragments, but the network card has a limit (for example, a maximum of 64 fragments), so part of the packet's data is lost.
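For context, a minimal sketch (not the NetKVM source) of where a miniport sees the fragment count in question: NDIS hands the DMA mapping built by NdisMAllocateNetBufferSGList to the MiniportProcessSGList callback, and the SCATTER_GATHER_LIST it receives can hold more elements than the hardware accepts. HW_MAX_SG_FRAGMENTS below is a hypothetical device limit.

VOID MiniportProcessSGList(
    PDEVICE_OBJECT DeviceObject,
    PVOID Reserved,
    PSCATTER_GATHER_LIST SgList,
    PVOID MiniportProcessSGListContext)
{
    UNREFERENCED_PARAMETER(DeviceObject);
    UNREFERENCED_PARAMETER(Reserved);
    UNREFERENCED_PARAMETER(MiniportProcessSGListContext);

    const ULONG HW_MAX_SG_FRAGMENTS = 64;   // hypothetical per-packet limit of the NIC

    if (SgList->NumberOfElements > HW_MAX_SG_FRAGMENTS)
    {
        // The situation described above: the coalesced NET_BUFFER maps to more
        // scatter-gather elements than the device accepts, so the driver must
        // linearize (copy) the data or the tail of the packet is lost.
    }
    // ... normal per-element processing of SgList->Elements[] ...
}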

To Reproduce: Steps to reproduce the behavior.

Python code, for example:

while True:
    random1 = random.randint(0, 99)
    random2 = random.randint(0, 9999)
    message = "{},{},{}".format(random1, '12233333', random2).encode('utf-8')
    sock.sendall(message)
sock.close()

Expected behavior: I think a fragmentation limit should be enforced at the driver level, such as 64. For a NET_BUFFER with more than 64 fragments, the scattered fragments should be reassembled into one page through a copy operation.

MSDN says: "Miniport drivers can optimize the transmission of small or highly fragmented packets by copying them to a preallocated buffer with a known physical address. This approach avoids mapping that is not required and therefore improves system performance."

https://learn.microsoft.com/en-us/windows-hardware/drivers/network/ndis-scatter-gather-dma
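A hedged sketch of the copy fallback the MSDN page describes, assuming the driver owns a preallocated, physically contiguous bounce buffer; CopyNetBufferToBounceBuffer and its parameters are illustrative names, not existing NetKVM code:

static ULONG CopyNetBufferToBounceBuffer(PNET_BUFFER NetBuffer, PUCHAR BounceVa, ULONG BounceSize)
{
    ULONG bytesWanted = NET_BUFFER_DATA_LENGTH(NetBuffer);
    ULONG mdlOffset   = NET_BUFFER_CURRENT_MDL_OFFSET(NetBuffer);
    PMDL  mdl         = NET_BUFFER_CURRENT_MDL(NetBuffer);
    ULONG copied      = 0;

    if (bytesWanted > BounceSize)
    {
        return 0;   // the packet does not fit the bounce buffer; the caller must fail or split it
    }

    // Walk the MDL chain and linearize the payload into one contiguous buffer,
    // so the device sees a single fragment with a known physical address.
    while (mdl != NULL && copied < bytesWanted)
    {
        PUCHAR src = (PUCHAR)MmGetSystemAddressForMdlSafe(mdl, NormalPagePriority);
        ULONG  len = MmGetMdlByteCount(mdl) - mdlOffset;

        if (src == NULL)
        {
            return 0;   // mapping failed; treat as a copy failure
        }
        if (len > bytesWanted - copied)
        {
            len = bytesWanted - copied;
        }
        RtlCopyMemory(BounceVa + copied, src + mdlOffset, len);
        copied   += len;
        mdlOffset = 0;      // only the first MDL has a non-zero starting offset
        mdl       = mdl->Next;
    }
    return copied;
}

After a successful copy, the transmit path would point the virtio descriptor at the bounce buffer's physical address instead of the original fragment list.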

Screenshots:

Host: (screenshot attached in the original issue)

VM: (screenshot attached in the original issue)

  • Windows version: Windows Server 2019
  • Which driver has a problem: NetKVM
  • Driver version or commit hash that was used to build the driver: any version after 2020.07.14


zjmletang avatar Jul 18 '23 08:07 zjmletang

Hi @zjmletang ,

Thank you for opening the issue.

Some questions and comments:

  1. Are you using a real HW virtio-net implementation? Please help us understand the motivation for adding this feature.
  2. Copying is not always more efficient; in fact, for larger packets we definitely want to avoid it.
  3. In any case, the decision to copy would be based on the virtio queue size, not just 64 SG fragments (a rough sketch of this idea follows the list).
  4. Also, it might be more than one page; for jumbo frames we need more.
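Purely as an illustration of point 3 (the names and parameters are assumptions, not NetKVM code), the per-packet fragment budget would be derived from the negotiated virtqueue size and any device-reported limit rather than a hard-coded 64:

static ULONG CalculateMaxFragmentsPerPacket(ULONG VirtQueueSize, ULONG DeviceReportedLimit /* 0 = none */)
{
    // Without indirect descriptors a packet cannot use more descriptors than the
    // queue holds; a device-specific limit, if reported, may be smaller still.
    ULONG limit = VirtQueueSize;
    if (DeviceReportedLimit != 0 && DeviceReportedLimit < limit)
    {
        limit = DeviceReportedLimit;
    }
    return limit;
}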

@ybendito Any thoughts?

Best regards, Yan.

YanVugenfirer avatar Jul 18 '23 10:07 YanVugenfirer

  1. Yes, we use a HW virtio-net implementation. There is a limitation on the number of physical pages per descriptor in our hardware.
  2. I agree.
  3. I don't understand why it is related to the virtio queue size. Could you please help explain?
  4. Yes, actually it may not necessarily be one page. What I meant was to limit the number of page fragments when converting a NET_BUFFER into descriptors, as long as we can achieve the desired effect.

zjmletang avatar Jul 18 '23 11:07 zjmletang

  1. Do you mean "packet," no? The descriptor will point to one SG fragment (if not using indirect descriptors). Or do you mean you have a problem when the indirect feature is on?
  2. Without the indirect feature, one descriptor will point to each fragment. I thought the problem was processing a certain number of descriptors because of the limited queue size. If the limitation is the number of fragments per packet, I suggest adjusting the virtio spec so the guest will be aware of the limitation.

YanVugenfirer avatar Jul 18 '23 11:07 YanVugenfirer

  1. I mean I have a problem when the indirect feature is on. One descriptor has too many fragments, so it exceeds the hardware limitation.

  2. Yes, the limitation is the number of fragments per packet (as in the code below). Before the protocol changes, would you consider adding a fixed hardware limitation, such as 64 or 256?

In this function: SubmitTxPacketResult CTXDescriptor::Enqueue(CTXVirtQueue *Queue, ULONG TotalDescriptors, ULONG FreeDescriptors)

if (0 <= Queue->AddBuf(m_VirtioSGL, m_CurrVirtioSGLEntry, 0, this,
                       m_IndirectArea.GetVA(), m_IndirectArea.GetPA().QuadPart))
{
    return SUBMIT_SUCCESS;
}

m_VirtioSGL has too many fragments.
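As a sketch only, the proposed guard could sit in front of the AddBuf() call shown above; MAX_HW_SG_FRAGMENTS, CopyToSingleBuffer() and SUBMIT_PACKET_TOO_LARGE are hypothetical names, not existing NetKVM identifiers:

// Hypothetical guard before submitting the indirect descriptor list.
if (m_CurrVirtioSGLEntry > MAX_HW_SG_FRAGMENTS)
{
    // Linearize the payload into a preallocated buffer so the SG table shrinks
    // to a fragment count the device accepts; fail cleanly if that is impossible.
    if (!CopyToSingleBuffer())
    {
        return SUBMIT_PACKET_TOO_LARGE;
    }
}
// ... the existing AddBuf() call then follows unchanged ...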

zjmletang avatar Jul 18 '23 11:07 zjmletang

  1. According to the spec: "The device limits the number of descriptors in a list through a transport-specific and/or device-specific value. If not limited, the maximum number of descriptors in a list is the virt queue size."
  2. I prefer to adjust the specification first. For general usage, limiting the indirect descriptors would cause a performance hit; that means we would have to add a non-standard and non-default feature.

YanVugenfirer avatar Jul 18 '23 12:07 YanVugenfirer

OK, thank you very much!

zjmletang avatar Jul 18 '23 12:07 zjmletang

@YanVugenfirer it is not a trivial change, but I think it can be done with a subsystem/subvendor ID in the device (exactly as the spec allows) and with the driver taking this ID (and possibly the revision) into account. We need to check, of course, that the Linux driver will also work correctly with this.
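To illustrate the mechanism only (not an agreed design), the driver could read the PCI subsystem vendor/device IDs from configuration space and enable a fragment cap solely for the matching hardware; the placeholder IDs and the 64-fragment value are assumptions:

static ULONG QueryVendorFragmentLimit(NDIS_HANDLE MiniportAdapterHandle)
{
    USHORT subsysVendorId = 0;
    USHORT subsysDeviceId = 0;

    // Standard PCI config space: subsystem vendor ID at offset 0x2C, subsystem ID at 0x2E.
    NdisMGetBusData(MiniportAdapterHandle, PCI_WHICH_SPACE_CONFIG, 0x2C, &subsysVendorId, sizeof(subsysVendorId));
    NdisMGetBusData(MiniportAdapterHandle, PCI_WHICH_SPACE_CONFIG, 0x2E, &subsysDeviceId, sizeof(subsysDeviceId));

    if (subsysVendorId == 0x1234 && subsysDeviceId == 0x0001)   // placeholder IDs for the limited HW variant
    {
        return 64;   // this variant accepts at most 64 fragments per descriptor list
    }
    return 0;        // 0 = no extra limit; keep the default behavior
}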

ybendito avatar Jul 18 '23 12:07 ybendito

@YanVugenfirer and another option (probably requiring additional integration) is a vendor-specific configuration in the configuration chain, i.e. a little more complicated but more flexible.

ybendito avatar Jul 18 '23 13:07 ybendito

@zjmletang Do you want to discuss our possible access to the HW over email? My email is yan (at) daynix.com

YanVugenfirer avatar Jul 20 '23 08:07 YanVugenfirer

@YanVugenfirer, of course, I am very willing. My email is [email protected]

zjmletang avatar Jul 20 '23 08:07 zjmletang

Hi @zjmletang and @YanVugenfirer ,

I am reproducing this issue and would like to show you the detailed steps. If there are any errors, please let me know. Thanks : )

Firstly, I would like to determine whether the problem exists between two virtual machines or between a virtual machine and a physical machine. Currently, my understanding is that a virtual machine runs this program as a client, and the server is a physical machine that only receives data. Is this correct?

The server code:

import socket

HOST = '<server_ip>'
PORT = 8001

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.bind((HOST, PORT))
    s.listen()
    while True:
        conn, addr = s.accept()
        with conn:
            print('Connected by', addr)
            while True:
                data = conn.recv(1024)
                if not data:
                    break
                print(data)
                conn.sendall(data)

The code in the VM:

import random
import socket

while True:
    random1 = random.randint(0, 99)
    random2 = random.randint(0, 9999)
    message = "{},{},{}".format(random1, '12233333', random2).encode('utf-8')
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.connect(('<server_ip>', 8001))
    sock.sendall(message)
    sock.close()

I have a question for @zjmletang: where should I look for the small TCP packets that end up with more than 64 fragments? Is it possible that they come from here: random1, '12233333', random2? Based on my testing, it seems that nothing is lost in either the VM or the local run.

(screenshot: test output)

The above is the local test.

heywji avatar Sep 08 '23 03:09 heywji

@heywji Did you try to reproduce the problem using a virtio-net adapter on QEMU with a Windows guest?

ybendito avatar Dec 04 '23 19:12 ybendito

Hi @ybendito, the QEMU version is qemu-kvm-8.0.0-7.el9.x86_64 and the virtio-win version is virtio-win-1.9.35-0.el9.iso.

Here is the virtio-net adapter part of my command line:

/usr/libexec/qemu-kvm -m 17408 \
-device '{"id": "pcie-root-port-3", "port": 3, "driver": "pcie-root-port", "addr": "0x1.0x3", "bus": "pcie.0", "chassis": 4}' \
-device '{"driver": "virtio-net-pci", "mac": "9a:44:d9:4f:bb:cf", "id": "idUyryQr", "netdev": "idxarPu0", "bus": "pcie-root-port-3", "addr": "0x0"}' \
-netdev  '{"id": "idxarPu0", "type": "tap", "vhost": true, "vhostfd": "16", "fd": "12"}' \

heywji avatar Dec 05 '23 08:12 heywji

Updates from RH on this: https://issues.redhat.com/browse/RHEL-18187

heywji avatar Dec 06 '23 05:12 heywji