
1-5 Mbit/s upload throughput in container but GCP H100 host gives >2000 Mbit/s

Open thundergolfer opened this issue 1 year ago • 13 comments

Description

Inside gVisor (runsc) we're seeing extremely low upload performance, specifically on GCP H100 instances. We don't have these issues on GCP A100 instances.

I have attached pcap data below in place of runsc debug logs. Let me know of any other info I should gather 🙂.

runsc

Retrieving speedtest.net configuration...
Testing from Google Cloud (35.221.7.106)...
Retrieving speedtest.net server list...
Selecting best server based on ping...
Hosted by Cox - Nova (Fairfax, VA) [23.95 km]: 2.637 ms
Skipping download test
Testing upload speed......................................................................................................
Upload: 1.58 Mbit/s

host

[modal@gcp-h100-us-east4-a-0-c965c22f-6d1f-416d-b245-395141187d95 ~]$ ./speedtest-cli
Retrieving speedtest.net configuration...
Testing from Google Cloud (35.221.7.106)...
Retrieving speedtest.net server list...
Selecting best server based on ping...
Hosted by Pilot Fiber (Ashburn, VA) [42.41 km]: 2.762 ms
Testing download speed................................................................................
Download: 3275.77 Mbit/s
Testing upload speed......................................................................................................
Upload: 2456.03 Mbit/s

runc

Retrieving speedtest.net configuration...
Testing from Google Cloud (35.221.7.106)...
Retrieving speedtest.net server list...
Selecting best server based on ping...
Hosted by Cox - Nova (Fairfax, VA) [23.95 km]: 2.467 ms
Skipping download test
Testing upload speed......................................................................................................
Upload: 808.07 Mbit/s

Steps to reproduce

I unfortunately don't have much of a chance of getting a devbox with an H100 on it. But we're just doing this:

curl -Lo speedtest-cli https://raw.githubusercontent.com/sivel/speedtest-cli/master/speedtest.py
chmod +x speedtest-cli
./speedtest-cli

We see similar upload performance problems when uploading to Cloudflare R2.
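(Inside the sandbox this can be run directly with runsc do, e.g.: sudo ./runsc do ./speedtest-cli --secure. Binary paths are specific to our setup.)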

runsc version

version release-20230717.0-12-g0244c8c19fb7
spec: 1.1.0-rc.1

docker version (if using docker)

No response

uname

Linux gcp-h100-us-east4-a-0-c965c22f-6d1f-416d-b245-395141187d95 5.15.0-205.149.5.1.el9uek.x86_64 #2 SMP Fri Apr 5 11:29:36 PDT 2024 x86_64 x86_64 x86_64 GNU/Linux

kubectl (if using Kubernetes)

No response

repo state (if built from source)

No response

runsc debug logs (if available)

output.pcap.zip (gvisor), output_runc.pcap.zip (runc)

thundergolfer avatar May 01 '24 02:05 thundergolfer

Thanks for the report. Would it be possible to get equivalent pcaps for runc/runsc on the A100 where you don't see the issue?

manninglucas avatar May 01 '24 17:05 manninglucas

Also, just to be sure, could you confirm whether the A100 and H100 are running in the same region?

manninglucas avatar May 01 '24 18:05 manninglucas

Looking at the pcaps for both runsc and runc, it looks like every packet is repeated. Even the initial SYN shows up twice, and this isn't normal "TCP is trying again after a timeout" behavior -- there's only a 7µs gap between the copies. Any idea why? It really messes with Wireshark.

(Screenshot: Wireshark capture showing each packet appearing twice.)

kevinGC avatar May 02 '24 16:05 kevinGC

Yeh this is weird. I just did sudo tcpdump -i any -w output.pcap host $CONTAINER_IP on the host.

I'll capture from an A100 and check if the same weirdness is present.

thundergolfer avatar May 02 '24 17:05 thundergolfer

Oh, I see the issue: -i any is getting the packet on two interfaces. You can see it switch between two Interface index values in wireshark. I'm assuming this is capturing the packet once on the host NIC and once on the virtual ethernet device that runsc is using.

You can target a specific interface. Alternatively, the Wireshark logs can be filtered via sll.ifindex == <index>, but that's messier and more cumbersome.
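For example, to capture only on the host NIC (interface name per ip link show; eth0 here is an assumption):

sudo tcpdump -i eth0 -w output.pcap host $CONTAINER_IP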

kevinGC avatar May 02 '24 17:05 kevinGC

Ok, I have some cleaner captures. Details:

H100 Details

  • Upload inside container: 1-5 Mbit/s
  • Instance type: a3-highgpu-8g
  • Zone: us-east4-a
  • Instance ID (GCP): 7905391184063231222

A100 details

  • Upload inside container: 220 Mbit/s - 650 Mbit/s
  • Instance Type: a2-megagpu-16g
  • Zone: us-central1-c
  • Instance ID (GCP): 7944836249609026522

I'm a Wireshark novice, but the H100 dump (left) is full of duplicate ACKs and packet retransmissions, whereas the A100 dump (right) is clean.

(Screenshot: side-by-side Wireshark captures; H100 on the left with duplicate ACKs and retransmissions, A100 on the right clean.)

thundergolfer avatar May 07 '24 19:05 thundergolfer

Thanks for the extra logs, we're still investigating on our end. Could you send what you get from running ip link show on the H100 and A100? Believe it or not we have trouble getting access to these types of machines even for our own testing.

manninglucas avatar May 08 '24 21:05 manninglucas

Believe it or not we have trouble getting access to these types of machines even for our own testing.

😅 jeez, it's rough out there.

H100

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 42:01:ac:1e:50:2f brd ff:ff:ff:ff:ff:ff
    altname enp0s12
3: modalsvc0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 82:5f:c2:c0:ba:fb brd ff:ff:ff:ff:ff:ff
4: modal59: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 02:c9:6a:77:d2:5a brd ff:ff:ff:ff:ff:ff
6: modal57: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 46:a1:b0:76:32:e2 brd ff:ff:ff:ff:ff:ff
8: modal9: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 7a:57:61:ca:cd:71 brd ff:ff:ff:ff:ff:ff
10: modal45: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 56:57:5a:b7:a2:bf brd ff:ff:ff:ff:ff:ff
12: modal47: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether b6:16:6e:a0:79:87 brd ff:ff:ff:ff:ff:ff
14: modal16: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether e6:34:96:7c:b1:da brd ff:ff:ff:ff:ff:ff
17: modal34: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 06:3d:b7:c2:ee:40 brd ff:ff:ff:ff:ff:ff
20: modal40: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 9a:ec:c6:00:70:72 brd ff:ff:ff:ff:ff:ff
22: modal1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 82:62:e6:1c:ac:c4 brd ff:ff:ff:ff:ff:ff
24: modal35: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 46:95:a9:b1:c2:9c brd ff:ff:ff:ff:ff:ff
3097: veth07e2841d@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc noqueue master modal11 state UP mode DEFAULT group default
    link/ether 86:41:b8:7d:97:43 brd ff:ff:ff:ff:ff:ff link-netns wFJGhMiC7nz
26: modal4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether a2:c0:b5:38:3a:39 brd ff:ff:ff:ff:ff:ff
31: modal52: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 9a:ba:2f:48:ee:87 brd ff:ff:ff:ff:ff:ff
34: modal20: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether d2:63:ee:c5:1a:bc brd ff:ff:ff:ff:ff:ff
291: modal49: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 66:d3:7d:55:a2:ac brd ff:ff:ff:ff:ff:ff
36: modal54: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 32:ad:6a:af:9b:04 brd ff:ff:ff:ff:ff:ff
38: modal53: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 7a:b5:86:bf:59:31 brd ff:ff:ff:ff:ff:ff
40: modal27: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether e6:c2:34:dc:3c:e3 brd ff:ff:ff:ff:ff:ff
42: modal55: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether b2:eb:9e:ff:02:85 brd ff:ff:ff:ff:ff:ff
44: modal60: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 92:05:ba:e9:53:03 brd ff:ff:ff:ff:ff:ff
301: modal36: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether f6:6f:d3:9e:21:b8 brd ff:ff:ff:ff:ff:ff
46: modal28: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 4a:77:35:9c:fc:3f brd ff:ff:ff:ff:ff:ff
49: modal51: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 5e:b0:16:43:cb:45 brd ff:ff:ff:ff:ff:ff
52: modal23: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 6a:31:ff:23:b4:ca brd ff:ff:ff:ff:ff:ff
54: modal46: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 86:3d:45:1d:48:99 brd ff:ff:ff:ff:ff:ff
56: modal41: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 6e:47:98:07:b8:99 brd ff:ff:ff:ff:ff:ff
58: modal25: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether ce:e2:e5:b3:b2:29 brd ff:ff:ff:ff:ff:ff
60: modal15: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 2a:76:1a:e1:6e:d2 brd ff:ff:ff:ff:ff:ff
64: modal30: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 16:f9:a3:52:03:fa brd ff:ff:ff:ff:ff:ff
66: modal48: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether ca:0d:c9:67:dd:67 brd ff:ff:ff:ff:ff:ff
324: modal32: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 0a:64:ed:d2:5c:24 brd ff:ff:ff:ff:ff:ff
68: modal62: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 7e:92:07:97:36:6c brd ff:ff:ff:ff:ff:ff
72: modal2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether da:3b:f8:97:bf:09 brd ff:ff:ff:ff:ff:ff
76: modal24: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 26:cb:20:2b:83:77 brd ff:ff:ff:ff:ff:ff
81: modal38: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 4e:c5:e7:7e:00:dc brd ff:ff:ff:ff:ff:ff
83: modal50: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 1a:ca:b6:59:2a:15 brd ff:ff:ff:ff:ff:ff
86: modal6: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 22:dc:66:3c:47:3c brd ff:ff:ff:ff:ff:ff
89: modal17: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether ee:7d:22:37:39:cd brd ff:ff:ff:ff:ff:ff
92: modal14: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether ba:d7:59:92:89:89 brd ff:ff:ff:ff:ff:ff
94: modal63: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 2a:8f:23:5a:97:fd brd ff:ff:ff:ff:ff:ff
100: modal3: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 02:37:3f:8c:32:d4 brd ff:ff:ff:ff:ff:ff
102: modal43: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 06:48:c5:9e:89:0f brd ff:ff:ff:ff:ff:ff
104: modal0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether ba:ab:65:70:03:a8 brd ff:ff:ff:ff:ff:ff
106: modal31: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 0a:94:08:e1:fe:6c brd ff:ff:ff:ff:ff:ff
111: modal61: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 9e:2a:d8:6d:27:c7 brd ff:ff:ff:ff:ff:ff
113: modal12: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether d6:4e:c3:ff:31:e0 brd ff:ff:ff:ff:ff:ff
115: modal19: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether ea:35:f9:9e:44:44 brd ff:ff:ff:ff:ff:ff
117: modal21: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 42:13:9d:f8:c0:68 brd ff:ff:ff:ff:ff:ff
120: modal58: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 16:f8:85:69:87:72 brd ff:ff:ff:ff:ff:ff
5243: veth31f0a7f6@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc noqueue master modal4 state UP mode DEFAULT group default
    link/ether 9e:52:78:fd:ba:69 brd ff:ff:ff:ff:ff:ff link-netns unwYFZtG8ao
135: modal26: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether f6:5c:f8:e9:d7:21 brd ff:ff:ff:ff:ff:ff
138: modal33: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 32:7a:39:40:2c:c7 brd ff:ff:ff:ff:ff:ff
140: modal22: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether de:6c:6c:64:7e:9b brd ff:ff:ff:ff:ff:ff
150: modal44: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 02:76:ba:b5:d9:b6 brd ff:ff:ff:ff:ff:ff
156: modal37: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 1a:ba:63:12:fd:c9 brd ff:ff:ff:ff:ff:ff
164: modal29: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 6a:2c:3c:c3:9f:8a brd ff:ff:ff:ff:ff:ff
168: modal42: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether ce:ff:6d:19:57:86 brd ff:ff:ff:ff:ff:ff
171: modal10: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether e2:6f:75:52:d0:4c brd ff:ff:ff:ff:ff:ff
176: modal39: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 06:9f:0d:e6:83:02 brd ff:ff:ff:ff:ff:ff
180: modal18: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 5e:5e:9e:ad:88:7b brd ff:ff:ff:ff:ff:ff
186: modal7: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 26:25:25:9c:15:a5 brd ff:ff:ff:ff:ff:ff
194: modal56: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether e2:2b:1b:3a:1a:e1 brd ff:ff:ff:ff:ff:ff
201: modal11: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether a2:8c:b9:9b:ff:62 brd ff:ff:ff:ff:ff:ff
214: modal5: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 82:41:43:13:b2:e7 brd ff:ff:ff:ff:ff:ff
223: modal8: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 72:c0:7b:4f:1f:b5 brd ff:ff:ff:ff:ff:ff
239: modal13: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 32:a5:8d:fa:2e:2d brd ff:ff:ff:ff:ff:ff

A100

[modal@gcp-a100-80gb-spot-europe-west4-a-0-70db3533-efb2-4ff1-86e2-ed9 ~]$ ip link show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 42:01:ac:1e:70:09 brd ff:ff:ff:ff:ff:ff
    altname enp0s9
    altname ens9
3: modalsvc0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 66:65:81:26:53:f5 brd ff:ff:ff:ff:ff:ff
4: modal35: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether aa:1d:c0:a3:66:9f brd ff:ff:ff:ff:ff:ff
6: modal22: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether b6:02:95:74:73:a4 brd ff:ff:ff:ff:ff:ff
8: modal2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 26:f4:6d:e1:80:df brd ff:ff:ff:ff:ff:ff
10: modal62: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether e6:51:9f:ec:ef:78 brd ff:ff:ff:ff:ff:ff
12: modal60: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 5a:97:2c:03:c4:fc brd ff:ff:ff:ff:ff:ff
14: modal57: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 06:fe:bb:61:f8:f1 brd ff:ff:ff:ff:ff:ff
16: modal17: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether f2:80:dc:e6:43:29 brd ff:ff:ff:ff:ff:ff
17: vetha0b77138@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc noqueue master modal17 state UP mode DEFAULT group default
    link/ether e6:30:bf:c5:4d:10 brd ff:ff:ff:ff:ff:ff link-netns xuZY6HwlXOW
18: modal29: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 26:53:e6:c8:97:3e brd ff:ff:ff:ff:ff:ff
20: modal41: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether fa:db:f4:cc:41:44 brd ff:ff:ff:ff:ff:ff
22: modal23: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether da:da:fc:8b:2e:b7 brd ff:ff:ff:ff:ff:ff
24: modal40: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether ce:57:b5:d9:49:f1 brd ff:ff:ff:ff:ff:ff
26: modal53: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether ca:ea:c1:e0:0f:9e brd ff:ff:ff:ff:ff:ff
28: modal0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 2e:51:ad:58:b0:16 brd ff:ff:ff:ff:ff:ff
30: modal14: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether fa:4d:10:f3:bc:e9 brd ff:ff:ff:ff:ff:ff
34: modal42: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 82:9a:5c:fc:39:c3 brd ff:ff:ff:ff:ff:ff
36: modal10: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 12:3a:c8:b3:88:5d brd ff:ff:ff:ff:ff:ff
38: modal15: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 8e:38:02:02:1e:fa brd ff:ff:ff:ff:ff:ff
40: modal43: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 66:82:8a:4b:c9:4d brd ff:ff:ff:ff:ff:ff

thundergolfer avatar May 09 '24 03:05 thundergolfer

I have a theory as to what's going on, although there are a couple of open questions WRT the packet captures. Looking at the A100 (fast) capture, we see a healthy connection. Packets are sent that appear larger than the MTU, but that's because we're using GSO to defer segmenting until the last minute.

The H100 (slow) logs seem to retransmit all larger-than-MTU packets, indicating that we're not GSOing correctly. That makes sense: we're supplying 1500 as the MTU. This is what 94c10243701c6a5d884c0f5f106d65ad34e6729d addresses: we should be using the MTU of the device interface, not the container's.

Our loss detection (TCP RACK) appears to resend non-GSO'd segments, so when we send a too-large packet it can be seen getting retransmitted in smaller chunks. You can see an example of this right at the start of stream 20 (filter tcp.stream eq 20), where we send 4140 and 2760 B packets that take ~0.8s to get retransmitted in smaller segments. I'm not sure why we send the smaller RACK segments -- could be an implementation detail, could be part of the RFC.
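(Back-of-the-envelope, assuming plain IPv4 + TCP with 20 B of headers each: a 1460 B path MTU gives an MSS of 1460 - 40 = 1420 B, so that 4140 B GSO buffer should leave the NIC as ceil(4140 / 1420) = 3 wire segments and the 2760 B one as 2; sent as single frames, both are well over the 1460 B limit.)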

The two genuinely confusing things are: (1) why this differs between the two machines -- they both run 1500-byte-MTU containers on a 1460-byte-MTU NIC. I'll try to figure it out, but for now ¯\_(ツ)_/¯. And (2), why don't we see an ICMP fragmentation-needed packet in the logs?
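(For anyone double-checking the captures for (2): fragmentation-needed packets would match the Wireshark filter icmp.type == 3 && icmp.code == 4.)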

Anyways, I think 94c10243701c6a5d884c0f5f106d65ad34e6729d will fix this. @manninglucas: should we put logic similar to that in the default runsc boot process, where it uses the MTU of the default device iff there's an obvious default? #10419 should also help cover more cases where the PMTU causes problems.

kevinGC avatar May 09 '24 18:05 kevinGC

@thundergolfer In addition to testing at head with 94c10243701c6a5d884c0f5f106d65ad34e6729d, do you know whether you do anything to change the MTU inside runsc or the network namespace in which it runs?

kevinGC avatar May 10 '24 00:05 kevinGC

Noted also that H100 instances appear to always use gVNIC, while A100 can use gVNIC or virtio, defaulting to the latter I believe. Maybe part of the issue, but I'm not seeing issues when trying to repro on a gVNIC machine.
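One way to check which is in use on a given instance (from memory; output abbreviated):

$ ethtool -i eth0
driver: gve

gVNIC machines should report gve here, virtio ones virtio_net.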

kevinGC avatar May 10 '24 22:05 kevinGC

Thanks for the detailed comments @kevinGC! I think they mostly make sense to me, but I'll work through the details more carefully tomorrow while also testing out https://github.com/google/gvisor/commit/94c10243701c6a5d884c0f5f106d65ad34e6729d.

To answer your follow-up question: in our CNI bridge plugin configuration we set the MTU to 1460 on GCP workers, because our VPC has that set as its MTU (rough config sketch at the end of this comment). We first observed networking problems when running A3 instances on GCP; we'd had no trouble before. This is what we observed:

Containers on the GCP H100s have a hard time talking to the internet. They seem to manage to set up TCP connections, three way handshake succeeds, but then it seems like packets from the remote endpoint get lost and the flow stalls. For https the TLS handshake fails. - Dano from Modal

Noted also that H100 instances appear to always use gVNIC, while A100 can use gVNIC or virtio, defaulting to the latter I believe. Maybe part of the issue, but I'm not seeing issues when trying to repro on a gVNIC machine.

This very well could be why we didn't have an issue until using A3 instances.
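For reference, the MTU-relevant part of our bridge plugin config looks roughly like this (a sketch; the name, bridge, and IPAM values here are illustrative, the mtu field is the point):

{
  "cniVersion": "0.4.0",
  "name": "modal-net",
  "type": "bridge",
  "bridge": "modal0",
  "mtu": 1460,
  "ipam": { "type": "host-local", "subnet": "10.88.0.0/16" }
}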

thundergolfer avatar May 11 '24 03:05 thundergolfer

Testing result:

https://github.com/google/gvisor/commit/94c10243701c6a5d884c0f5f106d65ad34e6729d

[modal@gcp-h100-us-east4-a-0-120e6a37-350e-4d92-8b7c-507f678ee562 ~]$ ./runsc --version
runsc version release-20240506.0-13-g94c10243701c
spec: 1.1.0-rc.1
[modal@gcp-h100-us-east4-a-0-120e6a37-350e-4d92-8b7c-507f678ee562 ~]$ sudo ./runsc do ./speedtest-cli --secure
Retrieving speedtest.net configuration...
Testing from Google Cloud (34.48.63.7)...
Retrieving speedtest.net server list...
Selecting best server based on ping...
Hosted by PhoenixNAP Global IT Services (Ashburn, VA) [42.41 km]: 4.03 ms
Testing download speed................................................................................
Download: 2425.78 Mbit/s
Testing upload speed.....................................................................................................
Upload: 4.54 Mbit/s
[modal@gcp-h100-us-east4-a-0-120e6a37-350e-4d92-8b7c-507f678ee562 ~]$ sudo ./runsc do ip link show
2: ve-runsc-443872: <UP,LOWER_UP> mtu 1460
    link/ether 16:3d:af:7b:ad:1b brd ff:ff:ff:ff:ff:ff
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65522
    link/loopback 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff

Host view of the network device (on a different but equivalent do run):

ip link show | grep -A2 runsc
1381: vp-runsc-630021@if1382: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 2e:58:0e:b3:cf:f8 brd ff:ff:ff:ff:ff:ff link-netns runsc-630021

Status quo

[modal@gcp-h100-us-east4-a-0-120e6a37-350e-4d92-8b7c-507f678ee562 ~]$ ./production/runsc --version
runsc version 6e61813c1b37
spec: 1.1.0-rc.1
[modal@gcp-h100-us-east4-a-0-120e6a37-350e-4d92-8b7c-507f678ee562 ~]$ sudo ./production/runsc do ./speedtest-cli --secure
Retrieving speedtest.net configuration...
Testing from Google Cloud (34.48.63.7)...
Retrieving speedtest.net server list...
Selecting best server based on ping...
Hosted by PhoenixNAP Global IT Services (Ashburn, VA) [42.41 km]: 2.583 ms
Testing download speed................................................................................
Download: 0.00 Mbit/s
Testing upload speed......................................................................................................
Upload: 34.14 Mbit/s
[modal@gcp-h100-us-east4-a-0-120e6a37-350e-4d92-8b7c-507f678ee562 ~]$
[modal@gcp-h100-us-east4-a-0-120e6a37-350e-4d92-8b7c-507f678ee562 ~]$ sudo ./production/runsc do ip link show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65522
    link/loopback 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff
2: ve-runsc-154717: <UP,LOWER_UP> mtu 1500
    link/ether 2a:5c:07:9e:7d:e7 brd ff:ff:ff:ff:ff:ff

Host view of network device (on a different but equivalent do run):

[modal@gcp-h100-us-east4-a-0-120e6a37-350e-4d92-8b7c-507f678ee562 ~]$ ip link show | grep -A2 runsc
1383: vp-runsc-012936@if1384: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether ea:b9:d9:fe:06:a7 brd ff:ff:ff:ff:ff:ff link-netns runsc-012936

Unexpected results. So with the status quo runsc version, do has zero download throughput but some upload throughput. With the https://github.com/google/gvisor/commit/94c10243701c6a5d884c0f5f106d65ad34e6729d runsc version, the download is high but the upload is lower than the status quo?

Seems odd to me that the host shows the MTU is still 1500 even though, with https://github.com/google/gvisor/commit/94c10243701c6a5d884c0f5f106d65ad34e6729d, the container is now picking up the MTU of eth0.

thundergolfer avatar May 11 '24 17:05 thundergolfer

As a workaround: you can pass runsc a --gso=false flag that should get you close to native speeds. @manninglucas got us a test machine and -- with our particular setup -- the upload throughput goes from ~1.65 Mbps to 745 Mbps!

We have some ideas regarding the root cause -- the H100's NIC driver may be in some way different -- that we'll keep looking into for now.

kevinGC avatar May 15 '24 00:05 kevinGC

--gso=false does indeed improve upload!

[modal@gcp-h100-us-east4-a-0-a275c742-c07d-433e-bcc0-46bf967048d7 ~]$ sudo ./production/runsc -gso=false do ./speedtest-cli --secure
Retrieving speedtest.net configuration...
Testing from Google Cloud (34.86.32.183)...
Retrieving speedtest.net server list...
Selecting best server based on ping...
Hosted by Whitesky Communications LLC (Ashburn, VA) [42.41 km]: 2.482 ms
Testing download speed................................................................................
Download: 0.00 Mbit/s
Testing upload speed......................................................................................................
Upload: 326.61 Mbit/s

There's still the same 0.00 Mbit/s download with runsc do, but we can put that aside.

We have used --gso=false in the past and it degraded performance (https://github.com/google/gvisor/issues/9816#issuecomment-1885669327), but we can selectively enable it for H100s since it's better there 👍

thundergolfer avatar May 15 '24 18:05 thundergolfer

Glad that flag works. Not sure where that awful download stat comes from; I don't see it when I try to replicate at any commit. Will keep looking, especially if you're still seeing it after these patches.

kevinGC avatar May 15 '24 22:05 kevinGC

I spent a few hours trying to figure out what could be wrong with our GSO packets. At some point I started to suspect we were looking for a black cat in a dark room, so I decided to test the theory by running a Kata container and checking whether the issue is reproducible in that environment.

A Kata container is a virtual machine with a virtio network device. It injects GSO packets from the guest into the host Linux kernel much like gVisor does, though through a different kernel API. Inside a Kata VM a full Linux kernel is running, so it is completely unrelated to the gVisor netstack. It was not a surprise when I found that the same issue is triggered in Kata containers:

# uname -a
Linux 41339094ec18 6.1.62 #1 SMP Wed May 15 05:03:25 UTC 2024 x86_64 Linux
/ # lspci
00:00.0 Host bridge: Intel Corporation 82G33/G31/P35/P31 Express DRAM Controller
00:01.0 Communication controller: Red Hat, Inc. Virtio console
00:02.0 PCI bridge: Red Hat, Inc. QEMU PCI-PCI bridge
00:03.0 SCSI storage controller: Red Hat, Inc. Virtio SCSI
00:04.0 Unclassified device [00ff]: Red Hat, Inc. Virtio RNG
00:05.0 Communication controller: Red Hat, Inc. Virtio 1.0 socket (rev 01)
00:06.0 Mass storage controller: Red Hat, Inc. Virtio file system (rev 01)
00:1f.0 ISA bridge: Intel Corporation 82801IB (ICH9) LPC Interface Controller (rev 02)
00:1f.2 SATA controller: Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA Controller [AHCI mode] (rev 02)
00:1f.3 SMBus: Intel Corporation 82801I (ICH9 Family) SMBus Controller (rev 02)
01:01.0 Ethernet controller: Red Hat, Inc. Virtio network device
/ # python3 /tmp/speedtest-cli 
Retrieving speedtest.net configuration...
Testing from Google Cloud...
Retrieving speedtest.net server list...
Selecting best server based on ping...
Hosted by StarHub Ltd (Singapore) [5.78 km]: 2.561 ms
Testing download speed................................................................................
Download: 3019.78 Mbit/s
Testing upload speed......................................................................................................
Upload: 2.26 Mbit/s

In summary, I'm inclined to believe that this issue isn't tied to gVisor. More likely, it resides either within the Linux kernel itself or within the gVNIC device or its driver.

avagin avatar May 16 '24 14:05 avagin

To those interested, @avagin found the cause of this bug. It's a small issue with the GVE network driver that's used on some GCP hardware. The driver code lives in the Linux tree under drivers/net/ethernet/google/gve.

The driver drops a packet if the GSO type isn't exactly equal to SKB_GSO_TCPV4 or SKB_GSO_TCPV6. In the case of gVisor/Kata/any process that injects packets with virtio-net headers already set, the kernel actually marks these packets with an additional flag, SKB_GSO_DODGY. The packets then fail the check for SKB_GSO_TCPV4 because of this extra flag on the GSO type and get dropped. These drops only happen on H100s because those machines use a different kind of NIC that requires packets to be written in the DQO format rather than the default; in the default path there is no equivalent check for SKB_GSO_TCPV4.
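Roughly, the failure mode is an exact-match comparison where a bitmask test is needed -- paraphrased below as an illustration, not the literal driver source:

/* Illustrative only -- not the literal gve code. */
unsigned int gso_type = skb_shinfo(skb)->gso_type;

/* Problematic: exact equality, which fails once the kernel ORs in
 * SKB_GSO_DODGY for packets injected with virtio-net headers set. */
if (gso_type != SKB_GSO_TCPV4 && gso_type != SKB_GSO_TCPV6)
        goto drop;

/* Tolerant: test only the TCP GSO bits, ignoring extra flags. */
if (!(gso_type & (SKB_GSO_TCPV4 | SKB_GSO_TCPV6)))
        goto drop;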

We will try to expedite a fix in the GVE driver as best we can from our end. Filing a formal support ticket with GCP may help move the process along as well.

Closing this issue now as it is not a bug with gVisor.

manninglucas avatar May 21 '24 20:05 manninglucas

Nice one!

thundergolfer avatar May 22 '24 13:05 thundergolfer

Here is the kernel fix: https://lists.openwall.net/netdev/2024/06/06/352

avagin avatar Jun 06 '24 22:06 avagin