
Optimize the speed of concurrent get of pytorch models


Describe your problem

Currently, getting a PyTorch module at high concurrency is very slow, as shown below. The maximum network bandwidth of both test machines is 30 Gbps.

Vineyard

| Concurrency | Get time | Observed network bandwidth (dstat) |
|-------------|----------|------------------------------------|
| 1           | 2.57 s   | ≈2000 MB/s                         |
| 6           | 7.73 s   | ≈3800 MB/s                         |
| 13          | 14.58 s  | ≈3800 MB/s                         |
| 27          | 29.32 s  | ≈3800 MB/s                         |

Iperf

| Concurrency | Observed network bandwidth (dstat) | Total iperf bandwidth   |
|-------------|------------------------------------|-------------------------|
| 1           | ≈1470 MB/s                         | 12 Gbit/s (1500 MB/s)   |
| 6           | ≈3700 MB/s                         | 31.1 Gbit/s (3888 MB/s) |
| 13          | ≈3650 MB/s                         | 30.9 Gbit/s (3863 MB/s) |
| 27          | ≈3650 MB/s                         | 30.9 Gbit/s (3863 MB/s) |
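As a sanity check on the units, the iperf Gbit/s totals line up with the dstat byte rates if the parenthesized figures are read as MB/s (a quick sketch; the conversion factor is the only assumption here):

```python
def gbit_to_mb_per_s(gbit_per_s: float) -> float:
    """Convert a link rate in Gbit/s to MB/s (10^6 bytes per second)."""
    # 1 Gbit/s = 1e9 bits/s; divide by 8 bits/byte, then by 1e6 bytes/MB.
    return gbit_per_s * 1e9 / 8 / 1e6

# The iperf column above: 12 Gbit/s -> 1500 MB/s, 31.1 Gbit/s -> 3887.5 MB/s,
# which matches the ~3650-3700 MB/s dstat readings within measurement noise.
print(gbit_to_mb_per_s(12.0))   # 1500.0
print(gbit_to_mb_per_s(31.1))   # 3887.5
```

This also makes the bottleneck visible: vineyard plateaus at ≈3800 MB/s regardless of concurrency, which is essentially the single-NIC limit that iperf saturates at 6 streams.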

Solution

In practice, PyTorch models are usually loaded on machines with GPUs, which typically have high-performance networks. The network bandwidth of a single vineyardd instance is therefore the bottleneck. We can distribute the PyTorch model blobs across multiple vineyard instances to increase the aggregate network bandwidth.
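One way to sketch the placement side of this idea is a size-balanced partition of the model's tensors across N vineyard instances, so that concurrent readers pull shards from different daemons in parallel. This is illustrative only: the partition function below is plain Python, and the actual placement/fetch would go through vineyard's client API (e.g. one `vineyard.connect(...)` per instance socket), which is assumed rather than shown.

```python
from typing import Dict, List

def partition_state_dict(tensor_sizes: Dict[str, int],
                         n_instances: int) -> List[List[str]]:
    """Greedily assign tensors to instances, balancing total bytes.

    tensor_sizes maps tensor name -> size in bytes. Returns, per instance,
    the list of tensor names it should host. Largest tensors are placed
    first on the currently least-loaded instance (classic LPT heuristic).
    """
    loads = [0] * n_instances
    assignment: List[List[str]] = [[] for _ in range(n_instances)]
    for name, size in sorted(tensor_sizes.items(), key=lambda kv: -kv[1]):
        i = loads.index(min(loads))  # least-loaded instance so far
        assignment[i].append(name)
        loads[i] += size
    return assignment

# Example: a toy state_dict split across 2 instances.
sizes = {"fc1.weight": 100, "fc2.weight": 90, "fc1.bias": 50,
         "fc2.bias": 40, "ln.weight": 10}
print(partition_state_dict(sizes, 2))
```

Each consumer would then connect to every instance and fetch only its shard, so the aggregate read bandwidth scales with the number of vineyard instances instead of being capped by one daemon's NIC.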

dashanji avatar May 07 '24 03:05 dashanji