Improving upload speeds on high-bandwidth connections

Jorgeminator opened this issue 2 years ago

Basic information

  • Release version: v0.9.4
  • System: Windows 10 Pro 21H2
  • Upload bandwidth available: ~100Mbps

Steps to reproduce behavior

Upload a large number of images, each 1-2MB in size.

Expected behavior

The uploader should utilize as much of the available upload bandwidth as possible. With 100Mbps upload bandwidth, an upload speed of at least 10MB/s could be considered satisfactory.

Actual behavior

Average bandwidth utilization stays in the 2-3MB/s range, resulting in unnecessarily long upload times when uploading multiple gigabytes of imagery.

The uploader derives the chunk size from the average image size: chunk_size = min(max(avg_image_size, MIN_CHUNK_SIZE), MAX_CHUNK_SIZE)

MAX_CHUNK_SIZE is capped at 16MB, but that cap is never reached for small files: with typical 1-2MB images, the clamp simply returns avg_image_size.
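
For reference, the selection boils down to something like this (a quick sketch; the 16MB cap is from the code, while the 1MB floor is an assumed value for illustration):

```python
# Sketch of the chunk size selection described above. The 16MB cap is
# from the source; the 1MB floor is an assumed value for illustration.
MIN_CHUNK_SIZE = 1 * 1024 * 1024    # assumed floor
MAX_CHUNK_SIZE = 16 * 1024 * 1024   # cap, per the code

def choose_chunk_size(avg_image_size: int) -> int:
    # Clamp the average image size between the floor and the cap.
    return min(max(avg_image_size, MIN_CHUNK_SIZE), MAX_CHUNK_SIZE)

# With 1-2MB images the clamp is a no-op, so the chunk size simply
# tracks the average image size and the 16MB cap is never reached:
print(choose_chunk_size(2 * 1024 * 1024))  # 2097152 bytes, i.e. 2MB
```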

Quick workaround

Forcing a larger chunk size, e.g. by setting both MIN_CHUNK_SIZE and MAX_CHUNK_SIZE to 32MB, improves the upload speed significantly, going from 2-3MB/s to over 12MB/s. [screenshot: Upload_speed]
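
Back-of-envelope on why fewer, larger chunks matter: each chunk costs at least one HTTP request/response round trip, so the fixed per-chunk overhead shrinks as chunks grow. For a hypothetical 10GB upload:

```python
# Request counts for a hypothetical 10GB upload at various chunk sizes.
upload_bytes = 10 * 1024**3  # 10GB of imagery (illustrative figure)
for chunk_mb in (2, 16, 32):
    chunks = upload_bytes // (chunk_mb * 1024**2)
    print(f"{chunk_mb:>2}MB chunks -> {chunks:>4} requests")
# ->  2MB chunks -> 5120 requests
# -> 16MB chunks ->  640 requests
# -> 32MB chunks ->  320 requests
```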

Jorgeminator commented Oct 13 '22 19:10

This is great news!

I actually tweaked this parameter and didn't notice much improvement under a few network configurations. I will try again and make the default more reasonable, or even configurable. Thank you for the feedback!

ptpt commented Oct 13 '22 19:10

Increasing the chunk size beyond a certain point is pointless over non-deterministic networks, because that is not how either IP or HTTP works. If you want to saturate your bandwidth over a non-deterministic network, you have to transmit one (large) block of data over multiple logical connections. The client needs to be able to split the data, and the server needs to be able to reassemble it on the other end. Neither IP nor HTTP implements this automagically, and it is not trivial to implement either. Libraries do exist for this, but it is going to take a lot of work to implement on both ends.

In the meantime, you can try saturating your bandwidth by, for example, simultaneously running multiple instances of mapillary_tools, each uploading one sequence. :wink:
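
Something along these lines could drive several uploader processes at once (a rough sketch only; the "captures" directory layout is made up, and the exact arguments should be checked against mapillary_tools upload --help):

```python
# Rough sketch: one mapillary_tools process per sequence directory.
# "captures/" and its layout are placeholders; adjust to your setup.
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

SEQUENCE_DIRS = [p for p in Path("captures").iterdir() if p.is_dir()]

def upload(seq_dir: Path) -> int:
    # Each process opens its own connection(s), so several together
    # can fill a pipe that a single chunked stream cannot.
    return subprocess.run(["mapillary_tools", "upload", str(seq_dir)]).returncode

with ThreadPoolExecutor(max_workers=4) as pool:
    # Four concurrent uploads; raise or lower to taste.
    exit_codes = list(pool.map(upload, SEQUENCE_DIRS))
```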

JakeSmarter commented Oct 13 '22 20:10

> Increasing the chunk size beyond a certain point is pointless over non-deterministic networks, because that is not how either IP or HTTP works. [...]
>
> In the meantime, you can try saturating your bandwidth by, for example, simultaneously running multiple instances of mapillary_tools, each uploading one sequence. 😉

This was just a quick workaround and got me close enough to saturating my connection. Splitting the upload to squeeze the last 5% out of it is not worth the hassle.

Jorgeminator commented Oct 13 '22 20:10

@GITNE A 32MB chunk_size is definitely overkill, you're right about that. But in my case, going from a 1MB chunk_size to, say, 8MB or 16MB gave the best improvement.

Jorgeminator commented Oct 14 '22 21:10

Thank you! Bumping MIN_CHUNK_SIZE up to 16MB took my uploads from ~80 Mbps to fully utilizing my 300 Mbps upstream!

ToeBee commented Oct 17 '22 16:10

@Jorgeminator I am happy for you that you were able to improve your throughput by tweaking the chunk size. Nevertheless, I would like to note, especially to @ptpt, that the chunk size is primarily meant as a parameter for load balancing on an upload server, ensuring equal access to the upload server(s) for all uploaders. Hence, it should be the server that sets the (max) chunk size, and clients should not be able or allowed to exceed it.

If clients can increase the chunk size at will, this will eventually lead to discrimination against lower-throughput connections, perhaps even to the point of starvation. The larger the chunk size gets on the client side, the easier it becomes for higher-throughput connections to squeeze lower-throughput connections off the line. One or two high-throughput uploaders are not going to hurt much, but hundreds or thousands at the same time may have a severe impact. Not that I expect any such thing to happen any time soon, but it will eventually happen in the long run.

Thus, only the server should set the chunk size, per upload session. Throughput saturation on high-throughput links should be achieved by using multiple logical connections.
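
To illustrate the idea, a purely hypothetical handshake in which the server advertises the cap per upload session (the endpoint shape and field name are invented, not an existing mapillary_tools or Graph API call):

```python
# Hypothetical sketch only: the session endpoint and "max_chunk_size"
# field are invented to illustrate server-dictated chunk sizing.
import requests

def negotiated_chunk_size(session_url: str, preferred: int) -> int:
    resp = requests.get(session_url)   # imagined upload-session handshake
    resp.raise_for_status()
    server_max = resp.json().get("max_chunk_size", preferred)
    # The client may choose any chunk size up to the server's cap,
    # but never above it.
    return min(preferred, server_max)
```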

JakeSmarter commented Oct 17 '22 17:10

@GITNE Sounds logical. The problem is not actually the chunk size itself, but the way the client chooses it. The code already allows a chunk size up to 16MB, but the average image size will be the limiting factor unless the pictures are <1MB or >16MB. Having thousands of 2MB images will choke your upload compared to uploading 50MP/10MB photos. I don't see why a user who uploads 12MP images should be penalized while a user uploading 50MP gets the full capacity.

Jorgeminator commented Oct 17 '22 18:10

I agree we should maximize the upload egress by default. I did some tests under a good network in the PR, and 16MB seems to be a good default.

I will make a new release, and it would be great if you could give it a try and see if it improves things (cc @bob3bob3).

ptpt commented Nov 01 '22 18:11

This change does not affect video uploads, though, because the chunk size for video uploads already defaults to 16MB.

ptpt commented Nov 01 '22 18:11

v0.9.5 with the upload speed improvement is released: https://github.com/mapillary/mapillary_tools/releases/tag/v0.9.5

Download binaries there, or run pip install:

pip install -U mapillary_tools

ptpt commented Nov 09 '22 03:11