open-vsx server consumes lots of bandwidth (and CPU time)

The open-vsx server consumes a lot of bandwidth -- a daily average of about 50 Mbps -- and that excludes transfers, as those are provided by Azure. This is equal to, or most of the time higher than, www.eclipse.org, which even serves the "setups.zip" file used by the Eclipse IDE to check for updates. I'm just not convinced the open-vsx site is really that busy.

I see two types of "expensive" requests:

- https://open-vsx.org/api/-/query?extensionId=vscode.hlsl&includeAllVersions=true, which is about 2.5M
- and a lot more of POST /api/-/query or POST /vscode/gallery/extensionquery, which are from 800K to 2.5M in size

Both kinds of response can be >15M as sent by the upstream server. Our reverse proxy then compresses them on the fly, as the upstream server doesn't seem to support compression.

Do those requests really need to return several MB of data?

Nov 29 '21 17:11 eclipsewebmaster

The /vscode/gallery/extensionquery endpoint mirrors the responses returned by the same Visual Studio Marketplace endpoint, so we can't reduce the size by changing the request or response for this endpoint. The request for this endpoint does include paging parameters to limit the response size. I'll test whether paging is applied correctly.

@spoenemann Would it be possible to limit the number of versions that the /api/-/query endpoint returns? For example, the vscode.hlsl extension has 457 versions.

[x] Test paging for /vscode/gallery/extensionquery endpoint.
[x] Enable compression server-side, so that the reverse proxy doesn't have to compress on-the-fly.

Feb 01 '22 13:02 amvanbaren

> Would it be possible to limit the number of versions that the /api/-/query endpoint returns? For example, the vscode.hlsl extension has 457 versions.

Yes, see also https://github.com/eclipse-theia/theia/issues/10538. We should add additional parameters to control how much information is returned, and adapt Theia so it requests only what is necessary.

Feb 03 '22 10:02 spoenemann

The paging should definitely help.

The compression not so much. Indeed, the bandwidth consumption reported in https://github.com/eclipse/openvsx/issues/379#issue-1066284053 was already measured on compressed responses. Initially, we were compressing responses at the LB level, but we eventually enabled compression at the server level via https://github.com/EclipseFdn/open-vsx.org/commit/adfde558e8aafd6453b8bdd47144f9d2f3eb0183 to take the load off the LB and to remove the (even higher) traffic on the internal network.

I'd say we keep this one open and we look at the impact of paging once merged and deployed. Thoughts?

Feb 04 '22 10:02 mbarbero

We need to evaluate whether paging would really improve the situation from the perspective of Theia. @msujew WDYT?

Feb 04 '22 10:02 spoenemann

I've removed compression from https://github.com/eclipse/openvsx/pull/412. https://github.com/eclipse-theia/theia/issues/10538#issuecomment-1028912793 proposes paging for the /api/-/query endpoint by using offset and size query parameters in the same way as the /api/-/search endpoint.
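
A minimal sketch of how a client could walk through paged /api/-/query results if the proposal were adopted. The offset and size parameter names come from the proposal above; everything else (the `page_url` and `iter_versions` helpers, the injected `fetch_json` callable, and the "short page means last page" stop rule) is an assumption made for illustration, not the actual open-vsx or Theia behavior.

```python
from typing import Callable, Iterator
from urllib.parse import urlencode

BASE_URL = "https://open-vsx.org/api/-/query"


def page_url(extension_id: str, offset: int, size: int) -> str:
    """Build a paged query URL. offset/size are the *proposed* parameters."""
    params = {
        "extensionId": extension_id,
        "includeAllVersions": "true",
        "offset": offset,
        "size": size,
    }
    return f"{BASE_URL}?{urlencode(params)}"


def iter_versions(
    extension_id: str,
    fetch_json: Callable[[str], dict],
    size: int = 50,
) -> Iterator[dict]:
    """Yield entries of the `extensions` array page by page.

    fetch_json is a hypothetical callable that takes a URL and returns the
    decoded JSON body; injecting it keeps this sketch free of network code.
    """
    offset = 0
    while True:
        page = fetch_json(page_url(extension_id, offset, size))
        entries = page.get("extensions", [])
        yield from entries
        # Assumption: a page shorter than `size` is the last one.
        if len(entries) < size:
            return
        offset += size
```

A client that only needs the newest compatible version could stop iterating as soon as it finds a match, so most of the older versions would never be transferred.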

Feb 07 '22 10:02 amvanbaren

The reason for the huge response size is very simple, at least for JSON API requests of the form

https://open-vsx.org/api/-/query?extensionId=${namespace}.${name}&includeAllVersions=true

as currently used when finding compatible extension versions for Theia.

This returns an extensions array which bundles, for each available version, most recent versions first, the information that you would get from

https://open-vsx.org/api/${namespace}/${name}/${version}

Now note that the response to each version-specific query contains an allVersions map that contains cross-reference URLs to all the other versions.

Already by itself, this is a bit strange because semantically, allVersions is certainly not a property of a specific version, particularly because it changes whenever a new version is published.

Which implies that the version-specific query results are not cacheable beyond update events. Personally, I would expect a version-specific query to return something that does not have to change just because new versions are published later on. That would make it easily cacheable as well.

But worse, the response to an includeAllVersions query includes the allVersions map in each of its extensions array entries again. So, if you have n versions, the reply will contain n**2 key-value pairs of allVersions, all of them useless in the sense of being redundant.

Fundamentally, this means that the includeAllVersions feature does not scale well; it should never have O(n**2) response sizes when n can be assumed to increase linearly over time.

Consequently, as more versions keep pouring in, the server (and each client) will ultimately be overwhelmed, unless their network capacity also grows quadratically over time.
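
To make the scaling argument concrete, here is a small sketch that builds a synthetic response with n versions, once with and once without a per-entry allVersions map, and prints the serialized sizes. The field names other than `extensions` and `allVersions`, the version numbering, and the example.invalid URLs are made up; only the shape (n entries, each carrying an n-entry map) follows the description above.

```python
import json


def synthetic_response(n: int, with_all_versions: bool) -> dict:
    """Build a fake query response with n versions (synthetic data only)."""
    versions = [f"1.0.{i}" for i in range(n)]
    base = "https://example.invalid/api/ns/name"  # placeholder URL
    all_versions = {v: f"{base}/{v}" for v in versions}
    entries = []
    for v in versions:
        entry = {"name": "name", "namespace": "ns", "version": v}
        if with_all_versions:
            # Every entry repeats the full cross-reference map.
            entry["allVersions"] = all_versions
        entries.append(entry)
    return {"extensions": entries}


def json_size(n: int, with_all_versions: bool) -> int:
    """Length of the serialized synthetic response in characters."""
    return len(json.dumps(synthetic_response(n, with_all_versions)))


for n in (50, 100, 200, 400):
    print(n, json_size(n, False), json_size(n, True))
```

Doubling n roughly doubles the size without the maps and roughly quadruples it with them, which is the linear versus quadratic behavior described above.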

How significant is this currently? Take vscode.bat with currently 487 versions. Expect therefore a total of about 487**2 = 237169 allVersions entries. Indeed the JSON response to

https://open-vsx.org/api/-/query?extensionId=vscode.bat&includeAllVersions=true

is 20,377,701 bytes long. Removing all the allVersions maps would reduce the response size to 650,812 bytes, a mere 3.2% of the original response size. Which means that more than 96% of the response is currently wasted on those unneeded, unrequested, improper, cache-unfriendly and ultimately DoS-causing allVersions entries.

Try it yourself, using curl and jq:

id=vscode.bat
url="https://open-vsx.org/api/-/query?extensionId=$id&includeAllVersions=true"
curl -LsS "$url" >verquery.$id.json
wc -c verquery.$id.json
jq -c 'del(.extensions[].allVersions)' <verquery.$id.json | wc -c
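
For those without jq at hand, roughly the same measurement can be sketched in Python. The helper names `strip_all_versions` and `compact_size` are made up for this example, and the byte count may differ slightly from the jq pipeline (jq appends a trailing newline and may format numbers differently).

```python
import json


def strip_all_versions(response: dict) -> dict:
    """Return a copy of a query response without per-entry allVersions maps."""
    stripped = dict(response)
    stripped["extensions"] = [
        {key: value for key, value in entry.items() if key != "allVersions"}
        for entry in response.get("extensions", [])
    ]
    return stripped


def compact_size(obj: dict) -> int:
    """Size in bytes of the compact UTF-8 JSON encoding, similar to `jq -c`."""
    text = json.dumps(obj, separators=(",", ":"), ensure_ascii=False)
    return len(text.encode("utf-8"))


# Tiny inline sample; for real data, json.load() the file saved by curl.
sample = {
    "extensions": [
        {"version": "1.1.0", "allVersions": {"1.1.0": "u1", "1.0.0": "u0"}},
        {"version": "1.0.0", "allVersions": {"1.1.0": "u1", "1.0.0": "u0"}},
    ]
}
print(compact_size(sample), compact_size(strip_all_versions(sample)))
```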

For most vscode.* builtins, there are currently about 457 versions, so the effects of including the alien and useless allVersions map are of a similar significance.

Theia's package.json now references an extension pack that bundles 80 plugins and excludes 7 of them for compatibility reasons. Resolving that extension pack by theia download:plugins uses includeAllVersions queries which transfer a total of 1,323,992,876 bytes. For purposes of version-filtering only. The plugins themselves consume only a fraction of that size.

With my slow internet connection that download:plugins did not work at all, causing ZBufError and ECONNRESET until helpful souls implemented a no-parallel feature for that code path. Thereafter it still took 30 minutes, and that's only because the transfer presumably uses compression. Doing all those queries with curl like above (i.e. without --compressed) took 3 hours here. So yes, the problem is already glaringly huge.

Suggestion

In the long run, you need to get rid of those n**2 allVersions maps. It has been a mistake to ever include them. They break caching and cause a steady increase in load. I'd suggest:

To include an allVersions map ONLY for unversioned queries without includeAllVersions, e.g. on query URLs of the form
```
https://open-vsx.org/api/${namespace}/${name}
```
In particular, to NOT include allVersions in the extensions array entries returned by queries with includeAllVersions=true. (Yes, I do see the superficial irony in this.)

In the meantime, you might want to deprecate expectations of the presence of an allVersions map and invent additional query params to explicitly enable/disable it.

Notes on compression

Compression of transfers indeed removes a lot of the redundancy in the version info responses. In my case, this seems to have improved things by a factor of 6.

Now, compression cannot change the asymptotic O(n**2) nature of the responses, but it does reduce the implied factor drastically. And there may be room for improvement. Watch this:

$ tar cz verquery | wc -c               # gzip
182794229
$ tar cj verquery | wc -c               # bzip2
19853075
$ tar c --xz verquery | wc -c           # xz, smallest
1540484
$ tar c --zstd verquery | wc -c         # zstd, fastest
2192405

Above, verquery is a directory holding those 1,323,992,876 bytes of includeAllVersions responses for 73 plugins. gzip reduces that to 14%, and I suppose that that is working behind the scenes when I do theia download:plugins.

bzip2 achieves better compression (down to 1.5%), but is slow.

xz is able to reduce the data to 0.11%, and that is not even surprising if you consider that each JSON response contains about 500 repetitions of almost the same information.

zstd was quickest and reduced the data to a remarkable 0.17%.

OK, using tar here is a bit unfair because that enables cross-file compression, but since the 73 response files are quite large (about 20 MB each), this still says something about the potential of those compression algorithms when applied to individual responses.
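
To avoid the cross-file effect of tar, the comparison can also be run on one response at a time. The sketch below does that with Python's standard library on a synthetic, deliberately redundant payload (made-up field values and example.invalid URLs), so its numbers say nothing about the real responses; it only shows the method. zstd is left out because it needs a third-party package on most Python versions.

```python
import bz2
import gzip
import json
import lzma


def redundant_payload(n: int = 150) -> bytes:
    """Synthetic response: n entries, each repeating an n-entry allVersions map."""
    versions = [f"1.0.{i}" for i in range(n)]
    all_versions = {
        v: f"https://example.invalid/api/ns/name/{v}" for v in versions
    }
    entries = [{"version": v, "allVersions": all_versions} for v in versions]
    return json.dumps({"extensions": entries}).encode("utf-8")


def compressed_sizes(data: bytes) -> dict:
    """Compress one payload with each stdlib codec and report sizes in bytes."""
    return {
        "raw": len(data),
        "gzip": len(gzip.compress(data)),
        "bzip2": len(bz2.compress(data)),
        "xz": len(lzma.compress(data)),
    }


for name, size in compressed_sizes(redundant_payload()).items():
    print(f"{name:>5}: {size}")
```

To repeat this on real data, read one of the saved verquery files as bytes and pass it to `compressed_sizes` instead of the synthetic payload.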

So, if server and clients can negotiate zstd compression, both computing time and transfer volume can be reduced drastically in comparison to gzip. This could give some room to breathe while deprecating and removing the allVersions mess.

May 12 '22 17:05 ccorn

Thanks @ccorn for your thorough analysis.

There are multiple endpoints that return a response with the allVersions property set:

- /admin/extension/{namespaceName}/{extensionName}
- /api/{namespace}/{extension}
- /api/{namespace}/{extension}/{targetPlatform}
- /api/-/publish
- /api/-/query

Only the /api/-/query endpoint returns multiple versions, each with allVersions set, which, as you rightly pointed out, has dire consequences.

The other endpoints return the latest version or, in the case of the /api/-/publish endpoint, the newest version. There it might make sense to keep the allVersions map for navigating to another version.

For the /api/-/query endpoint I'd suggest a change in the response structure to reduce redundancy. The most important change is the split between versions and versionLinks (allVersions). The other changes can be a nice bonus.

{
    "extensions": [
        {
            "name": "",
            "namespace": "",
            "namespaceUrl": "",
            "averageRating": 3.0,
            "downloadCount": 123,
            "reviewCount": 100,
            "reviewsUrl": "",
            "preview": false,    
            "verified": true,
            "versionAlias": ["latest", "pre-release"],
            "versions": [
                {
                    "version": "",
                    "targetPlatform": "",
                    "preRelease": false,
                    "publishedBy": {},
                    "timestamp": "",
                    "displayName": "",
                    "description": "",
                    "engines": {},
                    "categories": [],
                    "extensionKind": [],
                    "tags": [],
                    "license": "",
                    "homepage": "",
                    "repository": "",
                    "bugs": "",
                    "markdown": "",
                    "galleryColor": "",
                    "galleryTheme": "",
                    "qna": "",
                    "badges": [],
                    "dependencies": [],
                    "bundledExtensions": [],
                    "downloads": {},
                    "files": {
                        "download": "",
                        "readme": ""
                    }
                }
            ],
            "versionLinks": {
                "1.0.0": "",
                "0.9.2": ""
            }
        }
    ]
}
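
As a rough illustration of the proposed split, the sketch below regroups a response in the current flat shape (one `extensions` entry per version, each with its own allVersions map) into the shape shown above. It assumes, without having checked the server code, that the current entries carry the extension-level fields under the same names as in the proposal; the `restructure` helper and the key set are hypothetical.

```python
# Keys that the proposal above places at the extension level.
EXTENSION_LEVEL_KEYS = {
    "name", "namespace", "namespaceUrl", "averageRating", "downloadCount",
    "reviewCount", "reviewsUrl", "preview", "verified", "versionAlias",
}


def restructure(response: dict) -> dict:
    """Group per-version entries into one object per extension.

    The allVersions map is kept once per extension as versionLinks
    instead of once per version.
    """
    grouped = {}
    for entry in response.get("extensions", []):
        key = (entry.get("namespace"), entry.get("name"))
        ext = grouped.get(key)
        if ext is None:
            ext = {k: v for k, v in entry.items() if k in EXTENSION_LEVEL_KEYS}
            ext["versions"] = []
            ext["versionLinks"] = dict(entry.get("allVersions", {}))
            grouped[key] = ext
        # Everything that is neither extension-level nor allVersions
        # is treated as version-specific.
        ext["versions"].append({
            k: v for k, v in entry.items()
            if k not in EXTENSION_LEVEL_KEYS and k != "allVersions"
        })
    return {"extensions": list(grouped.values())}
```

With this layout the links appear n times in total rather than n**2 times, so the response grows linearly with the number of versions.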

May 17 '22 18:05 amvanbaren
