`POST /v2/version` randomly fails with HTTP 502

Open Kir-Antipov opened this issue 2 years ago • 4 comments

I maintain a GitHub Action that publishes Minecraft mods to different modding platforms (including Modrinth) - Kir-Antipov/mc-publish. The problem is that some of my users experience strange behavior of your API from time to time - sometimes POST /v2/version (the route that should create a new version of the given project) returns a Cloudflare-branded HTTP 502 Bad Gateway page instead of an API response. Here is a real example of one such case:

[Image: HTTP 502 Example]

More examples of failed uploads can be seen here:

  • https://github.com/fxmorin/carpet-fixes/runs/6843729255?check_suite_focus=true
  • https://github.com/fxmorin/carpet-fixes/runs/6906557785?check_suite_focus=true
  • https://github.com/DenisD3D/Mc2Discord/runs/7138343984?check_suite_focus=true

Let's go through all the nodes that could cause the problem:

  • Cloudflare
  • mc-publish
  • labrinth

Cloudflare

At first I thought that Cloudflare itself kicked in and simply refused to let the request proceed. That was before I actually looked at the response shown in the image above. This is not how Cloudflare blocks suspicious calls to a website. According to their documentation, real-life experience, and common sense, Cloudflare should return HTTP 429 Too Many Requests in that case. Here's what their documentation says about Cloudflare-branded HTTP 502 pages:

Cloudflare returns an Cloudflare-branded HTTP 502 or 504 error when your origin web server responds with a standard HTTP 502 bad gateway or 504 gateway timeout error

[...]

Contact your hosting provider to troubleshoot these common causes at your origin web server:

  • Ensure the origin server responds to requests for the hostname and domain within the visitor’s URL that generated the 502 or 504 error.
  • Investigate excessive server loads, crashes, or network failures.
  • Identify applications or services that timed out or were blocked.

What can we learn from this? The page shown above indicates that something is wrong with the origin server itself; it is just a forwarded response wrapped with fancy Cloudflare stuff. Crossing Cloudflare off the list of suspects.
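The reasoning above can be written down as a small status classifier. This is only a sketch of the argument made in this issue; `classifyStatus` and the `FailureSource` labels are hypothetical names, not part of mc-publish, Cloudflare, or labrinth.

```typescript
// Hypothetical labels for where a failed request most likely went wrong.
type FailureSource = "none" | "rate-limit" | "client" | "origin-gateway" | "server";

// Maps an HTTP status code to the suspect it points at, following the
// argument in this issue: 429 means the request was throttled, while a
// 502/504 means the origin behind the proxy failed to answer properly.
function classifyStatus(status: number): FailureSource {
  if (status === 429) return "rate-limit";
  if (status === 502 || status === 504) return "origin-gateway";
  if (status >= 500) return "server";
  if (status >= 400) return "client";
  return "none";
}
```

With this mapping, the 502 pages from the failed runs land in the "origin-gateway" bucket rather than "rate-limit", which is the whole point of this section.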


mc-publish

After the last sentence of the previous section, I could skip this point, but let's look into it anyway.

How could mc-publish cause a problem during an upload? In theory, of course.

  1. It could exceed a rate limit, or its requests could be blocked as suspicious by Cloudflare. This is a fair point: mc-publish uses the node-fetch/1.0 (+https://github.com/bitinn/node-fetch) User-Agent (which clearly states that the request received by the server was automated), and it does not implement delays between consecutive requests at the moment (this is being worked on). Still, let me remind you that Cloudflare does not block these requests: it returns HTTP 502, not 429 or any other 4xx. I can tell you even more: Cloudflare fully trusts these requests because they are in fact sent by Microsoft Corporation (the IP addresses of the machines that execute GitHub Actions belong to them, obviously). That's why mc-publish exists in the first place: CurseForge configured its Cloudflare protection to be absolutely hostile (not even strict) towards its users, i.e., the CurseForge API is impossible to use from your local machine because requests always return Cloudflare's "Checking Your Browser before Accessing" page, which is just not the case in the GA environment.
  2. It could send a malformed request. I won't even tell you about the time I set up a reverse proxy just to find out that [object Object] was being sent instead of actual multipart/form-data because node-fetch silently cast a FormData polyfill to a string, while your API was absolutely correctly telling me that the multipart/form-data was malformed, without all that 502 nonsense; that's a story for another day. If mc-publish malformed requests, it would be something deterministic and, therefore, reproducible. So I forked a repo that had problems uploading its artefacts to Modrinth and re-executed the same workflow that had just failed there, using my Modrinth token and my Modrinth project id (no386Ohx, I use it for internal testing). The next thing you know, the exact same payload (the only difference being the project id), formed in the exact same environment, was successfully processed by the labrinth API.
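For reference, the delays between consecutive requests mentioned in point 1 could look roughly like the sketch below. It is not mc-publish's actual code: `sendWithRetry`, `shouldRetry`, `backoffDelays`, the `send` callback, and the set of retryable statuses are all assumptions made for illustration.

```typescript
// Statuses treated as transient gateway failures (an assumption of this sketch).
const RETRYABLE_STATUSES = new Set<number>([502, 503, 504]);

function shouldRetry(status: number): boolean {
  return RETRYABLE_STATUSES.has(status);
}

// Delay before each retry, doubling every time: baseMs, 2*baseMs, 4*baseMs, ...
function backoffDelays(retries: number, baseMs: number): number[] {
  const delays: number[] = [];
  for (let i = 0; i < retries; i++) {
    delays.push(baseMs * Math.pow(2, i));
  }
  return delays;
}

function sleep(ms: number): Promise<void> {
  return new Promise<void>((resolve) => setTimeout(resolve, ms));
}

// `send` is a hypothetical callback that performs one upload attempt and
// resolves with at least the HTTP status of the response.
async function sendWithRetry(
  send: () => Promise<{ status: number }>,
  retries: number = 3,
  baseMs: number = 1000,
): Promise<{ status: number }> {
  let response = await send();
  for (const delay of backoffDelays(retries, baseMs)) {
    if (!shouldRetry(response.status)) break;
    await sleep(delay);
    response = await send();
  }
  return response;
}
```

Note that blindly retrying a POST that creates a version is only safe if a failed attempt leaves nothing behind on the server; the sketch does not check for that.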

Crossing mc-publish off the list of suspects.


labrinth

Unlike the previous points, I have nothing to say in defense of labrinth API:

  • It returns HTTP 502 Bad Gateway, a server error
  • It had similar issues before, randomly failing with HTTP 502 on common requests
  • The first problem of this kind appeared after I moved from API v1 to v2; v1 worked perfectly for the same projects that are now unable to publish their artefacts to Modrinth

Therefore, I kindly ask you to investigate the issue and find its cause. Also, do not forget that HTTP 5xx is never a user problem ;)

Kir-Antipov avatar Jul 04 '22 11:07 Kir-Antipov

We know the issue, it’s caused by bad user input in version creation. However, due to a bug in actix-multipart, instead of returning the error it returns a 500 error and closes the connection early (hence the bad gateway).

Once that issue is fixed in our backend dependency I will close this. Sorry for the inconvenience!

Geometrically avatar Jul 04 '22 15:07 Geometrically

Do you have any idea what counts as bad user input, given that the same input with a different project id is not treated as bad?

Kir-Antipov avatar Jul 04 '22 16:07 Kir-Antipov

That sounds like a duplicated version number to me.

triphora avatar Jul 04 '22 16:07 triphora

The first thing I checked. Sadly, this is not it. There's a publicly accessible example of user input that was rejected by the Modrinth API. What could be wrong here?

{
  "project_id": "7Jaxgqip",
  "name": "Carpet-Fixes v1.10.0 for 1.19",
  "version_number": "v1.10.0",
  "changelog": "**New Rules:**    \r\n`lecternBlockDupeFix` - Fixes being able to dupe lecterns using packets  \r\n`sitGoalAlwaysResettingFix` - Fixes the SitGoal continuously restarting if the owner is offline, instead of doing the checks normally  \r\n\r\n**New Rules:** (*Related to OutOfMemory*)  \r\n`debugSimulatedOutOfMemory` - A jigsaw block with a lightning rod ontop of itself will make it throw a real out of memory exception when receiving a block update  \r\n`simulatedOutOfMemoryCrashFix` - Fixes crashes caused by `debugSimulatedOutOfMemory`  \r\n`someUpdatesDontCatchExceptionsFix` - Fixes crashes caused by block updates using the SixWayEntry update  \r\n  \r\n**Fixes:**  \r\nFixed `reIntroduceZeroTickFarms` which broke in 1.19  \r\n[#48](https://github.com/fxmorin/carpet-fixes/issues/48) - When using the rule `optimizedRecipeManager` you couldn't craft some recipes    ",
  "game_versions": ["1.19"],
  "version_type": "release",
  "loaders": ["fabric"],
  "featured": true,
  "dependencies": [],
  "primary_file": "0",
  "file_parts": ["0", "1"]
}
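
A quick local consistency check of such a payload might look like the sketch below. The rules in `checkVersionPayload` are my own guesses at what could reasonably be validated (non-empty lists, `primary_file` present in `file_parts`); they are not labrinth's actual validation logic, and the `VersionPayload` interface only covers the fields used here.

```typescript
// Subset of the fields from the payload above; the shape is taken from the
// example, not from any official schema.
interface VersionPayload {
  project_id: string;
  version_number: string;
  game_versions: string[];
  loaders: string[];
  primary_file: string;
  file_parts: string[];
}

// Returns a list of human-readable problems; an empty list means that none
// of the (assumed) checks found anything suspicious.
function checkVersionPayload(p: VersionPayload): string[] {
  const problems: string[] = [];
  if (p.game_versions.length === 0) problems.push("game_versions is empty");
  if (p.loaders.length === 0) problems.push("loaders is empty");
  if (p.file_parts.length === 0) problems.push("file_parts is empty");
  if (new Set(p.file_parts).size !== p.file_parts.length) {
    problems.push("file_parts contains duplicates");
  }
  if (p.file_parts.indexOf(p.primary_file) === -1) {
    problems.push("primary_file is not listed in file_parts");
  }
  return problems;
}

// The relevant fields of the rejected payload shown above.
const rejected: VersionPayload = {
  project_id: "7Jaxgqip",
  version_number: "v1.10.0",
  game_versions: ["1.19"],
  loaders: ["fabric"],
  primary_file: "0",
  file_parts: ["0", "1"],
};

const problemsFound = checkVersionPayload(rejected);
```

None of these checks flags the payload above, which is why I'm asking what else could count as bad input.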

Kir-Antipov avatar Jul 04 '22 16:07 Kir-Antipov

This should not occur anymore since #436 was merged. Let us know if it happens again!

triphora avatar Sep 05 '22 03:09 triphora