Webhost: Add a /downloads page hosting the latest release
What is this fixing or adding?
Title. There are no links to it currently, as I'm unsure where they should go other than possibly the setup guides, which is well out of scope for this. The page requests the files from GitHub and saves them to the DB on the first request for any of the files. There's likely a better way to do the data storage on startup, but I couldn't get it to work easily and this works well enough. Adds an API endpoint for each specific file, and a page with links to those endpoints.
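For reviewers following along, here is a minimal sketch of the flow described above, assuming a Flask blueprint; every name in it (the blueprint, the in-memory store standing in for the DB table, the asset name and URL) is an illustrative placeholder rather than the PR's actual code:

```python
# Rough sketch only: fetch a release asset from GitHub on the first request,
# keep it locally, and serve it from local storage afterwards.
import urllib.request

from flask import Blueprint, Response, abort

downloads = Blueprint("downloads", __name__)

# Placeholder in-memory store standing in for the DB table the PR describes.
_file_cache: dict[str, bytes] = {}

# Asset name/URL are illustrative, not the real release asset names.
ASSET_URLS = {
    "Setup.Archipelago.exe":
        "https://github.com/ArchipelagoMW/Archipelago/releases/latest/download/Setup.Archipelago.exe",
}


@downloads.route("/download/<string:filename>")
def download_file(filename: str):
    if filename not in ASSET_URLS:
        abort(404)
    if filename not in _file_cache:
        # First request for this file: pull it from GitHub and keep it.
        with urllib.request.urlopen(ASSET_URLS[filename], timeout=30) as response:
            _file_cache[filename] = response.read()
    return Response(_file_cache[filename], mimetype="application/octet-stream")
```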
How was this tested?
Ran webhost and clicked all the buttons.
If this makes graphical changes, please attach screenshots.
An approach question - am I understanding correctly that you're downloading and caching the files and self-hosting them, rather than just redirecting the user to the GitHub download? If so, why? I would think the redirect approach would be preferable, as we wouldn't have to host the file. Plus it would eventually allow AP updates decoupled from WebHost updates.
An approach question - am I understanding correctly that you're downloading and caching the files and self-hosting them, rather than just redirecting the user to the GitHub download? If so, why?
We've managed to hit GitHub's rate limiting multiple times on new releases. By self-hosting we have another link to send people to without having to worry about that. The other benefit is being able to display names different from the raw file names, which improves the signposting so users get the correct version. I have no doubt that a lot of the Windows 7 downloads aren't from Windows 7 users.
I would think the redirect approach would be preferable, as we wouldn't have to host the file. Plus it would eventually allow AP updates decoupled from WebHost updates.
In this tentative future, decoupled WebHost updates would presumably not contain any actual AP changes, just newly pulled-in games, and whether or not there's a new release would be unaffected by that.
In this tentative future, decoupled WebHost updates would presumably not contain any actual AP changes, just newly pulled-in games, and whether or not there's a new release would be unaffected by that.
I'm thinking more about the flip side, where AP updates without the website needing to update. Thinking about it more, this might not be feasible for a variety of reasons, but the point is that the current approach would cache an outdated download in that scenario.
Putting my thoughts here for discussion. I still think this is something berserker has to decide ultimately, because I don't manage the server.
- Imo we should prefer direct downloads from GitHub if possible, because they have multiple data centers with more bandwidth. We could either provide an "alt" or "mirror" link, or have the WebHost try to determine if GitHub is down somehow - maybe an API request that is cached for a few minutes, and if it times out, switch to local downloads?
- With the current code, we lose stats. Using the main database for that isn't great, but would probably work. I.e. add a download_count field to the model.
- Going through Python and the DB for a big download is bad. It produces a lot of unnecessary IO, and does not work for some HTTP features such as `Range:`. The actual file should be put in a static route and the client should be directed to that; however, the link displayed should still hit your Python route that updates the stats before redirecting to the static route via `return redirect(url_for("static", filename="..."), code=302)` (sketched below). Note that as long as we write the stats directly to the main DB, nothing should be cached; when switching to something like a redis "clone", the DB row / redirect target could be cached (but not the endpoint). The best-performing solution that gives the most accurate numbers would be to have nginx trigger the stats update directly and link directly to the static file, however I don't think that's necessary.
As a side note, setting up a redis clone on the webhost sounds like something we should do at some point anyway, because memcached is currently defunct in our deployment and caching the same thing multiple times in memory (because we have multiple processes serving the webhost) is wasteful.
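For illustration, the count-then-redirect idea from the list above could look roughly like this; the `record_download` helper is hypothetical, since where the stats actually live is still an open question in this thread:

```python
# Sketch of "count, then redirect to the static file": Python records the
# download, and the static route (ideally nginx) serves the actual bytes.
from flask import Flask, redirect, url_for

app = Flask(__name__)


def record_download(filename: str) -> None:
    """Hypothetical stats hook, e.g. incrementing a download_count column."""


@app.route("/download/<string:filename>")
def count_and_redirect(filename: str):
    record_download(filename)
    # 302 as suggested above, so every click still passes through the counter
    # before being handed off to the static file.
    return redirect(url_for("static", filename=f"downloads/{filename}"), code=302)
```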
I agree with BlackSilver that the downloads shouldn't be hosted locally. They should be sourced from GitHub, which already has the CDN infrastructure in place.
Sending a GET request to this URL will return JSON data for the most recent release of Archipelago, including the release notes: https://api.github.com/repos/ArchipelagoMW/Archipelago/releases/latest
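For reference, here is a standard-library example of reading that endpoint and pulling out the asset names and download URLs (the field names follow GitHub's documented release schema):

```python
# Example of reading the latest-release metadata with only the standard library.
import json
import urllib.request

API_URL = "https://api.github.com/repos/ArchipelagoMW/Archipelago/releases/latest"

with urllib.request.urlopen(API_URL, timeout=10) as response:
    release = json.load(response)

print(release["tag_name"])  # version tag of the latest release
for asset in release["assets"]:
    # each asset carries its file name and a direct download URL
    print(asset["name"], asset["browser_download_url"])
```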
I recommend making a request to the GitHub API every time the user hits the page, thereby ensuring the most recent release is always available on the site, and not requiring the WebHost to be restarted to pick up any new release files.
Regarding style, I recommend using the island style, as this page does not have a lot of content in its body.
I recommend making a request to the GitHub API every time the user hits the page
GitHub API endpoints are rate-limited, so I strongly advise against that and would suggest having the server fetch and cache it for a few minutes. Whether we create a DB entry from this is a separate question; however, if we want to provide an alt download for cases where GitHub is not accessible, then I'd say yes.
The current rate limit for anonymous connections is 60 per hour per IP address, as shown in the response header `x-ratelimit-limit: 60`. This is problematic on shared internet connections (like a campus), and also for a server that accesses the GitHub API per user request. (For completeness' sake, the rate limit with auth is 5000, but the auth can't be shared with the client, and without caching even 5000 may not be enough once this becomes the primary way to download.)
If we cache it for a few (3? 5?) minutes, we will not even have to create an access token on the server side; however, the cache would ideally be shared between the multiple gunicorn processes for this, which I don't think it currently is.
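One possible way to get a cache shared across gunicorn workers without standing up redis yet would be flask-caching's filesystem backend; a sketch, assuming a recent flask-caching version (the cache directory is a placeholder):

```python
# Sketch: flask-caching with a filesystem backend, which all gunicorn worker
# processes can share (unlike the default per-process SimpleCache).
from flask import Flask
from flask_caching import Cache

app = Flask(__name__)
cache = Cache(config={
    "CACHE_TYPE": "FileSystemCache",    # a per-process "SimpleCache" would not be shared
    "CACHE_DIR": "/tmp/webhost-cache",  # placeholder path, writable by every worker
    "CACHE_DEFAULT_TIMEOUT": 300,       # 5 minutes, matching the interval discussed here
})
cache.init_app(app)
```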
My goal when suggesting requesting against the API was to prevent needing to restart the server to pick up new files. Caching the links for five minutes seems entirely reasonable. I didn't look closely at the rate limit, and wasn't aware it was so low.
So is the resolution to keep a local timer and, when the user clicks the link, make an API request if last checked > 5 minutes, and just always redirect to the GitHub download link? Do we still want to host at least a mirror in case the API request fails? I also think 5 minutes is shorter than really necessary.
So is the resolution to keep a local timer and, when the user clicks the link, make an API request if last checked > 5 minutes
It's not necessarily a timer; you can instead cache all routes that would fetch the JSON, i.e. `@cache.cached(timeout=300)  # cache for 5 minutes to avoid rate limit` instead of the blank `@cache.cached` you already have.
If you don't put the direct link into the response (see also below), but instead want to redirect the user, you also have to cache the detection of where to redirect to, i.e. GitHub or `static_file`, since that needs to be aware of GitHub's availability.
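Put together, the cached-route suggestion would look roughly like this; the route path, template name, and SimpleCache backend are placeholders for whatever WebHost actually uses:

```python
# Sketch of the cached route: the GitHub API is queried at most once per cache
# window, no matter how many visitors hit the page.
import json
import urllib.request

from flask import Flask, render_template
from flask_caching import Cache

app = Flask(__name__)
cache = Cache(app, config={"CACHE_TYPE": "SimpleCache"})

API_URL = "https://api.github.com/repos/ArchipelagoMW/Archipelago/releases/latest"


@app.route("/downloads")
@cache.cached(timeout=300)  # cache for 5 minutes to avoid rate limit
def downloads_page():
    with urllib.request.urlopen(API_URL, timeout=10) as response:
        release = json.load(response)
    # downloads.html is a placeholder template name.
    return render_template("downloads.html", assets=release["assets"])
```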
and just always redirect to the GitHub download link?
I would fetch the JSON on the server side when the user goes to the overview page, and from there you can either create direct links or update the DB entry, or both. Downloading a new file for local mirroring has to use some sort of lock so multiple processes don't try to download the same file. Note that `urlopen` has a timeout argument that can be used to detect if the GitHub API seems to be unresponsive.
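A sketch of the mirroring step under those constraints, using the third-party `filelock` package for the cross-process lock (an assumption, not a current dependency) and `urlopen`'s timeout to treat a slow GitHub as unresponsive; the static path is a placeholder:

```python
# Sketch: mirror an asset into the static directory exactly once, even if
# several gunicorn workers notice the missing file at the same time.
import os
import urllib.request

from filelock import FileLock  # third-party package, assumed here

STATIC_DOWNLOADS = "WebHostLib/static/downloads"  # placeholder path


def mirror_asset(name: str, download_url: str) -> str:
    """Ensure the asset exists locally and return its path."""
    target = os.path.join(STATIC_DOWNLOADS, name)
    if os.path.exists(target):
        return target
    os.makedirs(STATIC_DOWNLOADS, exist_ok=True)
    with FileLock(target + ".lock"):       # serialize workers racing on this file
        if not os.path.exists(target):     # re-check after acquiring the lock
            # the timeout lets callers treat an unresponsive GitHub as "down"
            with urllib.request.urlopen(download_url, timeout=10) as response, \
                    open(target + ".part", "wb") as out:
                out.write(response.read())
            os.replace(target + ".part", target)  # atomic rename once complete
    return target
```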
Do we still want to host at least a mirror in case the API request fails?
In my opinion, yes. We can either always put a "(mirror)" link there that always points to the static file, or we can swap out the link (which then gets cached for 5 minutes), or have a single route that redirects the user accordingly. Be sure to only count the local downloads toward the download stats so we don't count anything double - i.e. GitHub numbers + local stats = total downloads.
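A rough sketch of the single-route variant that only counts local downloads; the availability probe and `record_local_download` helper are illustrative assumptions, and in practice the probe result would be cached as discussed above:

```python
# Sketch of the "single route" variant: prefer GitHub, fall back to the local
# mirror, and only count the downloads served locally.
import urllib.request

from flask import Flask, redirect, url_for

app = Flask(__name__)
RELEASE_API = "https://api.github.com/repos/ArchipelagoMW/Archipelago/releases/latest"


def github_reachable() -> bool:
    """Cheap availability probe; in practice the result should be cached."""
    try:
        with urllib.request.urlopen(RELEASE_API, timeout=5):
            return True
    except OSError:
        return False


def record_local_download(filename: str) -> None:
    """Hypothetical stats hook; GitHub's own counters cover the other path."""


@app.route("/download/<string:filename>")
def download(filename: str):
    if github_reachable():
        # GitHub counts these downloads itself, so they are not added to local stats.
        return redirect(
            "https://github.com/ArchipelagoMW/Archipelago/releases/latest/download/"
            + filename, code=302)
    record_local_download(filename)
    return redirect(url_for("static", filename=f"downloads/{filename}"), code=302)
```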
I also think 5 minutes is shorter than really necessary
I don't think so. If we add a file to the release with a delay (as is done for Linux builds) it's nice to not have to wait forever. If we also use that to detect if GitHub is down, then a shorter interval would be even better.
I mean, this is now kind of two features in one:
1. add a download page to archipelago.gg
2. host downloads locally for cases where GitHub is unresponsive

So we can also split it into two separate PRs if you prefer; however, both LegendaryLinux and I think that using GitHub for the normal route (1.) is preferable because of available bandwidth.
Addressed everything as requested. The only bit that isn't done, as I was unsure how to approach it, is storing the number of downloads.