Move ever-growing *.spec.whatwg.org storage off of the VM disk

Open foolip opened this issue 5 years ago • 17 comments

This week marquee, which hosts all static whatwg.org sites, grew its disk usage past 80% of its 30GB and triggered an alert. I've increased the size to 50GB for now.

The constant increase is because of commit snapshots. We could compress on disk or deduplicate more, but it would still slowly grow, indefinitely. We shouldn't store these files on a fixed-size block device, but in an object store where there is no fixed upper limit.

DigitalOcean Spaces is a solution we could use, by letting nginx forward requests to it.

However, by still having all requests hit nginx we wouldn't be making full use of a solution like this. Spaces has a CDN feature with certificate handling, but it requires control over the DNS and is thus blocked by https://github.com/whatwg/misc-server/issues/75.

foolip avatar Nov 29 '19 14:11 foolip

To clarify, request forwarding is a backend matter and does not involve redirects?

annevk avatar Nov 29 '19 19:11 annevk

DigitalOcean Spaces doesn't support serving a website from it directly, but this is tracked in https://ideas.digitalocean.com/ideas/DO-I-318.

The smallest change that would work is to let nginx continue to handle redirects, and to proxy any request that doesn't redirect to an internal Spaces endpoint. Spaces itself would never respond with a redirect, at least not until https://ideas.digitalocean.com/ideas/DO-I-318 is fixed.

For all of the static sites, I think our requirements are:

  • many redirect rules with varying 301/302
  • control over content-type headers beyond what's inferred by file extensions
  • adding a bunch of headers like HSTS

foolip avatar Nov 30 '19 22:11 foolip

The most elaborate redirect rules are in https://github.com/whatwg/misc-server/blob/master/debian/marquee/nginx/sites/whatwg.org.conf.

foolip avatar Nov 30 '19 22:11 foolip

Sorry, to restate my question, will our end-user-visible response URLs remain unchanged?

annevk avatar Dec 02 '19 10:12 annevk

Yes, of course, any solution that doesn't give full control of the URL layout I'd just rule out :)

foolip avatar Dec 13 '19 16:12 foolip

Numbers in https://github.com/whatwg/meta/issues/161#issuecomment-598046081 suggest that everything would easily fit in a Git repo, but you can't serve a website from a repo so that doesn't solve everything here.

foolip avatar Mar 12 '20 07:03 foolip

Hijacking this issue to drop some notes about using a CDN, which isn't the same problem as running out of disk space...

Some numbers based on using goaccess to analyze /var/log/nginx/access.log.{2,3,4}.gz, which seems to be about a day's worth of requests. With all hosts mixed together, we get 872.72 GiB of requests for /. Filtering for just html.spec.whatwg.org, it's 721.76 GiB. So most of our traffic is serving https://html.spec.whatwg.org/. That's what I would have expected. If we were to use a CDN, we should do it for https://html.spec.whatwg.org/ first and see what that does for us.

I'm not sure about our numbers. I'm pretty sure they're the compressed size, but we're not using 30*872 GiB ~= 26 TiB of transfer per month, more like 4-5 TiB. So this analysis is probably all wrong :)
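For reference, the extrapolation above can be redone as a small sketch. It assumes, as the comment does, that the three log files cover roughly one day and that a month is 30 days; the function name is made up for illustration.

```python
GIB_PER_TIB = 1024


def monthly_transfer_tib(daily_gib: float, days: int = 30) -> float:
    """Extrapolate one day's transfer (in GiB) to a month, in TiB."""
    return daily_gib * days / GIB_PER_TIB


# Figures reported by goaccess in the comment above.
all_hosts_gib = 872.72
html_spec_gib = 721.76

all_hosts_monthly_tib = monthly_transfer_tib(all_hosts_gib)
html_spec_share = html_spec_gib / all_hosts_gib

print(f"all hosts: {all_hosts_monthly_tib:.1f} TiB/month")
print(f"html.spec.whatwg.org share: {html_spec_share:.0%}")
```

The extrapolated monthly figure is several times the 4-5 TiB actually observed, which is the mismatch that makes the analysis suspect.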

foolip avatar Oct 06 '20 09:10 foolip

It looks like https://www.digitalocean.com/products/app-platform/ could be something to look into for this. From a cursory view, it seems more like AppEngine, in that it supports Node.js and other languages, static content, and you don't manage the servers yourself.

foolip avatar Jan 26 '21 13:01 foolip

I have looked into using DigitalOcean Spaces with nginx in front, using proxy_pass to forward requests. This would allow us to keep all the redirects, which is nice.

The main problem this runs into is that an S3-like storage bucket is just a set of named objects whose names are paths; it's not a file system. The following can't be done in the usual way and needs some other solution:

  • redirecting "directories" like /validator to /validator/, but not /faq (exists) to /faq/, and preferably not /doesnotexist to /doesnotexist/
  • serving /validator/ from /validator/index.html in the bucket (without redirecting)
  • file listings, which we currently use fancyindex for

I think that if the first problem could be solved, then the second can be done with a location directive handling anything with a trailing slash, and we could generate static directory listings where we want them.
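To make the three cases concrete, here is a minimal sketch of the decision a front end would have to make against a flat set of object keys. It is not how nginx or Spaces works; `resolve` and the key layout are hypothetical, and it assumes the presence of `<dir>/index.html` is what marks a "directory".

```python
from typing import Optional, Set, Tuple


def resolve(path: str, keys: Set[str]) -> Tuple[int, Optional[str]]:
    """Map a request path to (status, object key or redirect target).

    `keys` is the flat set of object names in the bucket, stored without
    a leading slash (an assumption of this sketch).
    """
    key = path.lstrip("/")
    if path.endswith("/"):
        # "/validator/" is served from "validator/index.html", no redirect.
        index = key + "index.html"
        return (200, index) if index in keys else (404, None)
    if key in keys:
        # "/faq" exists as an object, so serve it as-is.
        return (200, key)
    if key + "/index.html" in keys:
        # "/validator" is a "directory": redirect to add the slash.
        return (301, path + "/")
    # "/doesnotexist" must not be redirected to "/doesnotexist/".
    return (404, None)


keys = {"index.html", "faq", "validator/index.html"}
for path in ["/validator", "/validator/", "/faq", "/doesnotexist"]:
    print(path, resolve(path, keys))
```

The catch is that every request without a trailing slash may need two existence checks against the bucket, which is exactly the part that has no obvious nginx equivalent.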

foolip avatar Mar 25 '21 23:03 foolip

It looks like DigitalOcean Spaces is maybe particularly bad at this: S3 has a whole "website hosting mode", see e.g. their docs on index.html files. Whereas https://www.digitalocean.com/community/questions/spaces-set-index-html-as-default-landing-page seems to have seen no activity. Maybe using S3 (which we already do for PR preview) would be the right way to go here?

domenic avatar Mar 25 '21 23:03 domenic

Hmm, I hadn't considered just using AWS S3, but that would probably solve most of this. What's not great about it is that we'd depend on both DigitalOcean and S3 being healthy at all times.

What mystifies me is that neither S3 nor Spaces seems to have a way to set a Location header for a specific object, even though both let you customize Content-Type and friends. If that were possible, this would be easy enough in Spaces too.

foolip avatar Mar 25 '21 23:03 foolip

S3 has a complicated system: https://docs.aws.amazon.com/AmazonS3/latest/userguide/how-to-page-redirect.html . It is a bit mystifying why they don't allow something simpler. E.g. the most flexible option, the JSON rules, is capped at 50. And the per-object redirect doesn't seem to let you choose the status codes.

domenic avatar Mar 26 '21 00:03 domenic

Probably a bad idea to diversify even further, but there's also Netlify which has very straightforward _redirects and _headers files. I can't tell if they're really meant to scale in the same way as S3, but they seem serious...

domenic avatar Mar 26 '21 00:03 domenic

If we could put objects in the bucket which the nginx front end turns into a redirect to add a slash, then I think we'd be set. (We'd also need to generate file listings but that could be a deploy step, not too hard I think.)
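As a rough idea of what generating file listings at deploy time could look like, here is a sketch that groups flat object keys by prefix and renders a bare HTML index. The function names and markup are made up for illustration; this is not a replacement for fancyindex, just the shape of the deploy step.

```python
import html
from collections import defaultdict
from typing import Dict, Iterable, Set
from urllib.parse import quote


def directory_entries(keys: Iterable[str]) -> Dict[str, Set[str]]:
    """Group flat object keys into entries per "directory" prefix.

    Prefixes are "" for the root and "a/b/" otherwise; subdirectory
    entries carry a trailing slash.
    """
    entries: Dict[str, Set[str]] = defaultdict(set)
    for key in keys:
        parts = key.split("/")
        prefix = ""
        for part in parts[:-1]:
            entries[prefix].add(part + "/")
            prefix += part + "/"
        if parts[-1]:  # skip keys that end in a slash
            entries[prefix].add(parts[-1])
    return dict(entries)


def render_listing(prefix: str, names: Set[str]) -> str:
    """Render a minimal static index page for one prefix."""
    items = "\n".join(
        f'<li><a href="{html.escape(quote(name))}">{html.escape(name)}</a></li>'
        for name in sorted(names)
    )
    title = html.escape("Index of /" + prefix)
    return f"<!doctype html>\n<title>{title}</title>\n<ul>\n{items}\n</ul>\n"


listing = directory_entries(["snapshots/abc/index.html", "snapshots/def/index.html"])
print(render_listing("snapshots/", listing["snapshots/"]))
```

A deploy would then upload the rendered page as `<prefix>index.html`, but only for prefixes that don't already have an index.html of their own.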

@domenic do you know if S3 when hosting a static web site will redirect "directories" with no trailing slash to add a slash?

One option we could look into is "deprecating" URLs with a trailing slash and writing redirect rules for the ones we currently have. But I don't love having to muck around with our URLs because we're changing the storage solution.

foolip avatar Mar 26 '21 08:03 foolip

> Do you know if S3 when hosting a static web site will redirect "directories" with no trailing slash to add a slash?

From https://docs.aws.amazon.com/AmazonS3/latest/userguide/IndexDocumentSupport.html :

For example, the following URL, with a trailing slash, returns the photos/index.html index document.

http://bucket-name.s3-website.Region.amazonaws.com/photos/

However, if you exclude the trailing slash from the preceding URL, Amazon S3 first looks for an object photos in the bucket. If the photos object is not found, it searches for an index document, photos/index.html. If that document is found, Amazon S3 returns a 302 Found message and points to the photos/ key. For subsequent requests to photos/, Amazon S3 returns photos/index.html. If the index document is not found, Amazon S3 returns an error.

So, it sounds like it will 302 redirect them. That appears to be similar to what we have today (e.g. https://whatwg.org/validator currently 301 redirects to https://whatwg.org/validator/.)

domenic avatar Mar 29 '21 17:03 domenic

https://github.com/aws-samples/amazon-cloudfront-secure-static-site looks fairly promising for this.

foolip avatar Sep 06 '21 15:09 foolip

I won't be able to make time for WHATWG infra work this year, so here's a brain dump.

The /var/www/html.spec.whatwg.org/ directory on marquee is 29 GB, and that's the biggest problem in any migration. As a Git repository it's 6 GB, so that rules out any solution of the shape "put everything in Git and deploy on every commit". That's unfortunate, because there are many options for that.

A solution would take the shape of a storage bucket that deploys write to, and a frontend/CDN that just serves from that bucket. The hard part of that is preserving all of our redirects, and I've seen no storage buckets which have built-in redirect support that's expressive enough. (S3 has some stuff, not enough.) We would need something like https://developers.cloudflare.com/rules/url-forwarding/bulk-redirects/reference/csv-file-format/ I think.
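A sketch of that shape, under heavy assumptions: the frontend consults a redirect table first and only then falls back to the bucket. The three-column layout (source, target, status) is invented here and is not Cloudflare's CSV format; the paths are placeholders; and it handles exact-match rules only, whereas the existing nginx config may rely on more than that.

```python
import csv
import io
from typing import Dict, Optional, Set, Tuple

# Hypothetical redirect table: source path, target, status code.
REDIRECTS_CSV = """\
/old-page,/new-page,301
/temporary,https://example.org/elsewhere,302
"""


def load_redirects(text: str) -> Dict[str, Tuple[str, int]]:
    """Parse the table into {source: (target, status)}."""
    table: Dict[str, Tuple[str, int]] = {}
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # tolerate blank lines
        source, target, status = row
        table[source] = (target, int(status))
    return table


def respond(
    path: str, redirects: Dict[str, Tuple[str, int]], keys: Set[str]
) -> Tuple[int, Optional[str]]:
    """Redirects win; otherwise serve the object with the matching key."""
    if path in redirects:
        target, status = redirects[path]
        return (status, target)
    key = path.lstrip("/")
    return (200, key) if key in keys else (404, None)


redirects = load_redirects(REDIRECTS_CSV)
print(respond("/old-page", redirects, {"faq"}))
print(respond("/faq", redirects, {"faq"}))
```

Keeping the 301/302 distinction per rule is the point; whatever replaces nginx has to carry a status code with every redirect.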

This problem ought to be easy for someone who has experience maintaining large websites and migrating between hosting... if they were meticulous about preserving redirects.

That's all.

foolip avatar Feb 16 '24 05:02 foolip