
Migrate docs.rs to RDS and ECS

Open jdno opened this issue 8 months ago • 11 comments

  • [ ] Merge open pull request for Packer
  • [ ] Build Docker image for watcher
  • [ ] Deploy watcher to ECS (only a single instance at a time! See the sketch after this list)
  • [ ] Figure out how to deploy all components together
    • [ ] Web server
    • [ ] Watcher
    • [ ] Build server
  • [ ] Test auto-scaling of builder on staging
  • [ ] Test deployments on staging
  • [ ] Configure AWS WAF to block IPs
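
A minimal sketch of the "single instance at a time" constraint for the watcher, assuming the service is created via boto3; cluster, service, and task definition names are placeholders, not the real simpleinfra resources:

```python
# Hypothetical sketch: an ECS service for the watcher that never runs more
# than one task at a time. All names are placeholders.
import boto3

ecs = boto3.client("ecs", region_name="us-east-1")

ecs.create_service(
    cluster="docs-rs",                 # assumed cluster name
    serviceName="docs-rs-watcher",     # assumed service name
    taskDefinition="docs-rs-watcher",  # assumed task definition family
    desiredCount=1,
    deploymentConfiguration={
        # maximumPercent=100 with minimumHealthyPercent=0 makes ECS stop the
        # old task before starting the new one, so at most one watcher runs.
        "maximumPercent": 100,
        "minimumHealthyPercent": 0,
    },
)
```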

Questions

  • How can the docs.rs team run one-off commands?
  • How are database migrations run as part of the deploy process?
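
One possible answer to both questions, assuming the components run on ECS: one-off commands, including database migrations, could be started as short-lived ECS tasks from the same task definition before the new services are rolled out. A hedged sketch with placeholder names (the migrate command shown is illustrative, not necessarily the real docs.rs CLI):

```python
# Hypothetical sketch: running a one-off command (e.g. database migrations)
# as a short-lived ECS task before the rest of the deploy continues.
# Cluster, task definition, subnets, container name, and command are placeholders.
import boto3

ecs = boto3.client("ecs", region_name="us-east-1")

ecs.run_task(
    cluster="docs-rs",             # assumed cluster name
    taskDefinition="docs-rs-web",  # assumed task definition
    launchType="FARGATE",
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-0123456789abcdef0"],     # placeholder
            "securityGroups": ["sg-0123456789abcdef0"],  # placeholder
        }
    },
    overrides={
        "containerOverrides": [
            {
                "name": "web",  # assumed container name
                # Illustrative migration command; the real docs.rs CLI may differ.
                "command": ["cratesfyi", "database", "migrate"],
            }
        ]
    },
)
```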

jdno avatar Oct 13 '23 09:10 jdno

another thing we need to figure out:

  • how would the database migrations be run in the deploy process?

syphar avatar Oct 14 '23 09:10 syphar

after checking our NGINX config there is a second piece we need to solve somehow:

IP blocks.

Every now and then we have a misbehaving crawler, and in those cases we have blocked the source IP in NGINX on our server.

I would prefer to have this in AWS / CloudFront if possible.

Otherwise we would add this to our web container, probably configured via environment variable?

syphar avatar Oct 14 '23 09:10 syphar

Next piece we need before prod:

Access to logs

syphar avatar Oct 14 '23 10:10 syphar

For blocking IPs, we should just set up a web-application firewall (AWS WAF). I actually think that we already have one set up for docs.rs, but I'm not 100% sure.
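
If it does end up being WAF, a minimal sketch of what blocking a crawler IP could look like, assuming a CLOUDFRONT-scoped IP set referenced by a BLOCK rule in the web ACL; the IP set name, id, and address are placeholders:

```python
# Hypothetical sketch: add a misbehaving crawler's IP to a WAFv2 IP set that
# the web ACL blocks. Name, id, and the IP itself are placeholders.
import boto3

# CLOUDFRONT-scoped WAFv2 resources are managed from us-east-1.
waf = boto3.client("wafv2", region_name="us-east-1")

ip_set = waf.get_ip_set(
    Name="docsrs-blocked-ips",                  # assumed IP set name
    Scope="CLOUDFRONT",
    Id="00000000-0000-0000-0000-000000000000",  # placeholder id
)

waf.update_ip_set(
    Name="docsrs-blocked-ips",
    Scope="CLOUDFRONT",
    Id="00000000-0000-0000-0000-000000000000",
    Addresses=ip_set["IPSet"]["Addresses"] + ["198.51.100.23/32"],  # new block
    LockToken=ip_set["LockToken"],  # updates require the current lock token
)
```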

Access to the logs is a good point! It probably makes sense to stream all logs to a central place, whether that's CloudWatch or an external tool like Datadog.
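
If it ends up being CloudWatch, a minimal sketch of the log wiring, assuming the task definitions are registered via boto3 with the awslogs driver; family, image, role, and log group are placeholders (Datadog would instead go through an agent sidecar or FireLens):

```python
# Hypothetical sketch: ship container stdout/stderr to CloudWatch Logs via the
# awslogs log driver on the ECS task definition. All names are placeholders.
import boto3

ecs = boto3.client("ecs", region_name="us-east-1")

ecs.register_task_definition(
    family="docs-rs-web",             # assumed task family
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",
    cpu="512",
    memory="1024",
    executionRoleArn="arn:aws:iam::123456789012:role/ecsTaskExecutionRole",  # placeholder
    containerDefinitions=[
        {
            "name": "web",
            "image": "ghcr.io/rust-lang/docs.rs:latest",  # placeholder image
            "logConfiguration": {
                "logDriver": "awslogs",
                "options": {
                    "awslogs-group": "/ecs/docs-rs-web",  # assumed log group
                    "awslogs-region": "us-east-1",
                    "awslogs-stream-prefix": "web",
                },
            },
        }
    ],
)
```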

jdno avatar Oct 16 '23 11:10 jdno

@jdno Please let me know if you need a hand with any of the items in this list 🙂

meysam81 avatar Nov 20 '23 14:11 meysam81

@jdno Coming from this discussion, I want to add that the docs.rs containers / servers should not be reachable directly from the internet, so all traffic needs to go through CloudFront & AWS WAF.
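
A minimal sketch of one way to enforce that, assuming an ALB in front of the containers: only allow ingress from CloudFront's origin-facing IP ranges via the AWS-managed prefix list (the security group id is a placeholder):

```python
# Hypothetical sketch: restrict the origin's security group so only CloudFront
# can reach it. The security group id is a placeholder.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# The AWS-managed prefix list that tracks CloudFront's origin-facing IP ranges.
pls = ec2.describe_managed_prefix_lists(
    Filters=[{"Name": "prefix-list-name",
              "Values": ["com.amazonaws.global.cloudfront.origin-facing"]}]
)
cloudfront_pl = pls["PrefixLists"][0]["PrefixListId"]

ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",  # placeholder: the ALB's security group
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 443,
        "ToPort": 443,
        "PrefixListIds": [{
            "PrefixListId": cloudfront_pl,
            "Description": "CloudFront origin-facing only",
        }],
    }],
)
```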

syphar avatar Dec 01 '23 08:12 syphar

One thought I had while thinking about this topic again:

  • CloudFront has a hard limit on in-progress wildcard path invalidations (15)
  • we are invalidating the crate docs after each build

from https://github.com/rust-lang/docs.rs/issues/1871#issuecomment-1268744723

Looking at https://docs.rs/releases/activity it seems we average at least 600 releases per day. If an average invalidation takes 5 minutes and we can have 15 in parallel, that's 3 invalidations per minute throughput. With 1440 minutes in a day, we could handle up to 4320 builds per day before we wind up in unbounded growth land. Of course, that's based on a significant assumption about how long an invalidation takes.

I'm not sure if we can / should handle invalidations differently, but we might think about using Fastly when we rework the infra?
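
For reference, a hedged sketch (not the actual docs.rs code) of what the per-build wildcard invalidation looks like, plus the back-of-the-envelope throughput from the quote above; the distribution id and paths are placeholders:

```python
# Hypothetical sketch: one wildcard invalidation per crate release, and the
# rough capacity implied by the 15-in-progress limit. Ids/paths are placeholders.
import time
import boto3

cloudfront = boto3.client("cloudfront")

def invalidate_release(distribution_id: str, crate: str, version: str) -> None:
    """Invalidate the cached docs for one crate release via wildcard paths."""
    cloudfront.create_invalidation(
        DistributionId=distribution_id,
        InvalidationBatch={
            "Paths": {
                "Quantity": 2,
                "Items": [f"/{crate}/{version}/*", f"/{crate}/latest/*"],
            },
            "CallerReference": f"{crate}-{version}-{int(time.time())}",
        },
    )

# Back-of-the-envelope numbers from the quote:
in_progress_limit = 15        # CloudFront wildcard invalidation limit
minutes_per_invalidation = 5  # assumed average completion time
per_day = (in_progress_limit / minutes_per_invalidation) * 24 * 60
print(f"~{per_day:.0f} wildcard invalidations/day vs ~600 releases/day")  # ~4320
```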

syphar avatar Mar 01 '24 10:03 syphar

Can't we de-duplicate invalidations if we approach the limit? E.g., a * invalidation every 5 minutes would presumably never hit the limit. Not sure how that would affect cache hit rates, but I'd expect designing around not needing invalidations or being ok with fairly blanket invalidations to be a good long-term strategy.

(I think we've had this conversation elsewhere before).

Mark-Simulacrum avatar Mar 01 '24 17:03 Mark-Simulacrum

Can't we de-duplicate invalidations if we approach the limit? E.g., a * invalidation every 5 minutes would presumably never hit the limit.

You mean "escalating" them, so when the queue is too long, just convert the queue into a full purge. This would definitely work, but it would mean that the user experience (especially outside the US) is worse until the cache fills up again. Of course this might be acceptable for us.
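
A minimal sketch of that escalation, assuming we keep a queue of pending wildcard paths and fall back to a blanket purge when it gets too long (the limit and queue handling are illustrative):

```python
# Hypothetical sketch: "escalate" a long invalidation queue into a single full
# purge ("/*") instead of many wildcard invalidations.

IN_PROGRESS_LIMIT = 15  # CloudFront limit on in-progress wildcard invalidations

def collapse_queue(pending_paths: list[str]) -> list[str]:
    """De-duplicate queued paths; if still too many, fall back to a full purge."""
    deduped = sorted(set(pending_paths))
    if len(deduped) > IN_PROGRESS_LIMIT:
        # Cache hit rates (and origin load / egress) get worse until the cache
        # fills back up, but the queue can never grow without bound.
        return ["/*"]
    return deduped

# Example: 600 queued crate paths collapse into one blanket invalidation.
print(collapse_queue([f"/crate-{i}/latest/*" for i in range(600)]))  # ['/*']
```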

being ok with fairly blanket invalidations

This also means that the backend always has to be capable of handling the full uncached load, plus higher egress costs depending on how often we have to do the full purge.

I also remember a discussion at EuroRust about having additional docs.rs webservers (also a read-only DB & local bucket?) in some regions (Europe?).

I'd expect designing around not needing invalidations

You're right, this is a valid discussion to have. I imagine this would only work if the URLs included something like the build number, with the rest of the more generic URLs replaced by redirects. If I'm not missing something, this would revert some of the SEO & URL work from https://github.com/rust-lang/docs.rs/issues/1438 (introducing /latest/ URLs). And then people would start linking specific docs builds on their sites, as they did before we had /latest/.

(I think we've had this conversation elsewhere before).

you're probably right :)

I wanted to bring it up here as a point for when we migrate the infra anyway.

syphar avatar Mar 01 '24 18:03 syphar

Note that (IMO) if we can get the cache keys set up right, i.e. everything except HTML is always at a by-hash file path, it seems to me that /latest/ can just be served with a short TTL (5 minutes), perhaps with stale-while-revalidate. That means there's a small period where it's not necessarily consistent which version you get across all pages, if some are cached locally and some aren't (and likewise for the CDN), but I don't see any real problem with that. Users mostly won't even notice.
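
A hedged sketch of what that could look like as response headers from the web server; the values are illustrative, not what docs.rs sends today:

```python
# Hypothetical sketch: short-TTL /latest/ HTML with stale-while-revalidate,
# versus effectively immutable hashed asset paths. Values are illustrative.

def cache_headers(path: str) -> dict:
    if "/latest/" in path:
        # Up to 5 minutes cached; the CDN may serve a stale copy for another
        # minute while it revalidates in the background.
        return {"Cache-Control": "public, max-age=300, stale-while-revalidate=60"}
    # By-hash file paths never change, so they can be cached "forever".
    return {"Cache-Control": "public, max-age=31536000, immutable"}

print(cache_headers("/serde/latest/serde/index.html"))
print(cache_headers("/-/rustdoc.static/main-20240301.js"))
```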

Yes, anything out of S3 especially can definitely be replicated into multiple regions pretty easily if we need it to be. This just causes issues while you need invalidations, since you're racing against replication, which can itself take some time (hours IIRC for the cheap option and minutes for the costly one?).

Mark-Simulacrum avatar Mar 07 '24 02:03 Mark-Simulacrum

Note that (IMO) if we can get the cache keys set up right, i.e. everything except HTML is always at a by-hash file path, it seems to me that /latest/ can just be served with a short TTL (5 minutes), perhaps with stale-while-revalidate. That means there's a small period where it's not necessarily consistent which version you get across all pages, if some are cached locally and some aren't (and likewise for the CDN), but I don't see any real problem with that. Users mostly won't even notice.

Jep, everything except HTML should already have hashed filenames, with some small exceptions. For HTML I (personally) would prefer a longer caching duration; 5 minutes outdated is probably fine, though I'm not sure how much that would reduce user happiness for some crates. I'll probably try to get better data at some point on how good the cache coverage for certain crates is, and see in more detail what the impact on users would be. And it might also be the case that it's just me who needs these kinds of response times for docs :)

Yes, anything out of S3 especially can definitely be replicated into multiple regions pretty easily if we need it to be. This just causes issues while you need invalidations, since you're racing against replication, which can itself take some time (hours IIRC for the cheap option and minutes for the costly one?).

That's good to know, thanks!

syphar avatar Mar 07 '24 18:03 syphar