ol-infrastructure
Use Redis serverless for Open edX deployments
Description/Context
The combined load of caching and Celery tasks for Open edX systems periodically exhausts the configured Redis cluster in ElastiCache. ElastiCache now offers a serverless deployment of Redis that removes the fixed upper and lower capacity limits, so we no longer need to statically allocate a cluster sized for peak demand. The serverless offering only supports Redis 7.1 and higher, which is compatible with the Redis version in the default image used by Tutor.
Plan/Design
- [ ] Investigate and document the Pulumi code changes necessary to utilize serverless Redis
- [ ] Modify the edxapp Pulumi code to provision and configure a serverless Redis cluster
- [ ] Determine appropriate cutover process to migrate cache to new serverless cluster
- [ ] Ensure that the changes have been applied to all environments
  - [ ] xPRO
    - [ ] CI
    - [ ] QA
    - [ ] Production
  - [ ] MITx Online
    - [ ] CI
    - [ ] QA
    - [ ] Production
  - [ ] MITx
    - [ ] CI
    - [ ] QA
    - [ ] Production
  - [ ] MITx Staging
    - [ ] CI
    - [ ] QA
    - [ ] Production
Relevant discussion in Open edX forum - https://discuss.openedx.org/t/redis-memory-max-memory-page-load-times-and-useability-suffer-dramatically/12782/16
The key eviction policy in Redis has been updated to use LRU on all keys, not just keys that have a TTL set. https://github.com/mitodl/ol-infrastructure/commit/e5d23ad8d1702b92113be205968672de08667971
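To illustrate why the policy change matters, here is a minimal toy model in Python. It assumes the old and new settings correspond to Redis's `volatile-lru` and `allkeys-lru` values of `maxmemory-policy` (the linked commit is the authority on the actual values). The `TinyCache` class is hypothetical, counts keys rather than bytes, and uses exact LRU, whereas real Redis approximates LRU by sampling.

```python
from collections import OrderedDict


class TinyCache:
    """Toy stand-in for a Redis node with a maxmemory limit (counted in keys)."""

    def __init__(self, max_keys: int, policy: str) -> None:
        if policy not in ("allkeys-lru", "volatile-lru"):
            raise ValueError(f"unsupported policy: {policy}")
        self.max_keys = max_keys
        self.policy = policy
        # Maps key -> has_ttl; insertion order tracks recency (oldest first).
        self._data: "OrderedDict[str, bool]" = OrderedDict()

    def set(self, key: str, has_ttl: bool = False) -> None:
        if key in self._data:
            self._data.pop(key)  # re-inserting below marks it most recent
        elif len(self._data) >= self.max_keys:
            self._evict_one()
        self._data[key] = has_ttl

    def _evict_one(self) -> None:
        # Pick the least recently used key that the policy allows us to evict.
        victim = next(
            (
                key
                for key, has_ttl in self._data.items()
                if self.policy == "allkeys-lru" or has_ttl
            ),
            None,
        )
        if victim is None:
            # Mirrors Redis rejecting writes when nothing is evictable.
            raise MemoryError("OOM: no evictable keys under " + self.policy)
        del self._data[victim]

    def keys(self) -> list:
        return list(self._data)


if __name__ == "__main__":
    for policy in ("volatile-lru", "allkeys-lru"):
        cache = TinyCache(max_keys=2, policy=policy)
        cache.set("a")  # no TTL
        cache.set("b")  # no TTL
        try:
            cache.set("c")
            print(policy, "->", cache.keys())
        except MemoryError as exc:
            print(policy, "->", exc)
```

With only TTL-less keys in the cache, the `volatile-lru` instance has nothing it may evict and the write fails, while the `allkeys-lru` instance drops the oldest key and accepts the write.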
I’m looking at Redis serverless, just from a high level, to see if it makes sense. I’m not sure it does in all instances. I can’t find a good reference on ‘ECPUs’ and how to estimate them. I suspect the best way to get a feel for it is to just do it in a CI environment and see how far off my naive estimate is from reality (in either direction).
Outstanding question: When the new LRU-on-all-keys config (`allkeys-lru`) makes it to production (I don’t believe it is there yet), will the ~15GB of data drop to something more reasonable? I suspect 15GB represents near-max usage on the node simply because Redis doesn’t evict or expire anything until it absolutely needs to.
edxapp-redis-mitx-ci
- Current Env:
  - Node Costs: cache.t3.small x 3 = $25.30 * 3 = $75.90/m
  - Data: ~40MB
  - Total Network Traffic (In + Out): 29GB
- Serverless:
  - Storage: 1GB (minimum) -> $90.00/m
  - ECPUs: naive calc based on network traffic alone: 29,000,000,000 / 1,000,000 * 0.0034 = $98.60/m
  - Serverless Costs: ~$188/m
edxapp-redis-mitx-production
- Current Env:
  - Node Costs: cache.r7g.4xlarge x 3 = $1273.85 * 3 = $3821.55/m
  - Data: ~15GB
  - Total Network Traffic (In + Out): 639GB
- Serverless:
  - Storage: ~15-16GB at $0.125 per GB-hour = between $1350.00 and $1440.00/m
  - ECPUs: naive calc based on network traffic alone: 639,000,000,000 / 1,000,000 * 0.0034 = $2172.60/m
  - Serverless Costs: $3522.60 - $3612.60/m
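The estimates above can be reproduced with a short Python sketch. It uses the rates quoted in this thread ($0.125 per GB-hour of storage, $0.0034 per million ECPUs) and assumes a 720-hour month, which is what the $90.00/m figure for the 1GB minimum implies. The ECPU term copies the naive formula as written, which in effect counts one ECPU per byte of traffic; that ratio is the comment's assumption, not a confirmed billing rule. The function names are illustrative only.

```python
HOURS_PER_MONTH = 720  # assumed 30-day month; matches 1 GB -> $90.00/m
STORAGE_PRICE_PER_GB_HOUR = 0.125  # USD, as quoted in the thread
ECPU_PRICE_PER_MILLION = 0.0034  # USD, as quoted in the thread
MIN_STORAGE_GB = 1.0  # serverless storage minimum used in the CI estimate


def provisioned_monthly_cost(node_price: float, nodes: int = 3) -> float:
    """Monthly cost of a fixed-size cluster at a per-node monthly price."""
    return round(node_price * nodes, 2)


def serverless_monthly_cost(storage_gb: float, traffic_bytes: int) -> dict:
    """Naive serverless estimate: storage plus traffic-derived ECPUs."""
    billed_gb = max(storage_gb, MIN_STORAGE_GB)
    storage = billed_gb * STORAGE_PRICE_PER_GB_HOUR * HOURS_PER_MONTH
    # Naive assumption from the thread: bytes transferred stand in for ECPUs.
    ecpu = traffic_bytes / 1_000_000 * ECPU_PRICE_PER_MILLION
    return {
        "storage": round(storage, 2),
        "ecpu": round(ecpu, 2),
        "total": round(storage + ecpu, 2),
    }


if __name__ == "__main__":
    print("CI provisioned:", provisioned_monthly_cost(25.30))
    print("CI serverless:", serverless_monthly_cost(0.04, 29_000_000_000))
    print("Prod provisioned:", provisioned_monthly_cost(1273.85))
    for gb in (15, 16):
        print(f"Prod serverless ({gb}GB):", serverless_monthly_cost(gb, 639_000_000_000))
```

Measuring actual ECPU consumption in a CI environment, as suggested earlier in the thread, would replace the traffic-based term with a real number.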
The fix for this issue turned out to be setting a Redis parameter so that the key eviction strategy covers all keys. There is no longer any pressing need to adopt Redis serverless. If and when we decide to use the serverless offering in the future, we can open a new issue with appropriate scope.