Unexpectedly poor cache performance
Caddy version: 2.9.1
The Problem:
When the cache directive is enabled, the config block below suffers a dramatic drop in throughput. On my system with a 7950X, testing with ApacheBench (ab) like so:
ab -k -t 60 -n 100000 -c 16 http://localhost:8080/image.png
Without caching: 85,000-87,000 requests per second
With caching: <3,600 requests per second
image.png is a small (<20KiB) .png file of 185x250px dimensions.
Context:
For comparison, nginx in a similar scenario manages slower non-cached throughput (50-60k r/s), but dramatically faster cached throughput (15-17k r/s).
I tested against nginx because my project currently uses its fork, OpenResty (essentially nginx with Lua scripting), to serve and cache static assets and to reverse proxy the main app. The expected production traffic consists of a high rate of requests against object-storage-backed image and thumbnail storage. It is crucial that this is cached (object storage egress costs money) and that the cache is performant enough to keep up with a high volume of requests.
If I were to switch to Caddy, its performance must not be too much worse.
The Config
{
	cache
}

:80 {
	route {
		cache # the "no cache" test was done with this line commented out
		file_server {
			root /var/www
		}
	}
}
Caddy is built via Docker like so:
ARG CADDY_VERSION="2.9.1"

FROM caddy:$CADDY_VERSION-builder-alpine AS builder
RUN xcaddy build \
    --with github.com/caddyserver/cache-handler

FROM caddy:$CADDY_VERSION-alpine
COPY --from=builder /usr/bin/caddy /usr/bin/caddy
EXPOSE 80
and run like so:
docker run --rm \
    -it \
    -v /home/luna/code/caddytest/config:/etc/caddy \
    -v /home/luna/code/caddytest/data:/data \
    -v /home/luna/code/caddytest/static:/var/www \
    -p 8080:80 \
    caddytest:latest
Needless to say, a very basic and barebones setup.
What I tried:
- Using a different storage backend for the caching module (tried otter, etcd)
- Putting it in a handle block (see the sketch after this list)
- Using it without any route or handle blocks
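For reference, here is a minimal sketch of the handle variant with an illustrative TTL in the global cache block; the storage backend selection (e.g. otter) also goes in that block, but its exact sub-options vary between cache-handler releases, so treat this as an approximation of what I ran rather than my exact config:

{
	cache {
		ttl 60s # illustrative TTL; the backend (e.g. otter) is also selected in this block
	}
}

:80 {
	handle {
		cache
		file_server {
			root /var/www
		}
	}
}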
This performance discrepancy compared to our current nginx setup is completely unacceptable and prevents me from moving ahead with Caddy in production (as my resources rely on the cache having high throughput).
Well, caching static content from the filesystem doesn't really make sense. Caching is more for dynamic content (where producing the content incurs a cost, so that cost is only paid once per cache TTL), or for short-circuiting the request pipeline so the reverse proxy isn't hit. file_server is optimized to be very fast, and the syscalls that read content from files are very fast, so any caching layer added in front of that is likely to always be much slower.
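To illustrate, a minimal sketch of the placement where a shared cache usually pays off, in front of a reverse-proxied upstream rather than the local filesystem (the upstream address is just a placeholder):

:80 {
	cache
	reverse_proxy 127.0.0.1:9000 # placeholder upstream whose responses are expensive to produce
}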
I assume the filesystem example is just to provide something easy for caddy developers to replicate, not for the sake of how caching would actually be used.
Regardless, the performance comparison is totally apples and oranges. Comparing a highly optimized filesystem to a cache storage driver isn't a fair comparison.
Well, I initially discovered the slowness of the cache when benchmarking my current nginx/OpenResty setup against a Caddy rewrite. All was splendid and Caddy was either on par or faster, until it came to serving data from S3 with a cache layer in front of it. With cache, OpenResty managed 15-18k requests per second, but Caddy managed merely ~3,500. Raw S3 requests against a local S3 server (s3proxy) without cache in Caddy were much faster than caching them too: the caddy-fs-s3 module managed ~6k requests per second, and the s3-proxy module 10-12k requests per second. And before someone says it, I know that non-local requests (like those done in production) would be much slower; that doesn't excuse the cache being 5x slower than its nginx counterpart, though. The production site is expected to see a high rate of requests for smaller files (thumbnails), so a 5x slower cache would be a deal breaker.
I wondered whether or not the slowdown was somehow the fault of my bad configuration, so I distilled it to the test case above, which clearly shows the caching module being at fault.
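For context, the S3-backed tests were roughly of this shape; I've simplified it here to a plain reverse_proxy pointed at the local s3proxy endpoint, whereas the actual tests used the caddy-fs-s3 and s3-proxy modules (the address is illustrative):

:80 {
	cache
	reverse_proxy 127.0.0.1:8081 # local s3proxy instance standing in for Cloudflare R2
}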
There's probably something to this that would be worth investigating as the performance disparity is pretty glaring.
But I also wonder why you're even doing this...? If you're offloading something from your server to S3, why would you then cache it on the server again? Either don't offload it at all, or do and then use CloudFront (or another CDN) for caching those requests (though you really ought to consider other services; S3's storage and, especially, egress bandwidth costs are outrageous).
I don't know how it's relevant to this issue for me to justify why I need to cache my object storage backend, or how it's relevant to suggest an entirely different (and astronomically more expensive than a 15 EUR/mo VPS) solution when the complaint is about a specific problem with this one. The short answer: because it benefits the pattern of use we observe from most of our website's users; because we don't use AWS but Cloudflare R2 (I just call it "S3" because that's what the API is called); and because we store terabytes of content in object storage, and a server with terabytes of SSD is expensive.
Please note that my first sentence said that this is probably worth looking into... The rest was just a question - no need to get contentious.
Likewise, I was very much not suggesting you move everything back to your server. I was saying that since you've offloaded from the server to S3/R2, why would you then implement caching on the server again? Especially since the Cloudflare CDN is completely free and automatic when using R2. (Technically it's always free and automatic, but they officially have limits on media files that seemingly get ignored in general, though they could be enforced; I believe using R2 allows serving anything via the CDN.)
PS: since budget seems to be a factor for you, providers like Hetzner offer extremely cheap and powerful dedicated servers with lots of storage and egress bandwidth, and they have an S3-compatible service as well. Both would probably be cheaper than R2. Perhaps worth a look, especially since you appear to be in Germany.
@Meow when you're using the plain file_server, you don't set any Cache headers or cache validation; that's why it's faster. But in fact it will be faster with the cache-handler when using a web browser, because content will be served from the local cache (your laptop) instead of the distant cache (the cache-handler). It will also handle validation for you (put it in your CDN, invalidate items, etc.).
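If you want to see what the cache-handler adds, you can inspect the response headers with curl; the Cache-Status header follows RFC 9211, and the exact values depend on the release:

curl -sI http://localhost:8080/image.png | grep -iE 'cache-status|age|etag'
# e.g. Cache-Status: Souin; hit; ttl=120   (served from the shared cache)
#      Age: 42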
I would like to have the best performance, but there are compromises to be made between performance and spec/RFC compliance.
As I have mentioned before, the plain file_server was merely an illustrative example that is easy to reproduce. The actual use case places the caching module in front of an S3 (in our case Cloudflare R2) backend to reduce the amount of hits against the file storage itself.
I have tested this plugin on the staging environment, by the way, to verify whether it's suitable for my production use case, and I would probably have to file another issue. The module had terrible cache miss rates (dramatically higher than nginx, by a factor of nearly 10x), and it also crashed Caddy entirely because the various in-memory storage backends exhausted server RAM (except the Redis backend, but only because I configured it with sane memory limits); there didn't seem to be a backend that relies solely on a filesystem cache. I tried every single cache storage backend available, and configured them in various ways to the best of my ability, but even when I managed to avoid outright RAM exhaustion, the cache miss rates were unacceptably, disappointingly high.
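For what it's worth, the "sane memory limits" on the Redis backend were set on the Redis side rather than in Caddy, roughly along these lines (values are illustrative):

# redis.conf: cap memory and evict old entries instead of growing without bound
maxmemory 2gb
maxmemory-policy allkeys-lru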
We may investigate together (on Slack or Discord) to fit your use case and maybe reduce the friction and improve the hit rate.
I'm available on Discord (I don't use Slack for non-work things, and this is one, considering my prod deploy is a hobby project). For the sake of my own privacy I'd prefer not to post my username in this comment section; I'll send you an email with it shortly so you can add me (seeing as I was unable to find any Discord invitations in the README).
I completely agree with the benchmarks above. I don't know what exactly the developers did wrong, but the performance is ridiculously bad. It feels like there's some serious underlying issue. I tried different versions and storages, even in-memory ones like Otter and Redis, and the RPS is so low it just doesn't make sense. But it's really happening. It's surprising that people are still using a plugin that clearly doesn't work as it should.
Understandable, and well, this is an open source project, we invite you to help solve the problem, or your money back :+1:
Hello @Pepsi1917, PRs to improve performance are welcome.