certmagic Corrupted metadata JSON files caused by bug #297

Bug #297 frequently leads to corrupted JSON files when multiple instances mount a shared NFS directory as storage. We’ve encountered numerous cases in production where this causes certificate failures with the error message "decoding certificate metadata: invalid character '}' after top-level value", rendering the affected sites completely unusable.

The number of corrupted files is increasing, and I can’t restart the service.

I see the bug fix is in the master branch. Please release a new version that includes this fix. Additionally, how can I identify and repair the already corrupted files among thousands without deleting all files? Alternatively, is there a way to ignore corrupted files during loading?

Jul 16 '24 02:07 lqs

If you're in a hurry, you can make a build yourself with this:

xcaddy build --with github.com/caddyserver/certmagic@16e2e0b

Jul 16 '24 04:07 francislavoie

Curious what evidence you have that this and #297 are related -- how do you know that superfluous ARI requests are corrupting files?

Sep 23 '24 22:09 mholt

I’m using certmagic as a library for a web server that serves front-end files and supports custom domains. When the error occurred, I checked the logs and file modification times and found that an extra ‘}’ appeared after multiple ARI updates. After reviewing the source code, I suspect this issue is due to concurrent writes to NFS.

Prior to the error, I was already preparing to migrate from NFS to S3 and implemented a custom storage to access S3. After the error, I expedited the migration process. Since S3 writes are atomic, it prevented the issue, even though redundant ARI requests still remain.

Sep 24 '24 07:09 lqs

NFS has known bugs related to synchronization; that might be the actual problem.

S3 does not provide atomic operations for us to be able to safely offer synchronization, even if writes are synced. I recommend using a database like MySQL/Postgres/Redis for high concurrency distributed storage.

Sep 24 '24 12:09 mholt

Moving another discussion with @Zenexer into here:

As far as I can tell, this isn't fixed by 16e2e0b3443037882be32c731d1e85a90cb69014. I'm able to repro the extra } error reliably--multiple times per hour--regardless of the lengths of the original and new files, so it's not just a simple truncation issue. It's always one extra }, even though file sizes often differ by more than one character.

Removing the extraneous closing braces restores sanity, but only briefly. They keep reappearing in new files.

Originally posted by @Zenexer in #297

Anyway, @lqs, from what you're saying:

Using NFS, files have extra }
Using S3, files don't have extra } but redundant ARI requests still happen

This actually checks out with known issues with both of those storage backends (as noted just above). NFS has sync/flush issues when it comes to concurrent users over a network; and S3 doesn't provide atomic operations, so proper locking/syncing of an operation like an ARI request is impossible.

@Zenexer, are you also using NFS perchance?

Sep 25 '24 12:09 mholt

I spent about a dozen hours debugging this yesterday, and I believe my initial comment was incorrect: rather than the bug persisting, I believe I just hadn't sufficiently cleaned all of the existing corrupt files. There were situations in which there were two trailing bytes at the end of a file (\n}), and my cleanup script didn't account for that.

I am using NFS, but it does appear to support locking correctly with my current mount options--or, at least, in a way that is compatible with this patch. I'm not a huge fan of NFS and generally don't trust it, but it should work with this lock/write pattern. I doubt it would ever make sense to officially support NFS given how fickle it is, but the locking code in certmagic is straightforward enough that I should be able to debug and patch it if there are further issues.

The one thing that still has me a little worried is a disconcerting number of requests to the on-demand ask endpoint. The docs make it sound as though that's to be expected, but it's accompanied by a large number of log entries related to ARI. I don't think it's a bug--it's probably just a coincidence--but I happened to notice it while troubleshooting.

Sep 25 '24 14:09 Zenexer

That's good to hear! Yours was the only feedback so far that it didn't fix the issue, so it's reassuring that it was an oversight.

Sep 25 '24 16:09 francislavoie

That's a relief, thanks for the follow-up.

The one thing that still has me a little worried is a disconcerting number of requests to the on-demand ask endpoint. The docs make it sound as though that's to be expected, but it's accompanied by a large number of log entries related to ARI. I don't think it's a bug--it's probably just a coincidence--but I happened to notice it while troubleshooting.

The ask endpoint can be busy... we could potentially ease this with a bloom filter or something, that we just reset every 5 or 10 minutes (or something like that). But ideally I'd rather the ask endpoint itself do the caching since it knows better logic than we can guess.

I'd be curious if the ARI log entries are redundant (same hostname) or not. I really want that to be fixed (AFAIK it should be already).

Sep 25 '24 17:09 mholt

The ask endpoint can be busy... we could potentially ease this with a bloom filter or something, that we just reset every 5 or 10 minutes (or something like that). But ideally I'd rather the ask endpoint itself do the caching since it knows better logic than we can guess.

I would assume that any Caddy user with that sort of traffic probably has caching on their ask endpoint anyway and can keep it fast. From my perspective, though, I'd like to be able to log exactly when I've told Caddy it was authorized to go out and request a certificate: having that logging on my application helps me with troubleshooting, since I can use that to determine where in the stack a problem is arising that might be leading to excessive certificate requests. As it currently stands, I don't know whether an ask is for a certificate request/renewal, an ARI request, or some other maintenance task being performed by Caddy--it's a black box.

I'd be curious if the ARI log entries are redundant (same hostname) or not. I really want that to be fixed (AFAIK it should be already).

I'll try to figure that out, but Caddy is spitting out gigabytes of log data with on-demand TLS enabled, so I'm still trying to sort out what's important and what's not.

Sep 25 '24 17:09 Zenexer

I think most of the log entries are the result of various hosting providers and registrars trying to request or renew certificates for domains that no longer point to them, with no checking on their end prior to starting the ACME challenge process. That makes it really difficult to tell the difference between legitimate ACME-related error messages and errors that I can safely ignore. (Ugh, I really wish CAs wouldn't waive rate limiting on challenge failures for large integrators--it hurts everyone.) I don't think that should affect ARI log messages, so I'll let docker compose logs -fn0 | grep -F '"updated ACME renewal information"' run and see what turns up.

Sep 25 '24 18:09 Zenexer

I'm not seeing any overlap between ARI requests so far. Each one is unique.

Sep 25 '24 18:09 Zenexer

I'd like to be able to log exactly when I've told Caddy it was authorized to go out and request a certificate; having that logging on my application helps me with troubleshooting, since I can use that to determine where in the stack a problem is arising that might be leading to excessive certificate requests. As it currently stands, I don't know whether an ask is for a certificate request/renewal, an ARI request, or some other maintenance task being performed by Caddy--it's a black box.

To make sure I understand, you want a way for the 'ask' request to distinguish whether a certificate is being obtained or something else?

Currently, the 'ask' endpoint is invoked only when a certificate needs to be obtained or renewed. It does not guard ARI requests or other maintenance per se, though in theory it should be guarding them implicitly, because if you cannot obtain or renew a cert, you cannot maintain it either.

Technically, 'ask' is invoked before even trying to load a certificate from storage (as that can be expensive depending on the storage backend).

So I guess, to your request, I would say that it shouldn't matter, but I'm open to discussing this further if desired.

I think most of the log entries are the result of various hosting providers and registrars trying to request or renew certificates for domains that no longer point to them, with no checking on their end prior to starting the ACME challenge process. That makes it really difficult to tell the difference between legitimate ACME-related error messages and errors that I can safely ignore. (Ugh, I really wish CAs wouldn't waive rate limiting on challenge failures for large integrators--it hurts everyone.) I don't think that should affect ARI log messages, so I'll let docker compose logs -fn0 | grep -F '"updated ACME renewal information"' run and see what turns up.

I was one who advocated for exemptions to the rate limits when conforming to ARI, out of concerns that certificate renewals would be rejected -- sometimes past their expiration -- on account of rate limits, even though it was the CA who specified the renewal window. So to ensure certificates can be renewed even if they have to be squished into a narrow window, Let's Encrypt (rightly) exempts clients from rate limits in that situation. Why do you think it hurts everyone?

Is Caddy attempting to renew lots of certificates for you and failing?

I'm not seeing any overlap between ARI requests so far. Each one is unique.

That's good, so it sounds like the synchronization is working. :+1: Thanks for checking on that.

Sep 25 '24 19:09 mholt

To make sure I understand, you want a way for the 'ask' request to distinguish whether a certificate is being obtained or something else?

Yes, mostly for debugging purposes. If I see that multiple Caddy instances are all trying to get certs at the same time, that's a sign something is amiss. It would likely help when troubleshooting future concurrency issues, but a lightweight plugin could probably serve the same purpose.

I'm still not 100% confident the remaining errors I'm seeing are benign, but I'll have more data over the next few days.

Sep 25 '24 20:09 Zenexer

I was one who advocated for exemptions to the rate limits when conforming to ARI, out of concerns that certificate renewals would be rejected -- sometimes past their expiration -- on account of rate limits, even though it was the CA who specified the renewal window. So to ensure certificates can be renewed even if they have to be squished into a narrow window, Let's Encrypt (rightly) exempts clients from rate limits in that situation. Why do you think it hurts everyone?

Sorry, I realized I forgot to answer this question. I don't have an opinion on the scenario you mentioned. What appears to be happening is twofold:

foo.example used to point to <very large hosting provider>, but now points to me. Said hosting provider doesn't even bother to check whether the domain points to them before trying to renew their certificates. It might not have pointed to them for months, and they just don't have any reason to bother cleaning up.
Vulnerability scanners are hitting /.well-known/acme-challenge/*

Access logs have since shown that the second issue accounts for the majority of the "no challenge data found" warnings I was seeing. Disabling HTTP-01 mostly resolved that.

The first issue is far more problematic: I can't stop other hosting providers from requesting certs, and they're wasting PKI resources. It also wastes my time because they cause concerning log entries. These large hosting providers don't really have any incentive to check whether hosts point to them before requesting a cert; they just offload that to the CA.

Meanwhile, I have to check that a domain points to me--and only to me--when Caddy hits the ask endpoint. Caddy's docs explicitly say I shouldn't do this, but I can't go around requesting certs for domains just because I think they might point to me. I have to actually verify that, then cache the results for a while. If I don't do that, I'll get rate limited pretty quickly.

Sep 27 '24 03:09 Zenexer

@Zenexer It sounds like your ask endpoint needs to check to make sure you are expecting to be getting a certificate for those domain names.

I have to check that a domain points to me--and only to me--when Caddy hits the ask endpoint.

Not exactly; why not check your database (or whatever is relevant to your application/service) to see if you should be expecting to maintain a certificate for a hostname? That's the purpose of the ask endpoint and it should resolve the rate limit problems, yeah?

Sep 27 '24 17:09 mholt

It sounds like your ask endpoint needs to check to make sure you are expecting to be getting a certificate for those domain names.

It already does that. I am expecting to get certificates for those domains.

Scenario 1: I control example.com. Large hosting provider used to control example.com, and they used to have certificates for it. They keep trying to renew those certificates. My ask endpoint needs to return 200, because I need certificates.

Scenario 2: I control example.com. Someone runs Acunetix against example.com. For whatever reason, it starts brute forcing random paths with a prefix of /.well-known/acme-challenge/--for example, /.well-known/acme-challenge/xmlrpc.php. I want certificates for example.com, so the ask endpoint returns 200.

Not exactly; why not check your database (or whatever is relevant to your application/service) to see if you should be expecting to maintain a certificate for a hostname? That's the purpose of the ask endpoint and it should resolve the rate limit problems, yeah?

I do. However, I'm running a free service to which non-technical users can point their domains. They might use my service for a while, decide they don't like it, and point their domain elsewhere. I can't trust that the users are going to maintain a list of domains that point to me, so I do have to validate the A/AAAA records before requesting a certificate. The result gets cached in Redis for a while, so subsequent calls to the ask endpoint are very fast.

Sep 27 '24 17:09 Zenexer

I think to simplify this, what @Zenexer is stating is that a third party actor can initiate a request to /.well-known/acme-challenge/$token - and caddy will hit the ask endpoint even if caddy knows that the challenge is bogus, or the challenge is real but caddy can't solve it (e.g., it was a token that caddy does not know about).

I believe for security reasons caddy already needs to keep track of the list of tokens it has generated when applying for ACME challenges. Checking for a string in a list of strings is probably significantly faster to do first, before sending a request to the ask endpoint (even if it's not recommended for the ask endpoint to take long, it's still a request to a third party thing).

Would it make sense to flip the order of operations here? Caddy does the first initial sanity check (e.g. "I don't even know what this token is, trash it")?

I guess if you have a very slow storage driver, then maybe that operation will be slower... but maybe my naive view is that the storage driver is likely to be faster than the ask endpoint the vast majority of the time, and if the ask request returns a 200 then you're still going to need to hit the storage driver anyway.

Beyond that there's also the problem that ask is operating with far, far less information than caddy is. I suppose the ask request can hook into the same storage and check for the token being there or not. But for that to work, the token would also need to be part of the ask payload.

Sep 27 '24 18:09 aaomidi

I believe for security reasons caddy already needs to keep track of the list of tokens it has generated when applying for ACME challenges.

It does, but that's kept in the storage. What Matt is saying is that in some setups, the storage lookup is more expensive than the ask lookup.

Caddy can be run in a cluster, so it must use the storage to see whether another Caddy instance initiated issuance. It can't rely on an in-memory cache.

The ask endpoint should only do a DB lookup, it should not have side effects. If it has side effects, it's incorrectly implemented.

Would it make sense to flip the order of operations here? Caddy does the first initial sanity check (e.g. "I don't even know what this token is, trash it")?

That's how it was originally, until about a year ago: storage was checked first, but that was bad for some users.

Sep 27 '24 18:09 francislavoie

The ask endpoint should only do a DB lookup, it should not have side effects. If it has side effects, it's incorrectly implemented.

That's impossible. It has to do DNS lookups at any sort of scale, despite the documentation. There's just no use case for on-demand TLS that doesn't necessitate frequently double-checking DNS. Calling it incorrect isn't helpful.

Sep 27 '24 18:09 Zenexer

You should not be doing DNS checks. That doesn't make sense. It's not your ask endpoint's responsibility to do that, it's Caddy's. You should have an allow list of domains in your database that you compare against.

Sep 27 '24 19:09 francislavoie

That's impossible. It has to do DNS lookups at any sort of scale

Not really, this fully depends on the use case defined here imo. In your use case, you're going to need to do a DNS lookup, but that's not necessarily universally applicable.

That's how it was originally, until about a year ago: storage was checked first, but that was bad for some users.

Yeah that makes sense tbh. Computers suck.

Maybe the ability to choose at which stage ask is performed can be useful?

Sep 27 '24 19:09 aaomidi

Not really, this fully depends on the use case defined here imo.

I'm having a hard time envisioning common use cases for on-demand TLS in which the operator of Caddy has exclusive ownership and control over all of the domain names pointed to it.

The obvious use case seems to be a hosting provider or integrator. They need to regularly verify that any domains provided by end users actually, truly point to them before requesting certificates. Caching that information for too long is risky, especially since DNS is prone to misconfiguration.

Sep 27 '24 19:09 Zenexer

I'm having a hard time envisioning common use cases for on-demand TLS in which the operator of Caddy has exclusive ownership and control over all of the domain names pointed to it.

Company setting where they have one global load balancer, but a secondary system checks that a domain is owned by the company before being put into a given database.

This is different from your use case where you let arbitrary users point their domains to your infrastructure with no registration/signup process (e.g. you don't know what's even linked to you until you get a request!)

Both of these use cases should work IMO

Sep 27 '24 19:09 aaomidi

@Zenexer What you should do is have your customers register their domain via your app's settings, and you add it to your DB allow list. Then all the ask endpoint does is compare against that list. That's how it's meant to work.

Sep 27 '24 19:09 francislavoie

@Zenexer’s use case allows people to point their domains to their infrastructure without any registration, etc.

From what I’m understanding, there is no registration or direct user involvement. The only involvement is going to DNS and changing name servers or A records.

Sep 27 '24 19:09 aaomidi

@Zenexer What you should do is have your customers register their domain via your app's settings, and you add it to your DB allow list. Then all the ask endpoint does is compare against that list. That's how it's meant to work.

That's the erroneous assumption that leads to issue 1. Hosting providers assume that just because a domain is registered with them, they will pass DV.

I don't make that assumption. Hosting providers often do, which is why we're here. Caddy sees ACME challenges that were started by other hosting providers who don't bother to check DNS before requesting a cert.

Sep 27 '24 19:09 Zenexer

I don't understand. Your ask endpoint would reject it so Caddy wouldn't issue a cert. I don't see how what "hosting providers" do matters here.

Sep 27 '24 19:09 francislavoie

I don't understand. Your ask endpoint would reject it so Caddy wouldn't issue a cert. I don't see how what "hosting providers" do matters here.

Hosting Provider A runs Caddy with on-demand TLS.
Hosting Provider B also runs Caddy with on-demand TLS.

Alice owns example.com.
Alice points example.com to Hosting Provider A.
Hosting Provider A tries to get a cert for example.com.
Hosting Provider A's ask endpoint returns 200.
Let's Encrypt issues a cert to Hosting Provider A.
Alice changes example.com's DNS to point to Hosting Provider B.
Alice doesn't tell Hosting Provider A that example.com has been moved.
Caddy at Hosting Provider A tries to renew its cert.
Hosting Provider A's ask endpoint sees example.com in its database and returns 200.
Caddy at Hosting Provider A goes to Let's Encrypt and starts the challenge process.
Let's Encrypt attempts to complete the HTTP-01 challenge by making a request to http://example.com/.well-known/acme-challenge/something.
Caddy at Hosting Provider B receives this challenge request.
Caddy at Hosting Provider B receives a 200 response from its ask endpoint.
Caddy at Hosting Provider B checks for challenge data in its storage, but finds nothing. It logs a warning.

This happens dozens of times per second. I'm Hosting Provider B in this scenario.

If I rely on my database, I become Hosting Provider A.

Sep 27 '24 19:09 Zenexer

I don't understand. Your ask endpoint would reject it so Caddy wouldn't issue a cert. I don't see how what "hosting providers" do matters here.

There is no external database to check in this circumstance. There is no UI. There is no app.

There is a page with instructions: if you want to park your domain with ${whatever}, please point your A and AAAA records to ${whatever}.

The database is DNS.

The answer here may be that this is not a supported use case of caddy, but imo this can very easily be a supported use case with slight modifications.

Sep 27 '24 20:09 aaomidi

Just catching up after feeding the baby and running some kids around... sorry!

I've been drafting this reply while several new replies have come in -- I wish GitHub would show that someone was replying or at least show the new replies. I feel like the conversation went off-track and got confused by some things, but maybe I did instead. In any case, here's my attempt to bring it back:

@Zenexer

Thanks for clarifying above. It seems to me you have an extraordinary situation that is not common from what I know of existing large-scale CertMagic deployments. That's not bad, just something that is worthy of discussion/understanding.

Scenario 1: I control example.com. Large hosting provider used to control example.com, and they used to have certificates for it. They keep trying to renew those certificates.

So, in this case, the hosting provider would fail their own ACME challenge, but your server would likely get pinged with a TLS handshake or HTTP request in an attempt to solve the challenge.

The HTTP-01 challenge does not use TLS, so those would not issue a certificate. You'd see junk in your logs, but what's new (it's the Internet).

The TLS-ALPN-01 challenge does use TLS, but with a special ALPN value. When the server sees a handshake of this sort, it only follows a special code path that serves the challenge solution certificate (if it doesn't find one it just returns an error and aborts the handshake).

Neither case will initiate an ACME challenge that you end up failing and getting rate limited for. (If they are, that's a bug I'd like more details on, likely in a separate issue.)

Scenario 2: I control example.com. Someone runs Acunetix against example.com. For whatever reason, it starts brute forcing random paths with a prefix of /.well-known/acme-challenge/--for example, /.well-known/acme-challenge/xmlrpc.php. I want certificates for example.com, so the ask endpoint returns 200.

@aaomidi

I think to simplify this, what @Zenexer is stating is that a third party actor can initiate a request to /.well-known/acme-challenge/$token - and caddy will hit the ask endpoint even if caddy knows that the challenge is bogus, or the challenge is real but caddy can't solve it (e.g., it was a token that caddy does not know about).

The ask endpoint is only invoked if the client tries to establish a TLS handshake using a domain name it does not have a certificate for (and isn't itself a challenge handshake). But this endpoint is for the HTTP-01 challenge, which is HTTP-only. Are the servers accessing this plaintext endpoint over HTTPS? That would be the only way this is possible, but is extremely broken / in violation of spec.

Sep 27 '24 20:09 mholt
