certmagic icon indicating copy to clipboard operation
certmagic copied to clipboard

Storage check -- can it be removed?

Open ankon opened this issue 2 years ago • 9 comments

https://github.com/caddyserver/certmagic/blob/76f61c2947a20d86ca37669dbdc0ed7a96fc6c5f/config.go#L456-L461

Depending on the storage this check can be quite expensive (in our case a remote KV store), and it happens in a rather "hot" area: Right when setting up a TLS connection. If we need to get a certificate from a CA this doesn't matter too much in the grand scheme of things, but if our storage has one we triple our roundtrips (1 to get the cert information from storage, and 2 for a write/read from storage to check it).

I think this check might have some value, but not here:

  • If the storage is generally trustworthy, it should be checked before it gets configured for certmagic
  • If the storage is generally untrustworthy and needs regular checks, it should be part of that specific storage

So, I'd propose to just drop these lines here, and possibly move the implementation of checkStorage as an example into documentation. WDYT?

ankon avatar Aug 24 '22 08:08 ankon

I'm open to optimizations, but first:

It should already only do the check if a certificate is needed, which should only be "hot" once every 60 days or so -- and even that runs in the background except the very first time. Shouldn't be blocking a TLS handshake after the first time it gets a certificate.

What kinds of performance impacts are you experiencing? How have you verified that the storage checks are causing performance problems?

mholt avatar Aug 24 '22 17:08 mholt

When the server using on-demand TLS with certmagic starts up, its in-memory caches are empty, and that's what I'm trying to optimize right now. Our Storage is using AWS DynamoDB (based on https://github.com/silinternational/certmagic-storage-dynamodb), and as much that scales, it does see quite some load I'd like to get rid off.

But, you're right indeed ... before it would do the storage check, obtainCert first checks with storageHasCertResourcesAnyIssuer for any existing resources. This call should return true (our implementation of Exists internally does a Load and checks for whether that worked), and following the code from there I think it will pick up the certs through an actual Storage.Load in Config.CacheManagedCertificate/Config.loadManagedCertificate.

Leaves me still with some not-so-nice performance if the above is true: I should end up with 3 distinct read requests in storageHasCertResourcesAnyIssuer, and another 3 in loadManagedCertificate. I'll have to measure this more closely to confirm this theory. Ideally I'd like this number to be 1, of course :)

I guess I could get there by a) Have a tiny internal cache with a short expiration for Load, so that the second load just grabs from that cache rather than roundtrip b) recognize the storageHasCertResourcesAnyIssuer and pre-load all keys into that cache (in DynamoDB through BatchGetItem or possibly by using a smart query/index operation)

It also IMHO still leaves the question open whether the storage check would not be better in the storage itself, so that an implementation could decide how to check?

ankon avatar Aug 24 '22 20:08 ankon

The point of the storage check is to ensure the storage is properly configured, it's not really dependent upon how "trustworthy" the storage mechanism is.

It's tempting to say that just the file system needs such a check to ensure there's enough disk space, but actually most problems I've heard of in production stem from getting its CertMagic config wrong, or the actual storage system itself is misconfigured.

Maybe we could make storage checks optional, but disabling them poses a risk too.

(I suppose ideally we'd optimize them. I have a TODO that they don't need to happen so frequently, but haven't gotten around to implementing that. Even if it's just a simple memory of last time the storage was checked.)

mholt avatar Aug 24 '22 21:08 mholt

What about sth like this:

diff --git a/config.go b/config.go
index ed155d5..266415d 100644
--- a/config.go
+++ b/config.go
@@ -976,11 +976,19 @@ func (cfg *Config) getChallengeInfo(ctx context.Context, identifier string) (Cha
 	return Challenge{Challenge: chalInfo}, true, nil
 }
 
-// checkStorage tests the storage by writing random bytes
-// to a random key, and then loading those bytes and
+// checkStorage tests the storage
+//
+// If the storage implements `StorageChecker` its StorageChecker.Check
+// method will be called, otherwise the storage is checked by writing
+// random bytes to a random key, and then loading those bytes and
 // comparing the loaded value. If this fails, the provided
 // cfg.Storage mechanism should not be used.
 func (cfg *Config) checkStorage(ctx context.Context) error {
+	if sc, ok := cfg.Storage.(StorageChecker); ok {
+		// Storage knows how to check itself, so let it do that.
+		return sc.Check(ctx)
+	}
+
 	key := fmt.Sprintf("rw_test_%d", weakrand.Int())
 	contents := make([]byte, 1024*10) // size sufficient for one or two ACME resources
 	_, err := weakrand.Read(contents)
diff --git a/storage.go b/storage.go
index 0bb34cd..4fc0cfe 100644
--- a/storage.go
+++ b/storage.go
@@ -81,6 +81,12 @@ type Storage interface {
 	Stat(ctx context.Context, key string) (KeyInfo, error)
 }
 
+type StorageChecker interface {
+	// Check returns whether the storage is "good to go" or
+	// whether it cannot be used right now
+	Check(ctx context.Context) error
+}
+
 // Locker facilitates synchronization across machines and networks.
 // It essentially provides a distributed named-mutex service so
 // that multiple consumers can coordinate tasks and share resources.

For me it would allow me to implement the Check method (for instance by only checking once), for existing code nothing would change, and other more advanced storages could indeed implement some "once in while" behavior without needing a change in certmagic itself?

ankon avatar Aug 25 '22 07:08 ankon

Maybe, I still feel like storage problems are more related to configuration than actual implementation (usually -- assuming a proper implementation and a capable storage backend).

I feel like a switch on or off in the config is probably a better solution to this. Would that be OK with you?

mholt avatar Aug 26 '22 04:08 mholt

I feel like a switch on or off in the config is probably a better solution to this. Would that be OK with you?

Sure, that would work for us! Shall I make a PR to that effect?

ankon avatar Aug 26 '22 08:08 ankon

@ankon Sorry for my delayed reply. Yes, I'd be happy to review that. Probably have the checks enabled by default, with the option to disable them.

mholt avatar Sep 01 '22 17:09 mholt

I would also love to see an option to disable this. Our use-case would be the same as @ankon but for a S3 storage driver :)

mbardelmeijer avatar Sep 15 '22 14:09 mbardelmeijer

@mbardelmeijer That's good to know. But be aware that S3 is not safe to use as a storage backend by itself, since it cannot support atomic operations.

mholt avatar Sep 15 '22 15:09 mholt

@mbardelmeijer @ankon I've implemented this in 5357574. Please feel free to try it out!

mholt avatar May 06 '23 02:05 mholt

Thanks a lot!

We just rolled this out for a spin, and according to our database statistics this looks great -- no more rw_test_ keys in the "top contributors" diagrams at all. We're going to leave it running for a while now, and see whether overall statistics also show improvements.

ankon avatar May 11 '23 11:05 ankon

Awesome, thanks for giving it a spin!

mholt avatar May 11 '23 14:05 mholt