lua-resty-auto-ssl
OpenResty process becoming unresponsive. Possibly lock-related?
Hey! Sorry to bother you again, I think the instability that I had the other day (#135) was a red herring and not actually the cause of the outages that we've been having.
I've been seeing some things in our logs that look like the following:
Restarting selected processes ['openresty']
This is coming from supervisor detecting that openresty has become unresponsive. Scrolling to those logs, I see lots of things regarding a "shared memory zone" (I've redacted some duplicates for brevity):
[notice] 24883#24883: signal 15 (SIGTERM) received, exiting
[notice] 24883#24883: signal 14 (SIGALRM) received
[notice] 24883#24883: signal 17 (SIGCHLD) received
[alert] 24883#24883: worker process 24895 exited on signal 9
[alert] 24883#24883: shared memory zone "auto_ssl" was locked by 24895
[notice] 24883#24883: exit
[notice] 31640#31640: using the "epoll" event method
[notice] 31640#31640: openresty/1.11.2.2
[notice] 31640#31640: built by gcc 4.9.2 (Debian 4.9.2-10)
[notice] 31640#31640: OS: Linux 4.4.0-96-generic
[notice] 31640#31640: getrlimit(RLIMIT_NOFILE): 1048576:1048576
[notice] 31640#31640: start worker processes
[notice] 31640#31640: start worker process 31652
[notice] 31652#31652: signal 17 (SIGCHLD) received
[notice] 31652#31652: unknown process 31653 exited with code 0
[notice] 31652#31652: signal 17 (SIGCHLD) received
[notice] 31652#31652: unknown process 31655 exited with code 0
[notice] 31652#31652: *1 [lua] start_sockproc.lua:10: auto-ssl: starting sockproc, context: init_worker_by_lua*
[notice] 31652#31652: signal 17 (SIGCHLD) received
[info] 31652#31652: waitpid() failed (10: No child processes)
This log output led me to the following issues: https://github.com/GUI/lua-resty-auto-ssl/issues/43 https://github.com/18F/api.data.gov/issues/325
And a commit which fixed the locking issue: https://github.com/GUI/lua-resty-auto-ssl/commit/8bcb9ee83c1faae849ad456503f841afb804ad75
One of the error messages from that commit, "failed to unlock:", is also coming up in our logs with the following error message:
[error] 3706#3706: *20191 [lua] ssl_certificate.lua:53: issue_cert_unlock(): auto-ssl: failed to unlock: lock does not match expected value, context: ssl_certificate_by_lua*
Which appears to come from: https://github.com/GUI/lua-resty-auto-ssl/blob/0fe6ab67f091d75821a865d8d71122dc04e8ff0d/lib/resty/auto-ssl/ssl_certificate.lua#L53
Have you ever seen this before? Is there anything I can do to debug this further? I've already bumped the auto_ssl lua_shared_dict to 100MB in an attempt to mitigate this in case we were filling it up.
Tracing through the code to find places that may be calling this, I can see that I have some of these errors too: https://github.com/GUI/lua-resty-auto-ssl/blob/0fe6ab67f091d75821a865d8d71122dc04e8ff0d/lib/resty/auto-ssl/ssl_certificate.lua#L77
With timeout being the reason the lock couldn't be created:
[error] 1850#1850: *576 [lua] ssl_certificate.lua:68: issue_cert(): auto-ssl: failed to obtain lock: timeout, context: ssl_certificate_by_lua*
We're using redis as our storage adapter, if that matters? Thanks again for all of your help with this. If there's anything I can do to contribute back to the code for all that you've done, I'd be more than happy to.
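For context, the relevant bits of our nginx config look roughly like this (simplified; the 100m dict size and the Redis host/port shown here are placeholders, not our exact values):

lua_shared_dict auto_ssl 100m;
lua_shared_dict auto_ssl_settings 64k;

init_by_lua_block {
  auto_ssl = (require "resty.auto-ssl").new()
  -- Store certs in Redis instead of the default local filesystem adapter.
  auto_ssl:set("storage_adapter", "resty.auto-ssl.storage_adapters.redis")
  auto_ssl:set("redis", { host = "127.0.0.1", port = 6379 })
  auto_ssl:init()
}

init_worker_by_lua_block {
  auto_ssl:init_worker()
}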
I haven't seen this before, but hopefully we can get to the bottom of it. How frequently is this cropping up for you? I'm just wondering how easy this is for you to test or reproduce.
If you're game, would you mind patching your installation by replacing the resty/auto-ssl/ssl_certificate.lua and resty/auto-ssl/storage.lua files with the versions from this branch (and then reloading nginx so the changes are live): https://github.com/GUI/lua-resty-auto-ssl/compare/debug-locks Let me know if you need any help tracking down the luarocks installation directory, or with replacing these files with the patched versions.
These patched files will generate more debugging information in the nginx error log file, which we can hopefully use to get to the bottom of it. The increased logging should only take place when new certs are issued or renewed, but depending on your volume of cert issuances, be aware there will be more log data in your nginx error logs.
This branch does make one change to the distributed unlocking process that might help side-step the "lock does not match expected value" error you got (but I'm still not totally sure this is necessary or whether there is a potential race condition scenario here). Otherwise, it mostly adds more verbose logging to the locking/unlocking process so we can hopefully better determine where it's getting stuck.
Also, one small note, but your "failed to obtain lock: timeout" error appears to originate on line 68 (not 77), so I think the error is actually related to the local lock, rather than the distributed lock (but we probably should better distinguish those error messages). I'm not entirely sure if this is relevant or not, but just wanted to note that for debugging purposes.
https://github.com/GUI/lua-resty-auto-ssl/blob/0fe6ab67f091d75821a865d8d71122dc04e8ff0d/lib/resty/auto-ssl/ssl_certificate.lua#L68
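To clarify what I mean by the two locks, here's a rough sketch of the pattern (illustrative only, not the module's actual code; "storage" stands in for whatever storage adapter you've configured, and the function names are made up):

local resty_lock = require "resty.lock"

-- Local lock: serializes cert issuance between workers within a single nginx
-- instance, backed by a shared dict (e.g. the "auto_ssl" one). A timeout here
-- is what produces a "failed to obtain lock: timeout" style error.
local function with_local_lock(domain, issue_fn)
  local lock, new_err = resty_lock:new("auto_ssl")
  if not lock then
    return nil, "failed to create lock: " .. (new_err or "unknown")
  end

  local _, lock_err = lock:lock("issue_cert:" .. domain)
  if lock_err then
    return nil, "failed to obtain lock: " .. lock_err
  end

  local cert, err = issue_fn(domain)
  lock:unlock()
  return cert, err
end

-- Distributed lock: a key in shared storage holding a random value. Unlock
-- only deletes the key if the stored value still matches what this worker
-- set, so a lock that expired and was re-acquired elsewhere isn't clobbered.
-- A mismatch here is the "lock does not match expected value" error.
local function distributed_unlock(storage, domain, lock_rand_value)
  local key = domain .. ":issue_cert_lock"
  local current_value = storage:get(key)
  if current_value ~= lock_rand_value then
    return nil, "lock does not match expected value"
  end
  return storage:delete(key)
end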
It's happening maybe a couple of times an hour. I can't confirm whether this was introduced in v0.12.0 or whether it's been happening for longer than that. I'm also not sure why it's only recently started causing our health checks to fail.
I've patched our production systems using the following commands:
cd /usr/local/openresty/luajit/share/lua/5.1
curl https://github.com/GUI/lua-resty-auto-ssl/commit/dcf9d73eb8b9c3b9ad7d59bd614271117c27af8a.patch -O
patch -p2 < dcf9d73eb8b9c3b9ad7d59bd614271117c27af8a.patch
/usr/local/openresty/bin/openresty -s reload
I'll keep an eye out for the log line that means openresty has restarted and get some more context, hopefully with some useful debug messages!
Thanks.
OK, so it looks like it's possibly triggering that lock failure after a troublesome domain hits an LE rate limit:
err Processing xxxxxxxx.com
err + Signing domains...
err + Generating private key...
err + Generating signing request...
err + Requesting authorization for xxxxxxxx.com...
err err: + ERROR: An error occurred while sending post-request to https://acme-v01.api.letsencrypt.org/acme/new-authz (Status 429)
err
err Details:
err {
err "type": "urn:acme:error:rateLimited",
err "detail": "Error creating new authz :: too many failed authorizations recently: see https://letsencrypt.org/docs/rate-limits/",
err "status": 429
err }
err
err , context: ssl_certificate_by_lua*
error 2018/05/18 22:57:07 [error] 11670#11670: *651 [lua] ssl_certificate.lua:106: issue_cert(): auto-ssl: issuing new certificate failed: dehydrated failure, context: ssl_certificate_by_lua*
error 2018/05/18 22:57:07 [error] 11670#11670: *651 [lua] ssl_certificate.lua:43: issue_cert_unlock(): auto-ssl: DEBUG: Cert unlock for: xxxxxxxx.com, context: ssl_certificate_by_lua*
error 2018/05/18 22:57:07 [error] 11670#11670: *651 [lua] ssl_certificate.lua:45: issue_cert_unlock(): auto-ssl: DEBUG: Cert unlock local for: xxxxxxxx.com, context: ssl_certificate_by_lua*
error 2018/05/18 22:57:07 [error] 11670#11670: *651 [lua] ssl_certificate.lua:53: issue_cert_unlock(): auto-ssl: DEBUG: Cert unlock distributed for: xxxxxxxx.com, context: ssl_certificate_by_lua*
error 2018/05/18 22:57:07 [error] 11670#11670: *651 [lua] storage.lua:129: issue_cert_unlock(): auto-ssl: DEBUG: storage:issue_cert_unlock: key=xxxxxxxx.com:issue_cert_lock lock_rand_value=718c8714c6591935f0adf3f62e6ca5f83556aa7a46257a5c1469cdfcf8225e27, context: ssl_certificate_by_lua*
error 2018/05/18 22:57:07 [error] 11670#11670: *651 [lua] ssl_certificate.lua:56: issue_cert_unlock(): auto-ssl: failed to unlock: lock does not match expected value (expected=718c8714c6591935f0adf3f62e6ca5f83556aa7a46257a5c1469cdfcf8225e27 actual=4b46d57cd023c17cf2978b70533c306e5c50691ef54d484babcfd7ccdb6df795), context: ssl_certificate_by_lua*
For some reason that random value isn't matching?
I am hitting the same rate limit and getting the same error, any news on this? @GUI
I'm going to have to turn off auto ssl as it's constantly crashing now; I can't seem to find a way of improving the situation at all.
@domharrington Can you provide your NGINX conf files so I can compare? I believe it might be an issue with having multi_accept on; active, as I've disabled that and also stopped HTTP > HTTPS redirects, and so far no more crashing has occurred.
Okay, quick update: I am still seeing the process crash out with the locking issue, however it's much less frequent now. Still not ideal. Has anyone else made any progress on this issue?
Try adding another couple of instances behind a load balancer. That's mitigated most of the problems for us.
I haven't been able to reproduce this myself, but thanks for the help in debugging this everyone!
@domharrington: With the patched version from https://github.com/GUI/lua-resty-auto-ssl/compare/debug-locks, have you still been seeing nginx become unresponsive? The fact that you're seeing the "lock does not match expected value" errors is certainly curious and would seem to indicate some type of race condition. If that error was encountered, then I think I see how things could potentially deadlock in master, but the patched version should still unlock things in that case (rather than waiting for the lock values to match). So I'm curious if this situation might explain the unresponsive issues, but if nginx can still get stuck with these patches we still have some other issue to uncover.
@BuckinghamIO: Have you by any chance tried things with the debug patches and lock changes in https://github.com/GUI/lua-resty-auto-ssl/compare/debug-locks? If you're interested in trying those, let me know if you need any help patching your current installation.
Your notes about multi_accept are also interesting. I'm not sure I've used that setting before, but I can certainly debug more with that to see if it could possibly affect things. Regarding the HTTPS redirects you disabled, can you provide more detail on how those redirects were implemented so I can also try debugging with those changes?
Thanks again!
I had the same error yesterday. The problem with mine was my server block was redirecting to https, so the challenge was unable to resolve, and ultimately hit the failure rate limit. I've hit this limit before, and did not have an issue with OpenResty becoming unresponsive before. Not sure why this time was different.
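One way to avoid that is to keep the ACME challenge path on plain HTTP and only redirect everything else, along these lines (a sketch based on the README's port-80 challenge server; the catch-all redirect location is my own illustration):

server {
  listen 80;

  # Serve ACME challenges over plain HTTP so Let's Encrypt can validate them.
  location /.well-known/acme-challenge/ {
    content_by_lua_block {
      auto_ssl:challenge_server()
    }
  }

  # Redirect everything else to HTTPS.
  location / {
    return 301 https://$host$request_uri;
  }
}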
I've pasted the entire log if it is helpful. I separated the [alert] where it occurred ~line 72:
2018/07/02 20:49:54 [error] 1317#1317: *331 [lua] lets_encrypt.lua:41: issue_cert(): auto-ssl: dehydrated failed: env HOOK_SECRET=f0a3a5f93701549339bad858ec895b6f51178177e5727f3e939dfc45d45466eb HOOK_SERVER_PORT=8999 /usr/local/openresty/luajit/bin/resty-auto-ssl/dehydrated --cron --accept-terms --no-lock --domain www.problematic-domain.com --challenge http-01 --config /etc/resty-auto-ssl/letsencrypt/config --hook /usr/local/openresty/luajit/bin/resty-auto-ssl/letsencrypt_hooks status: 256 out: # INFO: Using main config file /etc/resty-auto-ssl/letsencrypt/config
Processing www.problematic-domain.com
+ Signing domains...
+ Creating new directory /etc/resty-auto-ssl/letsencrypt/certs/www.problematic-domain.com ...
+ Generating private key...
+ Generating signing request...
+ Requesting authorization for www.problematic-domain.com...
+ 1 pending challenge(s)
+ Deploying challenge tokens...
+ Responding to challenge for www.problematic-domain.com authorization...
Invalid challenge: DOMAIN=www.problematic-domain.com RESPONSE={
"type": "http-01",
"status": "invalid",
"error": {
"type": "urn:acme:error:connection",
"detail": "Fetching http://www.problematic-domain.com/.well-known/acme-challenge/5GRE1vaqeX8HUJaomYuOPCSz2daTGnd0hpsaHrazAUs: Timeout during connect (likely firewall problem)",
"status": 400
},
"uri": "https://acme-v01.api.letsencrypt.org/acme/challenge/SFxYbrTYujMGri_8HCRoB5lH8ZrbDcsXMbm0Usn2Eo0/5423848866",
"token": "5GRE1vaqeX8HUJaomYuOPCSz2daTGnd0hpsaHrazAUs",
"keyAuthorization": "5GRE1vaqeX8HUJaomYuOPCSz2daTGnd0hpsaHrazAUs.iyFwioEEU2gsGRl51ToulQDLULu1gh7mfp9cpKV_nD8",
"validationRecord": [
{
"url": "http://www.problematic-domain.com/.well-known/acme-challenge/5GRE1vaqeX8HUJaomYuOPCSz2daTGnd0hpsaHrazAUs",
"hostname": "www.problematic-domain.com",
"port": "80",
"addressesResolved": [
"54.33.55.191"
],
"addressUsed": "54.33.55.191"
}
]
}
err: nil, context: ssl_certificate_by_lua*, client: 73.170.48.226, server: 0.0.0.0:443
2018/07/02 20:49:54 [error] 1317#1317: *331 [lua] ssl_certificate.lua:97: issue_cert(): auto-ssl: issuing new certificate failed: dehydrated failure, context: ssl_certificate_by_lua*, client: 73.170.48.226, server: 0.0.0.0:443
2018/07/02 20:49:54 [error] 1317#1317: *331 [lua] ssl_certificate.lua:286: auto-ssl: could not get certificate for www.problematic-domain.com - using fallback - failed to get or issue certificate, context: ssl_certificate_by_lua*, client: 73.170.48.226, server: 0.0.0.0:443
2018/07/02 20:50:07 [error] 1317#1317: *330 [lua] lets_encrypt.lua:41: issue_cert(): auto-ssl: dehydrated failed: env HOOK_SECRET=f0a3a5f93701549339bad858ec895b6f51178177e5727f3e939dfc45d45466eb HOOK_SERVER_PORT=8999 /usr/local/openresty/luajit/bin/resty-auto-ssl/dehydrated --cron --accept-terms --no-lock --domain www.problematic-domain.com --challenge http-01 --config /etc/resty-auto-ssl/letsencrypt/config --hook /usr/local/openresty/luajit/bin/resty-auto-ssl/letsencrypt_hooks status: 256 out: # INFO: Using main config file /etc/resty-auto-ssl/letsencrypt/config
Processing www.problematic-domain.com
+ Signing domains...
+ Generating private key...
+ Generating signing request...
+ Requesting authorization for www.problematic-domain.com...
+ 1 pending challenge(s)
+ Deploying challenge tokens...
+ Responding to challenge for www.problematic-domain.com authorization...
Invalid challenge: DOMAIN=www.problematic-domain.com RESPONSE={
"type": "http-01",
"status": "invalid",
"error": {
"type": "urn:acme:error:connection",
"detail": "Fetching http://www.problematic-domain.com/.well-known/acme-challenge/J-yQV6tHW-NwQous8k26Zq6dbmKOWSwyKrYs1iwF-8I: Timeout during connect (likely firewall problem)",
"status": 400
},
"uri": "https://acme-v01.api.letsencrypt.org/acme/challenge/T1qtVHixtK4L-3ykShUDkSvv3lqdfU8DOi_Chzko5TA/5423850452",
"token": "J-yQV6tHW-NwQous8k26Zq6dbmKOWSwyKrYs1iwF-8I",
"keyAuthorization": "J-yQV6tHW-NwQous8k26Zq6dbmKOWSwyKrYs1iwF-8I.iyFwioEEU2gsGRl51ToulQDLULu1gh7mfp9cpKV_nD8",
"validationRecord": [
{
"url": "http://www.problematic-domain.com/.well-known/acme-challenge/J-yQV6tHW-NwQous8k26Zq6dbmKOWSwyKrYs1iwF-8I",
"hostname": "www.problematic-domain.com",
"port": "80",
"addressesResolved": [
"54.33.55.191"
],
"addressUsed": "54.33.55.191"
}
]
}
err: nil, context: ssl_certificate_by_lua*, client: 73.170.48.226, server: 0.0.0.0:443
2018/07/02 20:50:07 [error] 1317#1317: *330 [lua] ssl_certificate.lua:97: issue_cert(): auto-ssl: issuing new certificate failed: dehydrated failure, context: ssl_certificate_by_lua*, client: 73.170.48.226, server: 0.0.0.0:443
2018/07/02 20:50:07 [error] 1317#1317: *330 [lua] ssl_certificate.lua:286: auto-ssl: could not get certificate for www.problematic-domain.com - using fallback - failed to get or issue certificate, context: ssl_certificate_by_lua*, client: 73.170.48.226, server: 0.0.0.0:443
****** auto_ssl locked *********
2018/07/02 22:37:27 [alert] 1315#1315: worker process 1317 exited on signal 9
2018/07/02 22:37:27 [alert] 1315#1315: shared memory zone "auto_ssl" was locked by 1317
****** auto_ssl locked *********
2018/07/02 22:39:51 [error] 1342#1342: *65 [lua] ssl_certificate.lua:68: issue_cert(): auto-ssl: failed to obtain lock: timeout, context: ssl_certificate_by_lua*, client: 64.78.149.164, server: 0.0.0.0:443
2018/07/02 22:39:51 [error] 1342#1342: *65 [lua] ssl_certificate.lua:286: auto-ssl: could not get certificate for www.problematic-domain.com - using fallback - failed to get or issue certificate, context: ssl_certificate_by_lua*, client: 64.78.149.164, server: 0.0.0.0:443
2018/07/02 22:40:18 [error] 1342#1342: *75 [lua] ssl_certificate.lua:68: issue_cert(): auto-ssl: failed to obtain lock: timeout, context: ssl_certificate_by_lua*, client: 73.170.48.226, server: 0.0.0.0:443
2018/07/02 22:40:18 [error] 1342#1342: *75 [lua] ssl_certificate.lua:286: auto-ssl: could not get certificate for www.problematic-domain.com - using fallback - failed to get or issue certificate, context: ssl_certificate_by_lua*, client: 73.170.48.226, server: 0.0.0.0:443
2018/07/02 22:40:18 [error] 1342#1342: *73 [lua] ssl_certificate.lua:68: issue_cert(): auto-ssl: failed to obtain lock: timeout, context: ssl_certificate_by_lua*, client: 73.170.48.226, server: 0.0.0.0:443
2018/07/02 22:40:18 [error] 1342#1342: *73 [lua] ssl_certificate.lua:286: auto-ssl: could not get certificate for www.problematic-domain.com - using fallback - failed to get or issue certificate, context: ssl_certificate_by_lua*, client: 73.170.48.226, server: 0.0.0.0:443
2018/07/02 22:40:20 [error] 1342#1342: *82 [lua] ssl_certificate.lua:68: issue_cert(): auto-ssl: failed to obtain lock: timeout, context: ssl_certificate_by_lua*, client: 64.78.149.164, server: 0.0.0.0:443
2018/07/02 22:40:20 [error] 1342#1342: *82 [lua] ssl_certificate.lua:286: auto-ssl: could not get certificate for www.problematic-domain.com - using fallback - failed to get or issue certificate, context: ssl_certificate_by_lua*, client: 64.78.149.164, server: 0.0.0.0:443
2018/07/02 22:40:51 [error] 1342#1342: *98 [lua] ssl_certificate.lua:68: issue_cert(): auto-ssl: failed to obtain lock: timeout, context: ssl_certificate_by_lua*, client: 64.78.149.164, server: 0.0.0.0:443
2018/07/02 22:40:51 [error] 1342#1342: *98 [lua] ssl_certificate.lua:286: auto-ssl: could not get certificate for www.problematic-domain.com - using fallback - failed to get or issue certificate, context: ssl_certificate_by_lua*, client: 64.78.149.164, server: 0.0.0.0:443
2018/07/02 22:41:21 [error] 1342#1342: *105 [lua] ssl_certificate.lua:68: issue_cert(): auto-ssl: failed to obtain lock: timeout, context: ssl_certificate_by_lua*, client: 64.78.149.164, server: 0.0.0.0:443
2018/07/02 22:41:21 [error] 1342#1342: *105 [lua] ssl_certificate.lua:286: auto-ssl: could not get certificate for www.problematic-domain.com - using fallback - failed to get or issue certificate, context: ssl_certificate_by_lua*, client: 64.78.149.164, server: 0.0.0.0:443
2018/07/02 22:41:54 [error] 1342#1342: *134 [lua] lets_encrypt.lua:41: issue_cert(): auto-ssl: dehydrated failed: env HOOK_SECRET=53905c0b9dbc62d0b4026d591231694ba00ce38fbb29c0d45f0f86f4441d25d4 HOOK_SERVER_PORT=8999 /usr/local/openresty/luajit/bin/resty-auto-ssl/dehydrated --cron --accept-terms --no-lock --domain www.problematic-domain.com --challenge http-01 --config /etc/resty-auto-ssl/letsencrypt/config --hook /usr/local/openresty/luajit/bin/resty-auto-ssl/letsencrypt_hooks status: 256 out: # INFO: Using main config file /etc/resty-auto-ssl/letsencrypt/config
Processing www.problematic-domain.com
+ Signing domains...
+ Generating private key...
+ Generating signing request...
+ Requesting authorization for www.problematic-domain.com...
+ 1 pending challenge(s)
+ Deploying challenge tokens...
+ Responding to challenge for www.problematic-domain.com authorization...
Invalid challenge: DOMAIN=www.problematic-domain.com RESPONSE={
"type": "http-01",
"status": "invalid",
"error": {
"type": "urn:acme:error:connection",
"detail": "Fetching https://www.problematic-domain.com/.well-known/acme-challenge/4cQiEWb7JmAFhgjTZBfbJHNVKvPuIq6w5aP0wlgzNI0: Error getting validation data",
"status": 400
},
"uri": "https://acme-v01.api.letsencrypt.org/acme/challenge/S_K4QA6wL-BIfVUgsjJu9VESFNMw92vTgRk6WblRV5M/5425284669",
"token": "4cQiEWb7JmAFhgjTZBfbJHNVKvPuIq6w5aP0wlgzNI0",
"keyAuthorization": "4cQiEWb7JmAFhgjTZBfbJHNVKvPuIq6w5aP0wlgzNI0.iyFwioEEU2gsGRl51ToulQDLULu1gh7mfp9cpKV_nD8",
"validationRecord": [
{
"url": "http://www.problematic-domain.com/.well-known/acme-challenge/4cQiEWb7JmAFhgjTZBfbJHNVKvPuIq6w5aP0wlgzNI0",
"hostname": "www.problematic-domain.com",
"port": "80",
"addressesResolved": [
"54.70.191.13"
],
"addressUsed": "54.70.191.13"
},
{
"url": "https://www.problematic-domain.com/.well-known/acme-challenge/4cQiEWb7JmAFhgjTZBfbJHNVKvPuIq6w5aP0wlgzNI0",
"hostname": "www.problematic-domain.com",
"port": "443",
"addressesResolved": [
"54.70.191.13"
],
"addressUsed": "54.70.191.13"
}
]
}
err: nil, context: ssl_certificate_by_lua*, client: 64.78.149.164, server: 0.0.0.0:443
2018/07/02 22:41:54 [error] 1342#1342: *134 [lua] ssl_certificate.lua:97: issue_cert(): auto-ssl: issuing new certificate failed: dehydrated failure, context: ssl_certificate_by_lua*, client: 64.78.149.164, server: 0.0.0.0:443
2018/07/02 22:41:54 [error] 1342#1342: *134 [lua] ssl_certificate.lua:286: auto-ssl: could not get certificate for www.problematic-domain.com - using fallback - failed to get or issue certificate, context: ssl_certificate_by_lua*, client: 64.78.149.164, server: 0.0.0.0:443
2018/07/02 22:42:22 [error] 1342#1342: *155 [lua] ssl_certificate.lua:68: issue_cert(): auto-ssl: failed to obtain lock: timeout, context: ssl_certificate_by_lua*, client: 64.78.149.164, server: 0.0.0.0:443
2018/07/02 22:42:22 [error] 1342#1342: *155 [lua] ssl_certificate.lua:286: auto-ssl: could not get certificate for www.problematic-domain.com - using fallback - failed to get or issue certificate, context: ssl_certificate_by_lua*, client: 64.78.149.164, server: 0.0.0.0:443
2018/07/02 22:42:52 [error] 1342#1342: *176 [lua] ssl_certificate.lua:68: issue_cert(): auto-ssl: failed to obtain lock: timeout, context: ssl_certificate_by_lua*, client: 64.78.149.164, server: 0.0.0.0:443
2018/07/02 22:42:52 [error] 1342#1342: *176 [lua] ssl_certificate.lua:286: auto-ssl: could not get certificate for www.problematic-domain.com - using fallback - failed to get or issue certificate, context: ssl_certificate_by_lua*, client: 64.78.149.164, server: 0.0.0.0:443
2018/07/02 22:43:59 [error] 1342#1342: *225 [lua] lets_encrypt.lua:41: issue_cert(): auto-ssl: dehydrated failed: env HOOK_SECRET=53905c0b9dbc62d0b4026d591231694ba00ce38fbb29c0d45f0f86f4441d25d4 HOOK_SERVER_PORT=8999 /usr/local/openresty/luajit/bin/resty-auto-ssl/dehydrated --cron --accept-terms --no-lock --domain www.problematic-domain.com --challenge http-01 --config /etc/resty-auto-ssl/letsencrypt/config --hook /usr/local/openresty/luajit/bin/resty-auto-ssl/letsencrypt_hooks status: 256 out: # INFO: Using main config file /etc/resty-auto-ssl/letsencrypt/config
Processing www.problematic-domain.com
+ Signing domains...
+ Generating private key...
+ Generating signing request...
+ Requesting authorization for www.problematic-domain.com...
+ 1 pending challenge(s)
+ Deploying challenge tokens...
+ Responding to challenge for www.problematic-domain.com authorization...
Invalid challenge: DOMAIN=www.problematic-domain.com RESPONSE={
"type": "http-01",
"status": "invalid",
"error": {
"type": "urn:acme:error:connection",
"detail": "Fetching https://www.problematic-domain.com/.well-known/acme-challenge/x8kjiHlvQdO_OdHETaDGe8yeIA-STfF1RMYEBO73x8k: Error getting validation data",
"status": 400
},
"uri": "https://acme-v01.api.letsencrypt.org/acme/challenge/T2K0FcEDMpCnD7jfqZZDp-qWQEcAFuQWW4YXKU3JQ4o/5425313962",
"token": "x8kjiHlvQdO_OdHETaDGe8yeIA-STfF1RMYEBO73x8k",
"keyAuthorization": "x8kjiHlvQdO_OdHETaDGe8yeIA-STfF1RMYEBO73x8k.iyFwioEEU2gsGRl51ToulQDLULu1gh7mfp9cpKV_nD8",
"validationRecord": [
{
"url": "http://www.problematic-domain.com/.well-known/acme-challenge/x8kjiHlvQdO_OdHETaDGe8yeIA-STfF1RMYEBO73x8k",
"hostname": "www.problematic-domain.com",
"port": "80",
"addressesResolved": [
"54.70.191.13"
],
"addressUsed": "54.70.191.13"
},
{
"url": "https://www.problematic-domain.com/.well-known/acme-challenge/x8kjiHlvQdO_OdHETaDGe8yeIA-STfF1RMYEBO73x8k",
"hostname": "www.problematic-domain.com",
"port": "443",
"addressesResolved": [
"54.70.191.13"
],
"addressUsed": "54.70.191.13"
}
]
}
err: nil, context: ssl_certificate_by_lua*, client: 64.78.149.164, server: 0.0.0.0:443
2018/07/02 22:43:59 [error] 1342#1342: *225 [lua] ssl_certificate.lua:97: issue_cert(): auto-ssl: issuing new certificate failed: dehydrated failure, context: ssl_certificate_by_lua*, client: 64.78.149.164, server: 0.0.0.0:443
2018/07/02 22:43:59 [error] 1342#1342: *225 [lua] ssl_certificate.lua:286: auto-ssl: could not get certificate for www.problematic-domain.com - using fallback - failed to get or issue certificate, context: ssl_certificate_by_lua*, client: 64.78.149.164, server: 0.0.0.0:443
2018/07/02 22:45:32 [error] 1342#1342: *264 [lua] lets_encrypt.lua:41: issue_cert(): auto-ssl: dehydrated failed: env HOOK_SECRET=53905c0b9dbc62d0b4026d591231694ba00ce38fbb29c0d45f0f86f4441d25d4 HOOK_SERVER_PORT=8999 /usr/local/openresty/luajit/bin/resty-auto-ssl/dehydrated --cron --accept-terms --no-lock --domain www.problematic-domain.com --challenge http-01 --config /etc/resty-auto-ssl/letsencrypt/config --hook /usr/local/openresty/luajit/bin/resty-auto-ssl/letsencrypt_hooks status: 256 out: # INFO: Using main config file /etc/resty-auto-ssl/letsencrypt/config
Processing www.problematic-domain.com
+ Signing domains...
+ Generating private key...
+ Generating signing request...
+ Requesting authorization for www.problematic-domain.com...
err: + ERROR: An error occurred while sending post-request to https://acme-v01.api.letsencrypt.org/acme/new-authz (Status 429)
Details:
{
"type": "urn:acme:error:rateLimited",
"detail": "Error creating new authz :: too many failed authorizations recently: see https://letsencrypt.org/docs/rate-limits/",
"status": 429
}
, context: ssl_certificate_by_lua*, client: 64.78.149.164, server: 0.0.0.0:443
2018/07/02 22:45:32 [error] 1342#1342: *264 [lua] ssl_certificate.lua:97: issue_cert(): auto-ssl: issuing new certificate failed: dehydrated failure, context: ssl_certificate_by_lua*, client: 64.78.149.164, server: 0.0.0.0:443
2018/07/02 22:45:32 [error] 1342#1342: *264 [lua] ssl_certificate.lua:286: auto-ssl: could not get certificate for www.problematic-domain.com - using fallback - failed to get or issue certificate, context: ssl_certificate_by_lua*, client: 64.78.149.164, server: 0.0.0.0:443
2018/07/02 22:45:35 [error] 1342#1342: *303 [lua] lets_encrypt.lua:41: issue_cert(): auto-ssl: dehydrated failed: env HOOK_SECRET=53905c0b9dbc62d0b4026d591231694ba00ce38fbb29c0d45f0f86f4441d25d4 HOOK_SERVER_PORT=8999 /usr/local/openresty/luajit/bin/resty-auto-ssl/dehydrated --cron --accept-terms --no-lock --domain www.problematic-domain.com --challenge http-01 --config /etc/resty-auto-ssl/letsencrypt/config --hook /usr/local/openresty/luajit/bin/resty-auto-ssl/letsencrypt_hooks status: 256 out: # INFO: Using main config file /etc/resty-auto-ssl/letsencrypt/config
Processing www.problematic-domain.com
+ Signing domains...
+ Generating private key...
+ Generating signing request...
+ Requesting authorization for www.problematic-domain.com...
err: + ERROR: An error occurred while sending post-request to https://acme-v01.api.letsencrypt.org/acme/new-authz (Status 429)
Details:
{
"type": "urn:acme:error:rateLimited",
"detail": "Error creating new authz :: too many failed authorizations recently: see https://letsencrypt.org/docs/rate-limits/",
"status": 429
}
, context: ssl_certificate_by_lua*, client: 64.78.149.164, server: 0.0.0.0:443
2018/07/02 22:45:35 [error] 1342#1342: *303 [lua] ssl_certificate.lua:97: issue_cert(): auto-ssl: issuing new certificate failed: dehydrated failure, context: ssl_certificate_by_lua*, client: 64.78.149.164, server: 0.0.0.0:443
2018/07/02 22:45:35 [error] 1342#1342: *303 [lua] ssl_certificate.lua:286: auto-ssl: could not get certificate for www.problematic-domain.com - using fallback - failed to get or issue certificate, context: ssl_certificate_by_lua*, client: 64.78.149.164, server: 0.0.0.0:443
2018/07/02 22:57:20 [emerg] 1148#1148: "server" directive is not allowed here in /usr/local/openresty/nginx/conf/nginx.conf:60
2018/07/02 23:08:12 [error] 1337#1337: *73 [lua] lets_encrypt.lua:41: issue_cert(): auto-ssl: dehydrated failed: env HOOK_SECRET=27862fd1c04cad17fbb6cdffedddbffc5c0b7a615c4cab7e25634e2987dbed2f HOOK_SERVER_PORT=8999 /usr/local/openresty/luajit/bin/resty-auto-ssl/dehydrated --cron --accept-terms --no-lock --domain www.problematic-domain.com --challenge http-01 --config /etc/resty-auto-ssl/letsencrypt/config --hook /usr/local/openresty/luajit/bin/resty-auto-ssl/letsencrypt_hooks status: 256 out: # INFO: Using main config file /etc/resty-auto-ssl/letsencrypt/config
Processing www.problematic-domain.com
+ Signing domains...
+ Generating private key...
+ Generating signing request...
+ Requesting authorization for www.problematic-domain.com...
err: + ERROR: An error occurred while sending post-request to https://acme-v01.api.letsencrypt.org/acme/new-authz (Status 429)
Details:
{
"type": "urn:acme:error:rateLimited",
"detail": "Error creating new authz :: too many failed authorizations recently: see https://letsencrypt.org/docs/rate-limits/",
"status": 429
}
, context: ssl_certificate_by_lua*, client: 77.88.5.23, server: 0.0.0.0:443
2018/07/02 23:08:12 [error] 1337#1337: *73 [lua] ssl_certificate.lua:97: issue_cert(): auto-ssl: issuing new certificate failed: dehydrated failure, context: ssl_certificate_by_lua*, client: 77.88.5.23, server: 0.0.0.0:443
2018/07/02 23:08:12 [error] 1337#1337: *73 [lua] ssl_certificate.lua:286: auto-ssl: could not get certificate for www.problematic-domain.com - using fallback - failed to get or issue certificate, context: ssl_certificate_by_lua*, client: 77.88.5.23, server: 0.0.0.0:443
I believe I'm running into this exact same issue. I'm running a few instances and the master nginx process crashes every hour or two. The last thing in the error log is always similar to this:
{
"type": "urn:acme:error:rateLimited",
"detail": "Error creating new authz :: too many failed authorizations recently: see https://letsencrypt.org/docs/rate-limits/",
"status": 429
}
So far, it's one domain in particular that's causing the issue, so I've instructed resty to ignore it, but is this something I should expect to happen any time we get a rateLimit error?
This also happens for us when we try to request a domain that for some reason is not reachable by the letsencrypt check (for example when DNS has not propagated yet). In this case the logs say nothing and the server does not respond until it's restarted. At the moment we work around this by automatically issuing a restart when this happens, but it would be nice if we could get to the bottom of the locking issue.
Background info: We allow our users to access parts of our service via their own domains, and at the moment nginx has no way of finding out the exact set of domains to allow for issuing letsencrypt certs. So we just disallow issuing for a few known cases (IP addresses for example) but the 'fallback' is to allow requesting a cert.
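Our allow_domain callback is roughly along these lines (simplified; the bare-IP check is just an illustration of the kind of cases we reject, not our full logic):

init_by_lua_block {
  auto_ssl = (require "resty.auto-ssl").new()

  auto_ssl:set("allow_domain", function(domain)
    -- Reject a few known-bad cases, e.g. bare IPv4 addresses
    -- (illustrative check only); the fallback is to allow issuing.
    if domain:match("^%d+%.%d+%.%d+%.%d+$") then
      return false
    end
    return true
  end)

  auto_ssl:init()
}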
If I can provide you with additional information or logs, please let me know!
@benzimmer I've noticed the exact same behavior.
At the moment we work around this by automatically issuing a restart when this happens, but it would be nice if we could get to the bottom of the locking issue.
Are you somehow detecting the error as it happens? If so, how are you doing that?
@joshnabbott I just run monit with an http check on localhost; once it does not respond, it automatically restarts nginx.
@benzimmer nice! I've been running a cron to perform a check with a script I wrote. Not sure why I didn't think about monit first 🙃
Hit this issue almost every hour. Have to run an infinite loop to check Nginx status and kill Nginx processes if they are not responsive.
@RainFlying have you debugged it at all? Maybe if we all compared notes we could figure out what’s causing it. Anytime it’s happened to me and I’ve checked the logs before restarting, the thing in the error log is regarding a dehydrated failure. @benzimmer have you debugged this at all?
Anyway, I put a monit http check on each of my instances and haven’t had an outage since.
Hi, I haven't debugged yet. But after I turned off resty-auto-ssl, the Nginx processes work normally all the time. I guess it may be caused by the number of domains handled by resty-auto-ssl; the single Nginx server was handling SSL certificate issuance for almost one thousand domains.
@RainFlying it could be that, too. We also serve certs for about 1200 domains. Maybe with more workers it would function better?
I set worker_processes to auto, so Nginx should fork enough workers to process those requests? May I ask what number of worker_processes you have set? It may also be caused by lots of failed authorization requests; some of the name servers located in mainland China have issues serving DNS requests from outside China. I have seen quite a few failed requests in the log file.
@RainFlying I just have ours set to the standard one worker per core, so 4 per instance. We run 4 instances total and we serve all requests with ssl. It does seem to be related to failed authorization requests.
@GUI Any update on this? We've also been hit by this problem in a server with just 6 domains. Same urn:acme:error:rateLimited and shared memory zone "auto_ssl" was locked messages as others have already posted.
Getting this with just a couple of domains. Multiple times within an hour.
Having this issue as well, any solutions yet?
@shaunc869 I’d suggest using monit and adding an http check on the host (ours is a simple curl with a timeout flag). It’s not fixing the root of the problem, but we haven’t had issues since we added this.
@shaunc869 Just in case anyone runs into this issue: this could also be caused by Cloudflare's "Always redirect to SSL" option. It cuts off HTTP traffic to the ACME challenge.
You can validate this by checking HTTP status as below:
curl http://domain.com/.well-known/acme-challenge/
Make sure HTTP traffic flows through and that it returns a 404 instead of a 301.
Thanks @Adel-Magebinary for the insight, that's what happens in our case.
Unfortunately for people generating certs for their customers, and thus not having access to their customers' CloudFlare accounts to disable "Always redirect to SSL", there doesn't seem to be any solution. Or is there?
There's a post on CloudFlare requesting that they create an exception for LE validation. Perhaps if there are a lot of requests for it they'll do something about it at some point...
Interestingly though, valid certs seem to be created at some point, regardless of (and prior to) the error.
When we try to reach the domain over HTTPS the request takes some time because auto-ssl is attempting to order a new cert, which fails because of CloudFlare's "Always redirect to SSL", but after the failure the page actually shows up with a valid cert. Yet further requests will again attempt to order a new cert (taking some time, and failing again).
Is it possible LE actually succeeded in accessing /.well-known/acme-challenge/ over a broken HTTPS initially, but reported an error/warning, and auto-ssl is thus trying to re-issue a cert every time the domain is accessed again over HTTPS? If that's the case then there is a possible fix: ignore LE's warning/error when a cert was actually validated earlier. This is just a guess; I'm not familiar with auto-ssl and LE's validation process.
LE's HTTP challenge documentation says this:
Our implementation of the HTTP-01 challenge follows redirects, up to 10 redirects deep. It only accepts redirects to “http:” or “https:”, and only to ports 80 or 443. It does not accept redirects to IP addresses. When redirected to an HTTPS URL, it does not validate certificates (since this challenge is intended to bootstrap valid certificates, it may encounter self-signed or expired certificates along the way).
The highlighted part suggests that an initial challenge over HTTP to /.well-known/acme-challenge/ redirecting to HTTPS (because of CloudFlare's "Always redirect to SSL") would succeed even if the cert is not valid or there is no cert yet.
Our error logs are similar to the ones reported by @bryanus above, except that we get a 403 instead of a 400 error on the challenge. We also notice, despite the validation status being "invalid", the presence of "validated": "2021-12-02T11:44:48Z" (32 seconds earlier). At the given date and time we find no auto-ssl error.
Initial failure:
2021/12/02 11:45:20 [error] 10122#10122: *1102095 [lua] lets_encrypt.lua:40: issue_cert(): auto-ssl: dehydrated failed: env HOOK_SECRET=obfuscatedsecret HOOK_SERVER_PORT=8999 /usr/local/bin/resty-auto-ssl/dehydrated --cron --accept-terms --no-lock --domain obfuscatedomain.com --challenge http-01 --config /etc/resty-auto-ssl/letsencrypt/config --hook /usr/local/bin/resty-auto-ssl/letsencrypt_hooks status: 256 out: # INFO: Using main config file /etc/resty-auto-ssl/letsencrypt/config
startup_hook
Processing obfuscatedomain.com
+ Creating new directory /etc/resty-auto-ssl/letsencrypt/certs/obfuscatedomain.com ...
+ Signing domains...
+ Generating private key...
+ Generating signing request...
+ Requesting new certificate order from CA...
+ Received 1 authorizations URLs from the CA
+ Handling authorization for obfuscatedomain.com
+ 1 pending challenge(s)
+ Deploying challenge tokens...
deploy_challenge
+ Responding to challenge for obfuscatedomain.com authorization...
invalid_challenge
Invalid challenge: DOMAIN=obfuscatedomain.com RESPONSE={
"type": "http-01",
"status": "invalid",
"error": {
"type": "urn:ietf:params:acme:error:unauthorized",
"detail": "Invalid response from https://obfuscatedomain.com/.well-known/acme-challenge/OZ08EpHakmp-jz12UerK8bHsSQORKIEQxQWosjztF-w [2606:4700:20::681a:628]: \"\u003chtml\u003e\\n\u003chead\u003e\u00
3ctitle\u003e404 Not Found\u003c/title\u003e\u003c/head\u003e\\n\u003cbody\u003e\\n\u003ccenter\u003e\u003ch1\u003e404 Not Found\u003c/h1\u003e\u003c/center\u003e\\n\u003chr\u003e\u003ccenter\u003eopenre
sty\u003c/center\u003e\\n\u003c\"",
"status": 403
},
"url": "https://acme-v02.api.letsencrypt.org/acme/chall-v3/54755184260/WMuMfw",
"token": "OZ08EpHakmp-jz12UerK8bHsSQORKIEQxQWosjztF-w",
"validationRecord": [
{
"url": "http://obfuscatedomain.com/.well-known/acme-challenge/OZ08EpHakmp-jz12UerK8bHsSQORKIEQxQWosjztF-w",
"hostname": "obfuscatedomain.com",
"port": "80",
"addressesResolved": [
"104.26.7.40",
"104.26.6.40",
"172.67.74.121",
"2606:4700:20::ac43:4a79",
"2606:4700:20::681a:628",
"2606:4700:20::681a:728"
],
"addressUsed": "2606:4700:20::ac43:4a79"
},
{
"url": "https://obfuscatedomain.com/.well-known/acme-challenge/OZ08EpHakmp-jz12UerK8bHsSQORKIEQxQWosjztF-w",
"hostname": "obfuscatedomain.com",
"port": "443",
"addressesResolved": [
"172.67.74.121",
"104.26.6.40",
"104.26.7.40",
"2606:4700:20::681a:628",
"2606:4700:20::ac43:4a79",
"2606:4700:20::681a:728"
],
"addressUsed": "2606:4700:20::681a:628"
}
],
"validated": "2021-12-02T11:44:48Z"
}
err: nil, context: ssl_certificate_by_lua*, client: 162.158.91.18, server: 0.0.0.0:443
2021/12/02 11:45:20 [error] 10122#10122: *1102095 [lua] ssl_certificate.lua:97: issue_cert(): auto-ssl: issuing new certificate failed: dehydrated failure, context: ssl_certificate_by_lua*, client: 162.158.91.18, server: 0.0.0.0:443
2021/12/02 11:45:20 [error] 10122#10122: *1102095 [lua] ssl_certificate.lua:53: issue_cert_unlock(): auto-ssl: failed to unlock: lock does not match expected value, context: ssl_certificate_by_lua*, client: 162.158.91.18, server: 0.0.0.0:443
2021/12/02 11:45:20 [error] 10122#10122: *1102095 [lua] ssl_certificate.lua:291: auto-ssl: could not get certificate for obfuscatedomain.com - using fallback - failed to get or issue certificate, context: ssl_certificate_by_lua*, client: 162.158.91.18, server: 0.0.0.0:443
After a while:
2021/12/03 10:22:09 [error] 15347#15347: *1556412 [lua] lets_encrypt.lua:40: issue_cert(): auto-ssl: dehydrated failed: env HOOK_SECRET=obfuscatedsecret HOOK_SERVER_PORT=8999 /usr/local/bin/resty-auto-ssl/dehydrated --cron --accept-terms --no-lock --domain obfuscatedomain.com --challenge http-01 --config /etc/resty-auto-ssl/letsencrypt/config --hook /usr/local/bin/resty-auto-ssl/letsencrypt_hooks status: 256 out: # INFO: Using main config file /etc/resty-auto-ssl/letsencrypt/config
startup_hook
Processing obfuscatedomain.com
+ Signing domains...
+ Generating private key...
+ Generating signing request...
+ Requesting new certificate order from CA...
err: + ERROR: An error occurred while sending post-request to https://acme-v02.api.letsencrypt.org/acme/new-order (Status 429)
Details:
HTTP/2 429
server: nginx
date: Fri, 03 Dec 2021 10:22:09 GMT
content-type: application/problem+json
content-length: 201
boulder-requester: 123900029
cache-control: public, max-age=0, no-cache
link: <https://acme-v02.api.letsencrypt.org/directory>;rel="index"
replay-nonce: 0102O2mIbRuBK2Fv2Kq5eZhD8XabKKslr_lLStswyR2qQrc
{
"type": "urn:ietf:params:acme:error:rateLimited",
"detail": "Error creating new order :: too many failed authorizations recently: see https://letsencrypt.org/docs/rate-limits/",
"status": 429
}
, context: ssl_certificate_by_lua*, client: 162.158.252.253, server: 0.0.0.0:443
2021/12/03 10:22:09 [error] 15347#15347: *1556412 [lua] ssl_certificate.lua:97: issue_cert(): auto-ssl: issuing new certificate failed: dehydrated failure, context: ssl_certificate_by_lua*, client: 162.158.252.253, server: 0.0.0.0:443
2021/12/03 10:22:09 [error] 15347#15347: *1556412 [lua] ssl_certificate.lua:291: auto-ssl: could not get certificate for obfuscatedomain.com - using fallback - failed to get or issue certificate, context: ssl_certificate_by_lua*, client: 162.158.252.253, server: 0.0.0.0:443
@DoobleD @GUI Maybe there is a simple fix: trigger the certificate renewal one week before its expiration date?
Then we won't run into the issue of an invalid certificate, and the ACME challenge path will still be accessible via SSL.
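As far as I can tell, the related knob the library exposes is renew_check_interval, which only controls how often the background renewal job runs, not how far ahead of expiry certs are renewed; something like:

init_by_lua_block {
  auto_ssl = (require "resty.auto-ssl").new()
  -- How often (in seconds) the background renewal check runs.
  auto_ssl:set("renew_check_interval", 86400)
  auto_ssl:init()
}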
Thanks for the suggestion @Adel-Magebinary! We've let the error run for now, we'll perhaps try that idea when we get back to it.