backup: async backup-compact
Continues #286
With this PR, calling backup-compact now returns (almost) immediately, without blocking c-lightning. Compaction runs in a separate thread while CL operates as normal.
Basic principle (a rough sketch in code follows the list):
- record the latest data_version of the current backup
- compact up to data_version into a clone (this is the bulk of the work, with thread switching to keep handling the db_write hook)
- block add_change in the main thread (this blocks CL), refresh the latest data_version and make the clone catch up with it by adding the missing versions
- atomically move clone --> backup and unblock
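As an illustration, here is a minimal, runnable Python sketch of that flow. It uses an in-memory list of (data_version, change) pairs instead of the plugin's real on-disk format, and all names are placeholders rather than the actual FileBackend API:

```python
import threading

class CompactionSketch:
    def __init__(self):
        self.lock = threading.Lock()   # guards the live backup
        self.backup = []               # list of (data_version, change) entries

    def add_change(self, version, change):
        # Called from the db_write hook in the main thread.
        with self.lock:
            self.backup.append((version, change))

    def compact(self):
        # 1. record the latest data_version of the current backup
        with self.lock:
            snapshot = list(self.backup)
        target = snapshot[-1][0] if snapshot else 0

        # 2. bulk of the work: compact everything up to `target` into a clone;
        #    keeping only the newest snapshot entry stands in for real compaction,
        #    and add_change() keeps running in the main thread meanwhile
        clone = snapshot[-1:]

        # 3. block add_change (this briefly blocks CL), let the clone catch up
        #    with versions added after `target`, then swap clone --> backup
        with self.lock:
            clone += [e for e in self.backup if e[0] > target]
            self.backup = clone
```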
Before this PR, a backup-compact call would return a dict of stats when compaction completed.
Now it returns immediately {"result": "compaction started"}, {"result": "compaction still in progress"} or
{"result": {"backupsize": <size_in_bytes>, "version_count": 2}} when there is nothing to compact.
The stats can now be found in the log; I'm not sure who uses these.
This is all pretty much self-contained in FileBackend, so it also works with SocketBackend (which uses FileBackend on the server side). The double-logging issue #232 is also fixed; maybe this can be improved further, see the task list.
Also includes some minor fixes/cleanups and comments, mostly for my own understanding.
I manually tested compacting + restoring a backup and compared it with the original using sqldiff, and also tested with a remote SocketBackend.
With lightningd versions (v0.10.2+) that support "shutdown", an unfinished compaction (.compacting file) is cleaned up when shutting down. For users running v0.10.2 before fix #4959, the plugin waits to be killed last by the shutdown timeout, which seems prudent.
Also retired the deprecated --backup-destination option.
Todo:
- [x] add tests
- ~make entire FileBackend thread safe, i.e. also rewind~ in retrospect, this is overkill
- ~make all logs in FileBackend be forwarded by SocketBackend~ -> out of scope, the server should handle these
- [x] documentation, including an example systemd script for weekly compacting
Other ideas:
- benchmark backup.py and compare with other backup methods
- replace threading with multiprocessing
- autocompact when backup size passes a threshold
548dd09 fixes pip error: pytest-xdist 2.2.1 requires pytest>=6.0.0, but you have pytest 5.4.3 which is incompatible.
when installing requirements-dev.txt, because pyln-testing-0.9.3 needed pytest-5.*
After 548dd09, running python3 -m pip install -r requirements.txt and python3 -m pip install -r requirements-dev.txt, pip list includes:
pyln-bolt7 1.0.2.186
pyln-client 0.10.2.post0
pyln-proto 0.10.2.post0
pyln-testing 0.10.1
The failing integration tests (not backup.py's) don't seem to be caused by the last commit, but by the newer CL version post190 that gets checked out (the previous post171 passed all tests).
test_drain_and_refill fails here because the expected error is not raised:
https://github.com/lightningd/plugins/blob/22923724a99ff2a1d92f753b32b37cd1f0b224a9/drain/test_drain.py#L66-L70
Initially I thought maybe option_zero_htlc_tx_fee got activated, so that the code above no longer raises the error.
The tests started failing at commit 0aad222c2 in #4972, so I assume that enabled it?
edit: option_zero_htlc_tx_fee is not enabled in l1's node features: bool(0x80008808226aa2 & ((1<<22) | (1<<23))) = False
@cdecker I think this is ready for review/merge (if desired). The failing integration tests are not caused by this PR, see above.
@Mergifyio rebase
rebase
✅ Branch has been successfully rebased
Rebased on master, not sure what the mergify rebase did.
Only added two commits, to address type safety and make it easier to review (I hope!).
Wouldn't it be better to make the response synchronous
I looked into that, but it probably requires yet another thread (to keep handling db_write's) and modifications to SocketBackend also.
Keeping the request pending even for long times is practically free and we wouldn't have to poll.
Is there maybe an example of this with the Plugin class? I don't fully understand how it works; can Plugin's io loop return a pending request later?
if not method.background:
# Only if this is not an async (background) call do we need to
# return the result, otherwise the callee will eventually need
# to call request.set_result or request.set_exception to
# return a result or raise an exception.
request.set_result(result)
But the above suggestions are interesting to explore; however, let's first see how this performs in comparison to other backup methods. Any idea how broadly backup.py is used?
Rebased on master, not sure what the mergify rebase did.
No worries, mergify just took your changes and rebased them on the current master branch. This is so we don't create merge commits on merge, but rather just fast-forward the changes on the branch.
Only added two commits, to address type safety and make it easier to review (I hope!).
Definitely good to review. Personally I go with fixups, but our review bot helps here: https://bot.bitcoinstats.com/lightningd/plugins/pull/296/range_diff/548dd09a2b00589a8921f3b9b134d49ddc5b870b..df1c96ade1709e31d371a5be4b4579293f656c65
Wouldn't it be better to make the response synchronous
I looked into that, but it probably requires yet another thread (to keep handling db_write's) and modifications to SocketBackend also.
So you are already creating a new thread here:
https://github.com/lightningd/plugins/pull/296/files#diff-c03113fbe9914787cc6f747d5d0ba792c352b16c6cf7dba7f85d40e179acd090R153
Making the response asynchronous would just mean that we pass the request object to compact(). Then compact() proceeds to do its magic, and when it's done it'll call set_result() or set_exception() to report success or failure. The @plugin.async_method decorator basically means "don't use the return value of the annotated function to return the result to the caller, instead defer that to the request object being used", so when the command handler (the annotated function) returns, pyln will not return the result; instead it'll wait until it gets notified by the request handle, freeing the control flow from the simple call-response that @plugin.method has.
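A minimal sketch of that pattern (assuming pyln-client's Request.set_result/set_exception as described above; the plugin.backend attribute and the stats it returns are placeholders, not the plugin's actual code):

```python
import threading
from pyln.client import Plugin

plugin = Plugin()

@plugin.async_method("backup-compact")
def backup_compact(plugin, request, **kwargs):
    """Hand the long-running compaction to a worker thread; the JSON-RPC
    request stays pending until the worker resolves it."""
    def worker():
        try:
            stats = plugin.backend.compact()   # placeholder for the real backend call
            request.set_result(stats)          # reply to the caller only now
        except Exception as exc:
            request.set_exception(exc)

    threading.Thread(target=worker, daemon=True).start()
    # No return value is used here: resolution is deferred to `request`.

plugin.run()
```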
Keeping the request pending even for long times is practically free and we wouldn't have to poll.
Is there maybe an example of this with the Plugin class? I don't fully understand how it works; can Plugin's io loop return a pending request later?

if not method.background:
    # Only if this is not an async (background) call do we need to
    # return the result, otherwise the callee will eventually need
    # to call request.set_result or request.set_exception to
    # return a result or raise an exception.
    request.set_result(result)

But the above suggestions are interesting to explore; however, let's first see how this performs in comparison to other backup methods. Any idea how broadly backup.py is used?
One example could be the recvmsg method of the noise plugin:
https://github.com/lightningd/plugins/blob/6f2b8fb6374600f9e1e06e57884bde99d85383bf/noise/noise.py#L152-L160
It shows both pseudo-sync flow and async flow: if we already have a message ready to return when we get called we set the result immediately, effectively emulating a @plugin.method since we don't defer resolution, but also don't use the return value of the method as the result:
https://github.com/lightningd/plugins/blob/6f2b8fb6374600f9e1e06e57884bde99d85383bf/noise/noise.py#L158
If on the other hand we don't yet have a message to return, we need to defer returning the result, and free the main thread so we can get hook calls, notification etc, which are used to fill the message queue. We do this by simply remembering that there is a pending request in a list of subscribers.
https://github.com/lightningd/plugins/blob/6f2b8fb6374600f9e1e06e57884bde99d85383bf/noise/noise.py#L160
When we later get an htlc_accepted hook call that tells us about an incoming message, we then iterate through the waiting requests and return the result:
https://github.com/lightningd/plugins/blob/6f2b8fb6374600f9e1e06e57884bde99d85383bf/noise/noise.py#L205-L212
This is with just a single thread, because we either return immediately, or we stash the requests away, waiting for a hook call. In your case we'd simply pass the request to the thread doing the compaction and notify the caller when we're done :-)
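Condensed into a hedged sketch (the method name, the message extraction, and the queues are illustrative, not the actual noise plugin code):

```python
from pyln.client import Plugin

plugin = Plugin()
pending = []   # requests waiting for a message
inbox = []     # messages waiting for a request

@plugin.async_method("recvmsg-sketch")
def recvmsg_sketch(plugin, request):
    if inbox:
        request.set_result(inbox.pop(0))   # pseudo-sync: resolve right away
    else:
        pending.append(request)            # defer: resolved by the hook below

@plugin.hook("htlc_accepted")
def on_htlc_accepted(onion, htlc, plugin, **kwargs):
    msg = onion.get("payload")             # placeholder for real message extraction
    if pending:
        pending.pop(0).set_result(msg)     # wake a waiting caller
    else:
        inbox.append(msg)
    return {"result": "continue"}

plugin.run()
```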
Any chance we could keep the compact() method agnostic of being run in a thread or on the main thread? Maybe a wrapper function that takes care of driving the compaction thread, and handles request, might be a good idea (testing the compaction code is likely easier when it doesn't spawn threads internally).
Thanks a lot for the explanation.
Making the response asynchronous would just mean that we pass the request object to compact()
In your case we'd simply pass the request to the thread doing the compaction and notifying the caller when we're done :-)
For a local FileBackend the above seems possible. But SocketBackend is fully synchronous and also cannot pass a request object to the remote server (where its FileBackend is located).
https://github.com/lightningd/plugins/blob/b65fbc541f1bae3cb90fc21166e860aadd8fb599/backup/socketbackend.py#L215-L219
Any chance we could keep the compact() method agnostic of being run in a thread or on the main thread?
Yes, just lock the backup file when writing data, or when reading metadata.
But all of the above requires SocketBackend's add_change() and compact() methods to be asynchronous. Maybe it can be done with the asyncio module or such, but to me it doesn't seem worth the effort and added complexity.
report size in bytes and number of changes before and after, which should be more informative
Agreed, but who needs this info? As a user (myself) I just want my backup not to fill up my USB or SSD drive, which got to 7GB after a few weeks running clboss. Maybe the backup-compact method can be removed entirely and an auto-compact option with a size parameter added instead, so that FileBackend starts a compacting process when the backup passes a certain size.
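A hedged sketch of that idea (the threshold, the compacting flag, and the start_background_compaction() method are hypothetical, not existing plugin options or API):

```python
import os

AUTOCOMPACT_BYTES = 1 * 1024**3   # hypothetical threshold, e.g. from a plugin option

def maybe_autocompact(backend, backup_path):
    """Kick off a background compaction once the backup file passes the threshold."""
    if getattr(backend, "compacting", False):       # at most one compaction at a time
        return
    if os.path.getsize(backup_path) > AUTOCOMPACT_BYTES:
        backend.start_background_compaction()       # hypothetical method
```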
In your case we'd simply pass the request to the thread doing the compaction and notifying the caller when we're done :-)
For a local FileBackend the above seems possible. But SocketBackend is fully synchronous and also cannot pass a request object to the remote server (where its FileBackend is located).
Good point, since the backend itself is controlling storage medium access we need to have it also drive the async compaction; I hadn't thought about that. So the SocketBackend either doesn't support async compaction, or it has to be adapted in order to do so. Still, the client side would just be a thread holding the incoming request open, waiting for the server to signal that it finished the compaction.
Any chance we could keep the compact() method agnostic of being run in a thread or on the main thread?
Yes, just lock the backup file when writing data, or when reading metadata.
But all of the above requires SocketBackend's add_change() and compact() methods to be asynchronous. Maybe it can be done with the asyncio module or such, but to me it doesn't seem worth the effort and added complexity.
That makes the compaction synchronous again, i.e., the db_write hook will get locked out from the file until the compaction is done. I was thinking of having something like this:
class AsyncBackend(Backend):
    def __init__(self, inner: Backend):
        self.inner = inner  # The actual backend we're using to store the data
        self.lock = threading.Lock()  # Allows the compaction to grab control of the file while doing the atomic swap

    def add_change(self, change):
        with self.lock:
            self.inner.add_change(change)

    def compact(self, request):
        def do_compact():
            self.inner.compact()
            request.set_result("All done")
        threading.Thread(target=do_compact, daemon=True).start()
Though maybe the compaction must be split into two steps: prepare_compact() which can be run in parallel while accepting db_writes and a second much shorter step compact() that needs to have exclusive control over the storage medium, i.e., atomic file swap in the FileBackend case.
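That split could look roughly like this, reusing the AsyncBackend sketch from above (prepare_compact() is the hypothetical long-running step, compact() the short exclusive one):

```python
import threading

class TwoPhaseAsyncBackend:
    """Sketch only: a long parallel phase followed by a short exclusive phase."""

    def __init__(self, inner):
        self.inner = inner
        self.lock = threading.Lock()       # shared with add_change(), as above

    def add_change(self, change):
        with self.lock:
            self.inner.add_change(change)

    def compact(self, request):
        def do_compact():
            # Long phase: build the compacted clone while db_writes continue.
            self.inner.prepare_compact()   # hypothetical split-out step
            # Short phase: exclusive access for the catch-up and atomic swap.
            with self.lock:
                self.inner.compact()
            request.set_result("All done")
        threading.Thread(target=do_compact, daemon=True).start()
```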
Anyway, I like discussing these more architectural things, but happy to merge as is to unblock others waiting on this ^^
report size in bytes and number of changes before and after, which should be more informative
Agreed, but who needs this info? As a user (myself) I just want my backup not to fill up my USB or SSD drive, which got to 7GB after a few weeks running clboss. Maybe the backup-compact method can be removed entirely and an auto-compact option with a size parameter added instead, so that FileBackend starts a compacting process when the backup passes a certain size.
Yeah, I just thought that it'd be nice for the user to see progress (there's the custom notification support that can report progress to the CLI), or an indication that the compaction is still pending. I'm very suspicious if my laptop burns through its battery even though the command I'd blame for it returned in just a second. Just cosmetics at this point really :-)
Any chance we could keep the compact() method agnostic of being run in a thread or on the main thread?
Yes, just lock the backup file when writing data, or when reading metadata.
That makes the compaction synchronous again, i.e., the db_write hook will get locked out from the file until the compaction is done.
That is incorrect; the comment above should be read more precisely, looking at the code inside the with self.lock: context.
lock the backup file when writing data:
https://github.com/lightningd/plugins/blob/df1c96ade1709e31d371a5be4b4579293f656c65/backup/filebackend.py#L76-L87
or when reading metadata:
https://github.com/lightningd/plugins/blob/df1c96ade1709e31d371a5be4b4579293f656c65/backup/filebackend.py#L171-L175
But I forgot to mention the last part where it also locks:
- when adding catch-up changes to the clone that were added (to the backup) while compacting
- during atomic swap https://github.com/lightningd/plugins/blob/df1c96ade1709e31d371a5be4b4579293f656c65/backup/filebackend.py#L228-L235
Though maybe the compaction must be split into two steps
I think it already does pretty much what you suggest here :-)
happy to merge as is to unblock others waiting on this ^^
As you wish, I already run this on my own node. But I was a little worried about users running the current backup.py on v0.10.2 pre #4945 maybe getting "backup is out of date" errors, but I haven't seen these yet. The ones in issue #4994 seem caused by other reasons (power failure).
Rebased on latest master (nothing else). There always appears to be some problem with the testing system, this time a network error?
Collecting pip
WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fea8c01e6d0>: Failed to establish a new connection: [Errno 101] Network is unreachable')': /packages/83/b5/df8640236faa5a3cb80bfafd68e9fb4b22578208b8398c032ccff803f9e0/pip-22.0.2-py3-none-any.whl
WARNING: Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fea8c01e820>: Failed to establish a new connection: [Errno 101] Network is unreachable')': /packages/83/b5/df8640236faa5a3cb80bfafd68e9fb4b22578208b8398c032ccff803f9e0/pip-22.0.2-py3-none-any.whl
WARNING: Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fea8c01e970>: Failed to establish a new connection: [Errno 101] Network is unreachable')': /packages/83/b5/df8640236faa5a3cb80bfafd68e9fb4b22578208b8398c032ccff803f9e0/pip-22.0.2-py3-none-any.whl
WARNING: Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fea8c01eac0>: Failed to establish a new connection: [Errno 101] Network is unreachable')': /packages/83/b5/df8640236faa5a3cb80bfafd68e9fb4b22578208b8398c032ccff803f9e0/pip-22.0.2-py3-none-any.whl
WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fea8c01ec10>: Failed to establish a new connection: [Errno 101] Network is unreachable')': /packages/83/b5/df8640236faa5a3cb80bfafd68e9fb4b22578208b8398c032ccff803f9e0/pip-22.0.2-py3-none-any.whl
ERROR: Could not install packages due to an EnvironmentError: HTTPSConnectionPool(host='files.pythonhosted.org', port=443): Max retries exceeded with url: /packages/83/b5/df8640236faa5a3cb80bfafd68e9fb4b22578208b8398c032ccff803f9e0/pip-22.0.2-py3-none-any.whl (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fea8c01ed60>: Failed to establish a new connection: [Errno 101] Network is unreachable'))
Error: Process completed with exit code 1.
Needs rebase, CI should be working now