
backup: async backup-compact

Open SimonVrouwe opened this issue 4 years ago • 13 comments

Continues #286

With this PR, calling backup-compact now returns (almost) immediately, without blocking c-lightning. Compaction runs in a separate thread, while CL operates normally.

Basic principle (sketched in pseudocode below):

  1. record the latest data_version of the current backup
  2. compact up to that data_version into a clone (this is the bulk of the work, with thread switching to handle the db_write hook)
  3. block add_change in the main thread (this blocks CL), refresh the latest data_version and make the clone catch up with it by adding the missing versions
  4. atomically move clone-->backup and unblock
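
A rough pseudocode sketch of these four steps; the helper names (latest_data_version, append_to_backup, copy_up_to, catch_up, atomic_swap) are placeholders for illustration, not the actual FileBackend methods:

import threading

lock = threading.Lock()                    # serializes add_change and the swap

def add_change(change):                    # called for every db_write hook
    with lock:
        append_to_backup(change)           # placeholder for the real append

def compact():                             # runs in its own thread
    target = latest_data_version()         # 1. snapshot the current version
    clone = copy_up_to(target)             # 2. bulk of the work, CL keeps running
    with lock:                             # 3. briefly block add_change and
        catch_up(clone, latest_data_version())  #    add the missed versions
        atomic_swap(clone)                 # 4. clone --> backup, then unblock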

Before this PR, a backup-compact call would return a dict of stats when compaction completed. Now it returns immediately with {"result": "compaction started"}, {"result": "compaction still in progress"} or, when there is nothing to compact, {"result": {"backupsize": <size_in_bytes>, "version_count": 2}}. The stats can be found in the log; no idea who uses these.

It is all pretty much self-contained in FileBackend, so it also works with SocketBackend (which uses FileBackend on the server side). The double-logging issue #232 is also fixed; maybe this can be improved further, see the task list. Also includes some minor fixes/cleanups and comments, mostly for my own understanding.

I manually tested compacting + restoring a backup, ran sqldiff against the original, and also tested with a remote SocketBackend. With lightningd versions (v0.10.2+) that support "shutdown", an unfinished compaction (leftover .compacting file) is cleaned up when shutting down. For users running v0.10.2 before fix #4959, the plugin waits to be killed last by the timeout in shutdown, which seems prudent.

Also retired the deprecated --backup-destination option.

Todo:

  • [x] add tests
  • ~make the entire FileBackend thread safe, i.e. also rewind~ in retrospect, this is overkill
  • ~make all logs in FileBackend be forwarded by SocketBackend~ -> out of scope, the server should handle these
  • [x] documentation, include example systemd script for weekly compacting

Other ideas:

  • benchmark backup.py and compare with other backup methods
  • replace threading with multiprocessing
  • autocompact when backup size passes a threshold

SimonVrouwe avatar Sep 02 '21 14:09 SimonVrouwe

548dd09 fixes the pip error "pytest-xdist 2.2.1 requires pytest>=6.0.0, but you have pytest 5.4.3 which is incompatible." when installing requirements-dev.txt, because pyln-testing-0.9.3 needed pytest-5.*

After 548dd09, running python3 -m pip install -r requirements.txt and python3 -m pip install -r requirements-dev.txt, pip list includes:

pyln-bolt7                    1.0.2.186
pyln-client                   0.10.2.post0
pyln-proto                    0.10.2.post0
pyln-testing                  0.10.1

The failing integration tests (not backup.py) don't seem to be caused by the last commit, but by the newer CL version post190 being checked out (the previous post171 passed all tests).

SimonVrouwe avatar Dec 30 '21 11:12 SimonVrouwe

test_drain_and_refill fails here, by not raising an error: https://github.com/lightningd/plugins/blob/22923724a99ff2a1d92f753b32b37cd1f0b224a9/drain/test_drain.py#L66-L70

Initially I thought maybe option_zero_htlc_tx_fee got activated, which would make the above not raise the error. The tests started failing at commit 0aad222c2 in #4972, so I assume that enabled it?

edit: option_zero_htlc_tx_fee is not enabled in l1's node features: bool(0x80008808226aa2 & ((1<<22) | (1<<23))) = False
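
For reference, the inline check above just tests the even/odd feature bit pair (bits 22 and 23) against the node's feature bitmap:

features = 0x80008808226aa2        # l1's feature bitmap from the test
mask = (1 << 22) | (1 << 23)       # even/odd bits of the feature pair
print(bool(features & mask))       # prints False, so the feature is not set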

SimonVrouwe avatar Dec 30 '21 17:12 SimonVrouwe

@cdecker I think this is ready for review/merge (if desired). The failing integration tests are not caused by this PR, see above.

SimonVrouwe avatar Jan 03 '22 09:01 SimonVrouwe

@Mergifyio rebase

cdecker avatar Jan 04 '22 17:01 cdecker

rebase

✅ Branch has been successfully rebased

mergify[bot] avatar Jan 04 '22 17:01 mergify[bot]

Rebased on master, not sure what the mergify rebase did.

Only added two commits, to address type safety and make it easier to review (I hope!).

Wouldn't it be better to make the response synchronous

I looked into that, but it probably requires yet another thread (to keep handling db_write's) and modifications to SocketBackend also.

Keeping the request pending even for long times is practically free and we wouldn't have to poll.

Is there maybe an example of this with the Plugin class? I don't fully understand how it works; can Plugin's io loop return a pending request later?

            if not method.background:
                # Only if this is not an async (background) call do we need to
                # return the result, otherwise the callee will eventually need
                # to call request.set_result or request.set_exception to
                # return a result or raise an exception.
                request.set_result(result)

But the above suggestions are interesting to explore; however, let's first see how this performs in comparison to other backup methods. Any idea how broadly backup.py is used?

SimonVrouwe avatar Jan 06 '22 12:01 SimonVrouwe

Rebased on master, not sure what the mergify rebase did.

No worries, mergify just took your changes and rebased them on the current master branch. This is so we don't create merge commits on merge, but rather just fast-forward the changes on the branch.

Only added two commits, to address type safety and make it easier to review (I hope!).

Definitely good to review. Personally I go with fixups, but our review bot helps here: https://bot.bitcoinstats.com/lightningd/plugins/pull/296/range_diff/548dd09a2b00589a8921f3b9b134d49ddc5b870b..df1c96ade1709e31d371a5be4b4579293f656c65

Wouldn't it be better to make the response synchronous

I looked into that, but it probably requires yet another thread (to keep handling db_write's) and modifications to SocketBackend also.

So you are already creating a new thread here:

https://github.com/lightningd/plugins/pull/296/files#diff-c03113fbe9914787cc6f747d5d0ba792c352b16c6cf7dba7f85d40e179acd090R153

Making the response asynchronous would just mean that we pass the request object to compact(). Then compact() proceeds to do its magic, and when it's done it'll call set_result() or set_exception() to report success or failure. The @plugin.async_method decorator basically means "don't use the return value of the annotated function to return the result to the caller, instead defer that to the request object being used", so when the command handler (the annotated function) returns, pyln will not return the result; instead it'll wait until it gets notified via the request handle, freeing the control flow from the simple call-response that @plugin.method has.
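
A minimal sketch of that pattern, assuming pyln-client's Plugin and its async_method decorator; the method name and the work done in the worker thread are placeholders only:

from threading import Thread
from pyln.client import Plugin

plugin = Plugin()

@plugin.async_method("backup-compact")
def backup_compact(plugin, request, **kwargs):
    def worker():
        try:
            result = {"result": "compaction finished"}   # stand-in for the real work
            request.set_result(result)                   # resolves the pending call
        except Exception as e:
            request.set_exception(e)
    Thread(target=worker, daemon=True).start()
    # No return value here: pyln keeps the JSON-RPC request pending until
    # set_result()/set_exception() is called from the worker thread.

plugin.run()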

Keeping the request pending even for long times is practically free and we wouldn't have to poll.

Is there maybe an example of this with the Plugin class? I don't fully understand how it works; can Plugin's io loop return a pending request later?

            if not method.background:
                # Only if this is not an async (background) call do we need to
                # return the result, otherwise the callee will eventually need
                # to call request.set_result or request.set_exception to
                # return a result or raise an exception.
                request.set_result(result)

But the above suggestions are interesting to explore; however, let's first see how this performs in comparison to other backup methods. Any idea how broadly backup.py is used?

One example could be the recvmsg method of the noise plugin:

https://github.com/lightningd/plugins/blob/6f2b8fb6374600f9e1e06e57884bde99d85383bf/noise/noise.py#L152-L160

It shows both the pseudo-sync flow and the async flow: if we already have a message ready to return when we get called, we set the result immediately, effectively emulating a @plugin.method since we don't defer resolution, but we also don't use the return value of the method as the result:

https://github.com/lightningd/plugins/blob/6f2b8fb6374600f9e1e06e57884bde99d85383bf/noise/noise.py#L158

If, on the other hand, we don't yet have a message to return, we need to defer returning the result and free the main thread so we can get hook calls, notifications etc., which are used to fill the message queue. We do this by simply remembering that there is a pending request in a list of subscribers.

https://github.com/lightningd/plugins/blob/6f2b8fb6374600f9e1e06e57884bde99d85383bf/noise/noise.py#L160

When we later get an htlc_accepted hook call that tells us about an incoming message, we iterate through the waiting requests and return the result:

https://github.com/lightningd/plugins/blob/6f2b8fb6374600f9e1e06e57884bde99d85383bf/noise/noise.py#L205-L212

This is with just a single thread, because we either return immediately, or we stash the requests away, waiting for a hook call. In your case we'd simply pass the request to the thread doing the compaction and notify the caller when we're done :-)
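
In outline, that stash-and-resolve pattern looks roughly like this (a sketch only; the message handling is illustrative and not the actual noise.py code):

from pyln.client import Plugin

plugin = Plugin()
messages = []   # decoded messages ready to hand out
pending = []    # requests parked until a message arrives

@plugin.async_method("recvmsg")
def recvmsg(plugin, request, **kwargs):
    if messages:                              # pseudo-sync: resolve immediately
        request.set_result(messages.pop(0))
    else:                                     # async: park the request and return
        pending.append(request)

@plugin.hook("htlc_accepted")
def on_htlc_accepted(onion, plugin, **kwargs):
    msg = onion.get("payload")                # stand-in for the real message decoding
    while pending:
        pending.pop(0).set_result(msg)
    return {"result": "continue"}             # let the HTLC proceed as usual

plugin.run()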

Any chance we could keep the compact() method agnostic of being run in a thread or on the main thread? Maybe a wrapper function that takes care of driving the compaction thread and handles the request might be a good idea (testing the compaction code is likely easier when it doesn't spawn threads internally).

cdecker avatar Jan 07 '22 13:01 cdecker

Thanks a lot for the explanation.

Making the response asynchronous would just mean that we pass the request object to compact()

In your case we'd simply pass the request to the thread doing the compaction and notifying the caller when we're done :-)

For a local FileBackend the above seems possible. But SocketBackend is fully synchronous and also cannot pass a request object to the remote server (where its FileBackend is located). https://github.com/lightningd/plugins/blob/b65fbc541f1bae3cb90fc21166e860aadd8fb599/backup/socketbackend.py#L215-L219

Any chance we could keep the compact() method agnostic of being run in a thread or on the main thread?

Yes, just lock the backup file when writing data, or when reading metadata.

But all of the above requires SocketBackend's add_change() and compact() methods to be asynchronous. Maybe it can be done with an async module or such, but to me it doesn't seem worth the effort and added complexity.

report size in bytes and number of changes before and after, which should be more informative

Agreed, but who needs this info? As a user (myself) I just want my backup not to fill up my USB or SSD drive; it got to 7GB after a few weeks of running clboss. Maybe the backup-compact method can be removed entirely and instead an auto-compact option with a size parameter added, so that FileBackend starts a compaction when the backup passes a certain size.
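
A rough sketch of what such a trigger could look like; the threshold option, the backend.path attribute and the start_compaction() helper are all hypothetical:

import os

def maybe_autocompact(backend, threshold_bytes):
    # Illustrative only: once the backup file grows past the configured
    # threshold, kick off the same background compaction backup-compact uses.
    if os.path.getsize(backend.path) > threshold_bytes:
        start_compaction(backend)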

SimonVrouwe avatar Jan 08 '22 13:01 SimonVrouwe

In your case we'd simply pass the request to the thread doing the compaction and notifying the caller when we're done :-)

For a local FileBackend the above seems possible. But SocketBackend is fully synchronous and also cannot pass a request object to the remote server (where its FileBackend is located).

Good point: since the backend itself controls access to the storage medium, we need to have it also drive the async compaction; I hadn't thought about that. So the SocketBackend either doesn't support async compaction, or it has to be adapted in order to do so. Still, the client side would just be a thread holding the incoming request open and waiting for the server to signal that it finished the compaction.

Any chance we could keep the compact() method agnostic of being run in a thread or on the main thread?

Yes, just lock the backup file when writing data, or when reading metadata.

But all of the above requires SocketBackend's add_change() and compact() methods to be asynchronous. Maybe it can be done with an async module or such, but to me it doesn't seem worth the effort and added complexity.

That makes the compaction synchronous again, i.e., the db_write hook will get locked out from the file until the compaction is done. I was thinking of having something like this:

import threading

class AsyncBackend(Backend):
  def __init__(self, inner: Backend):
    self.inner = inner  # The actual backend we're using to store the data
    self.lock = threading.Lock()  # Allows the compaction to grab control of the file while doing the atomic swap

  def add_change(self, change):
    with self.lock:
      self.inner.add_change(change)

  def compact(self, request):
    def do_compact():
      self.inner.compact()
      request.set_result("All done")
    threading.Thread(target=do_compact, daemon=True).start()

Though maybe the compaction must be split into two steps: prepare_compact(), which can run in parallel while accepting db_writes, and a second, much shorter step compact() that needs exclusive control over the storage medium, i.e., the atomic file swap in the FileBackend case.
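
Such a split could look roughly like this (a sketch only; prepare_compact() and the copy_up_to/catch_up/atomic_swap helpers are hypothetical, not current FileBackend methods):

class FileBackend(Backend):
    def prepare_compact(self):
        # Long-running part: copy everything up to a snapshotted data_version
        # into a clone while db_write keeps being served.
        self.target_version = self.latest_data_version()
        self.clone = self.copy_up_to(self.target_version)

    def compact(self):
        # Short exclusive part: catch the clone up and swap it in atomically.
        with self.lock:
            self.catch_up(self.clone, self.latest_data_version())
            self.atomic_swap(self.clone)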

Anyway, I like discussing these more architectural things, but happy to merge as is to unblock others waiting on this ^^

report size in bytes and number of changes before and after, which should be more informative

Agreed, but who needs this info? As a user (myself) I just want my backup not to fill up my USB or SSD drive; it got to 7GB after a few weeks of running clboss. Maybe the backup-compact method can be removed entirely and instead an auto-compact option with a size parameter added, so that FileBackend starts a compaction when the backup passes a certain size.

Yeah, just thought that it'd be nice for the user to see progress (there's the custom notification support that can report progress to the CLI), or an indication that the compaction is still pending. I get very suspicious if my laptop burns through its battery even though the command I'd blame for it returned in just a second. Just cosmetics at this point really :-)

cdecker avatar Jan 09 '22 11:01 cdecker

Any chance we could keep the compact() method agnostic of being run in a thread or on the main thread?

Yes, just lock the backup file when writing data, or when reading metadata.

That makes the compaction synchronous again, i.e., the db_write hook will get locked out from the file until the compaction is done.

That is incorrect; the above comment should be read more precisely, looking at the code inside the with self.lock: context.

lock the backup file when writing data, https://github.com/lightningd/plugins/blob/df1c96ade1709e31d371a5be4b4579293f656c65/backup/filebackend.py#L76-L87

, or when reading metadata. https://github.com/lightningd/plugins/blob/df1c96ade1709e31d371a5be4b4579293f656c65/backup/filebackend.py#L171-L175

But I forgot to mention the last part where it also locks:

  • when adding catch-up changes to the clone that were added (to the backup) while compacting
  • during atomic swap https://github.com/lightningd/plugins/blob/df1c96ade1709e31d371a5be4b4579293f656c65/backup/filebackend.py#L228-L235

Though maybe the compaction must be split into two steps

I think it is already pretty much what you suggest here :-)

happy to merge as is to unblock others waiting on this ^^

As you wish, I already run this on my own node. But I was a little worried about users running the current backup.py on v0.10.2 pre-#4945 maybe getting "backup is out of date" errors, but I haven't seen these yet. The ones in issue 4994 seem to be caused by other reasons (power failure).

SimonVrouwe avatar Jan 09 '22 15:01 SimonVrouwe

Rebased on latest master (nothing else). There always appears to be some problem with the testing system, this time a network error?

Collecting pip
  WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fea8c01e6d0>: Failed to establish a new connection: [Errno 101] Network is unreachable')': /packages/83/b5/df8640236faa5a3cb80bfafd68e9fb4b22578208b8398c032ccff803f9e0/pip-22.0.2-py3-none-any.whl
  WARNING: Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fea8c01e820>: Failed to establish a new connection: [Errno 101] Network is unreachable')': /packages/83/b5/df8640236faa5a3cb80bfafd68e9fb4b22578208b8398c032ccff803f9e0/pip-22.0.2-py3-none-any.whl
  WARNING: Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fea8c01e970>: Failed to establish a new connection: [Errno 101] Network is unreachable')': /packages/83/b5/df8640236faa5a3cb80bfafd68e9fb4b22578208b8398c032ccff803f9e0/pip-22.0.2-py3-none-any.whl
  WARNING: Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fea8c01eac0>: Failed to establish a new connection: [Errno 101] Network is unreachable')': /packages/83/b5/df8640236faa5a3cb80bfafd68e9fb4b22578208b8398c032ccff803f9e0/pip-22.0.2-py3-none-any.whl
  WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fea8c01ec10>: Failed to establish a new connection: [Errno 101] Network is unreachable')': /packages/83/b5/df8640236faa5a3cb80bfafd68e9fb4b22578208b8398c032ccff803f9e0/pip-22.0.2-py3-none-any.whl
ERROR: Could not install packages due to an EnvironmentError: HTTPSConnectionPool(host='files.pythonhosted.org', port=443): Max retries exceeded with url: /packages/83/b5/df8640236faa5a3cb80bfafd68e9fb4b22578208b8398c032ccff803f9e0/pip-22.0.2-py3-none-any.whl (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fea8c01ed60>: Failed to establish a new connection: [Errno 101] Network is unreachable'))

Error: Process completed with exit code 1.

SimonVrouwe avatar Feb 03 '22 10:02 SimonVrouwe

Needs rebase, CI should be working now

chrisguida avatar Feb 07 '24 16:02 chrisguida