lnd
multi: log additional information for local force closures
Change Description
Follow up to: https://github.com/lightningnetwork/lnd/pull/7787
Fixes: #7696
cc: @ziggie1984 for review
@Chinwendu20 can you please fix all the linter issues first: make lint or golangci-lint run
I am aware there is a lint issue. This is because the PR depends on https://github.com/lightningnetwork/lnd/pull/8057, so I added a replace directive for tlv temporarily; when a new tag for tlv is released and bumped in lnd, I will remove it. This is just for review's sake. I do not know if there are any other lint issues: make lint and golangci-lint run time out on my system even when using multiple cores, but I see no lint issues on the CI, although the workflow has not been approved. I will soon fix the file conflicts as well, but hopefully that should not get in the way of reviews.
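For reference, a temporary override like that is just a replace directive in go.mod; the target below is a placeholder local checkout for illustration, not the exact directive used in this branch:

// go.mod (placeholder target shown for illustration only)
replace github.com/lightningnetwork/lnd/tlv => ../tlv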
I thought about something like this:
type OptionalForceCloseInfo struct {
    userInitiated    *bool
    linkFailureError *string
    htlcActions      map[string][]HTLC
}

func (f *OptionalForceCloseInfo) Apply(optionalForceCloseFields ...OptionalForceCloseField) {
    for _, optionalForceCloseField := range optionalForceCloseFields {
        optionalForceCloseField(f)
    }
}

=> add the OptionalForceCloseInfo to the ChanArbitrator

type OptionalForceCloseField func(*OptionalForceCloseInfo)

func UserInitiated(userInitiated bool) OptionalForceCloseField {
    return func(f *OptionalForceCloseInfo) {
        f.userInitiated = &userInitiated
    }
}

func LinkFailureError(linkFailureError string) OptionalForceCloseField {
    return func(f *OptionalForceCloseInfo) {
        f.linkFailureError = &linkFailureError
    }
}
record the infos you want to add where necessary?
....
func (b *boltArbitratorLog) LogLocalForceCloseInfo(
    info channeldb.LocalForceCloseInfo) error {

    // Log the infos when the channel is closed ...
    ...
}
Log the infos when the channel close transaction is broadcast, and provide the information not only when channels are completely closed but also when they are pending close, for example channels which are in waiting_close_channels or pending_force_closing_channels?
This approach would reduce the db operations to only one when we combine the ForceClose infos.
Let me know what you think or whether this is helpful; if you need a more detailed picture, I can provide it as well.
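A rough sketch of what a single combined write could look like; the struct, key, and serialization helper names here are assumptions for illustration, not existing lnd APIs:

func (b *boltArbitratorLog) logCombinedForceCloseInfo(
    info *channeldb.LocalForceCloseInfo) error {

    // Serialize the fully combined info once ...
    var buf bytes.Buffer
    if err := serializeForceCloseInfo(&buf, info); err != nil { // assumed helper
        return err
    }

    // ... and persist it with a single db update under an assumed key.
    return kvdb.Update(b.db, func(tx kvdb.RwTx) error {
        scopeBucket, err := tx.CreateTopLevelBucket(b.scopeKey[:])
        if err != nil {
            return err
        }
        return scopeBucket.Put(localForceCloseInfoKey, buf.Bytes())
    }, func() {})
}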
Log infos when the channel is broadcasted
Do you mean when the channel is marked as open? @ziggie1984
Do you mean when the channel is marked as open?
I updated the comment, was referring to the broadcasting of the channel close transaction.
I am aware there is a lint issue. This is because the PR depends on https://github.com/lightningnetwork/lnd/pull/8057, so I added a replace directive for tlv temporarily; when a new tag for tlv is released and bumped in lnd, I will remove it. This is just for review's sake. I do not know if there are any other lint issues: make lint and golangci-lint run time out on my system even when using multiple cores, but I see no lint issues on the CI, although the workflow has not been approved. I will soon fix the file conflicts as well, but hopefully that should not get in the way of reviews.
FYI you can create a PR on your fork of the lnd repo in github and then the CI tests will run without you needing permission every time
Took a look at the PR (first commit only for now), good work 🙌
Definitely a Concept ACK. In terms of the approach, I think we can do even better, especially in terms of future enhancements.
Overall I think we can already combine the infos when a force-close happens, meaning that we could always add the HTLC action map. Why restrict it to closures which are explicitly triggered by the chainTrigger? Meaning: why not add the detailed state of the HTLCs when the user or any link failure triggered the FC?
Are you saying that, beyond indicating that a force close was caused by a user (using a bool) and indicating the link failure error when the cause of the force close is a link failure, the localForceCloseInfo should also provide the htlcs associated with the channel as well?
Are you saying that, beyond indicating that a force close was caused by a user (using a bool) and indicating the link failure error when the cause of the force close is a link failure, the localForceCloseInfo should also provide the htlcs associated with the channel as well?
Exactly, why not record this info as well when it's there? It could also paint a more detailed picture when a link_error happens for whatever reason. Currently, when the user manually force closes or the link fails, the gained info is very low; adding the htlc action map could reveal some potential problems in hindsight. Example: a user cannot co-op close a channel because of an unclean channel state, and now he does not want to wait until an htlc times out but rather goes for the manual force close of the channel. Let's say there was a remote dangling htlc; one would not see this in the resolutions which already exist in the close_summary.
Moreover, it would also give us the possibility to enhance the information we show when channels are in a pending_state later on. Wdyt?
I will see about implementing this.
Another point which we should add to the FC info is the broadcast height of the FC, because that would fill the HTLC action map with more life, under the condition that we add more info about the HTLCs in the summary than just the hash (mainly the timeout block heights).
Please confirm whether this is the same as the broadcast height you refer to. It is present in the closed channel response already:
https://github.com/lightningnetwork/lnd/blob/e3c10c110f9d4946ea62aebb854c0eddc28fc888/lnrpc/lightning.proto#L1649-L1650
Please confirm whether this is the same as the broadcast height you refer to. It is present in the closed channel response already:
Not quite, this is the height at which the channel close transaction was actually confirmed in the blockchain. I was referring to the height at which the close tx was broadcast to the network, because this is the height we are most interested in when we analyse the HTLC action map in the future. The spend can sometimes happen several blocks later, especially for STATIC_REMOTE channels.
Now when we combine different infos, we see clearly that the current approach would always require a new DB transaction.
Could you please explain more about this? How would we always require a new DB transaction? Adding your new suggestion of including the htlcs and broadcast height to this struct, I would add a new struct, perhaps named ForceCloseInfoSummary:
type ForceCloseInfoSummary struct {
    HTLCActionMap   map[string][]HTLC
    BroadcastHeight int64
}

type OptionalForceCloseInfo struct {
    userInitiated    *bool
    linkFailureError *string
    chainAction      *bool
    summary          ForceCloseInfoSummary
}
Of userInitiated, linkFailureError and chainAction, only one would be present in the struct and consequently in the rpc response. So the way I envision it, there would be these DB transactions:
- To log whether the force close is due to userInitiated, linkFailureError or chainAction.
- To log the htlcs for the user trigger. (For chainAction these two steps could both be one, as they are both at the same site. We would know the force close is due to a chainAction at the same site where we would log the htlcs, unlike the others.)
- To log the broadcast height for the close tx.
This is the design that I am looking at; what are your thoughts?
I do not 100% understand your proposal; do you maybe already have something to look at in your local repo? Happy to look into your approach if you have something up.
I thought I would lay out my thoughts in more detail in the following (maybe it helps):
So I contemplated this PR; maybe these thoughts might be helpful:
- Because we cannot log the force-close info all in one place without passing arguments back and forth between functions, I think your initial idea of always logging the info where needed makes sense; otherwise we would need to introduce some variable and a mutex alongside it, which IMO is more complex than just hitting the db a couple of times. Moreover, we are talking about a code path which is only used when a channel is force closed, so adding more db transactions shouldn't be a problem. The current approach has a downside, because it will always overwrite the old bucket entry as soon as we try to log other information. If we want to make the local_force_close info more flexible, we should make sure we can log parts of the force-close info more than once without deleting the previous entry. We can do this in one of the following ways:
- Create a nested bucket for the LocalForceClose info and introduce a different key for every different piece of info.
- Or, before writing into the logs, check whether we already have something available and only update the other part.
I tend to go with 2, because then we can reuse the serialization and deserialization already implemented?
Based on choice 2, part of the code would look like this:
channeldb.go
type OptionalForceCloseInfo struct {
    // userInitiated describes whether a close was triggered by the user.
    userInitiated bool

    // linkFailureError ...
    linkFailureError string

    htlcActions map[string][]HTLC

    broadcastHeight uint64
}

func (f *OptionalForceCloseInfo) Apply(optionalForceCloseFields ...OptionalForceCloseField) {
    for _, optionalForceCloseField := range optionalForceCloseFields {
        optionalForceCloseField(f)
    }
}

type OptionalForceCloseField func(*OptionalForceCloseInfo)

func UserInitiated() OptionalForceCloseField {
    return func(f *OptionalForceCloseInfo) {
        f.userInitiated = true
    }
}

func LinkFailureError(linkFailureError string) OptionalForceCloseField {
    return func(f *OptionalForceCloseInfo) {
        f.linkFailureError = linkFailureError
    }
}

func HtlcActions(htlcActions map[string][]HTLC) OptionalForceCloseField {
    return func(f *OptionalForceCloseInfo) {
        f.htlcActions = htlcActions
    }
}

func BroadCastHeight(broadcastHeight uint64) OptionalForceCloseField {
    return func(f *OptionalForceCloseInfo) {
        f.broadcastHeight = broadcastHeight
    }
}
briefcase.go
func (b *boltArbitratorLog) LogLocalForceCloseInfo(
    info ...channeldb.OptionalForceCloseField) error {

    return kvdb.Update(b.db, func(tx kvdb.RwTx) error {
        scopeBucket, err := tx.CreateTopLevelBucket(b.scopeKey[:])
        if err != nil {
            return err
        }

        // Check if we already have a value available.
        infoBytes := scopeBucket.Get(localForceCloseInfoKey)

        var forceCloseInfo channeldb.OptionalForceCloseInfo
        if infoBytes != nil {
            infoReader := bytes.NewReader(infoBytes)
            forceCloseInfo, err = channeldb.DeserializeLocalCloseInfo(
                infoReader)
            if err != nil {
                return err
            }
        }

        forceCloseInfo.Apply(info...)

        var buf bytes.Buffer
        if err := channeldb.SerializeLocalCloseInfo(&buf, &forceCloseInfo); err != nil {
            return err
        }

        return scopeBucket.Put(localForceCloseInfoKey, buf.Bytes())
    }, func() {})
}
...
Then call this function with the appropriate infos where needed:
Example:
channel_arbitrator.go
// If this is a chain trigger, then we'll go straight to the
// next state, as we still need to broadcast the commitment
// transaction.
case chainTrigger:
    htlcMap := make(map[string][]channeldb.HTLC)
    for action, htlcs := range chainActions {
        htlcMap[action.String()] = htlcs
    }

    c.log.LogLocalForceCloseInfo(
        channeldb.HtlcActions(htlcMap),
        channeldb.BroadCastHeight(uint64(triggerHeight)),
    )

    fallthrough

case userTrigger:
    nextState = StateBroadcastCommit
chain_arbitrator.go
func (c *ChainArbitrator) ForceCloseContract(chanPoint wire.OutPoint,
    opt ...channeldb.OptionalForceCloseField) (*wire.MsgTx, error) {

    c.Lock()
    arbitrator, ok := c.activeChannels[chanPoint]
    c.Unlock()
    if !ok {
        return nil, fmt.Errorf("unable to find arbitrator")
    }

    log.Infof("Attempting to force close ChannelPoint(%v)", chanPoint)

    arbitrator.log.LogLocalForceCloseInfo(opt...)
    ....
and the call in, for example, rpcserver.go:
closingTx, err := chainArbitrator.ForceCloseContract(
    *chanPoint, channeldb.UserInitiated(),
)
Thanks a lot for the explanation @ziggie1984
I do not 100% understand your proposal; do you maybe already have something to look at in your local repo? Happy to look into your approach if you have something up.
I thought I would lay out my thoughts in more detail in the following (maybe it helps):
So I contemplated this PR; maybe these thoughts might be helpful:
- Because we cannot log the force-close info all in one place without passing arguments back and forth between functions, I think your initial idea of always logging the info where needed makes sense; otherwise we would need to introduce some variable and a mutex alongside it, which IMO is more complex than just hitting the db a couple of times. Moreover, we are talking about a code path which is only used when a channel is force closed, so adding more db transactions shouldn't be a problem. The current approach has a downside, because it will always overwrite the old bucket entry as soon as we try to log other information. If we want to make the local_force_close info more flexible, we should make sure we can log parts of the force-close info more than once without deleting the previous entry. We can do this in one of the following ways:
- Create a nested bucket for the LocalForceClose info and introduce a different key for every different piece of info.
- Or, before writing into the logs, check whether we already have something available and only update the other part.
I tend to go with 2, because then we can reuse the serialization and deserialization already implemented?
Based on choice 2, part of the code would look like this:
channeldb.go
type OptionalForceCloseInfo struct { // userInitiated describes whether a close was triggered by the user. userInitiated bool // linkFailureError ... linkFailureError string htlcActions map[string][]HTLC broadcastHeight uint64 }
I agree with this except for the structure of the struct; we would still have to indicate whether a force close was due to chain actions.
func (f *OptionalForceCloseInfo) Apply(optionalForceCloseFields ...OptionalForceCloseField) { for _, optionalForceCloseField := range optionalForceCloseFields { optionalForceCloseField(f) } }
It is not clear to me how this would be used, since there would only be one OptionalForceCloseInfo anyway, as the force close of a channel only happens once; would this be necessary?
This is one of the things I talked about in my previous comment, as I did not understand what you meant about taking another approach to reduce DB transactions.
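For reference, a minimal sketch of the behaviour in question, using the names proposed above (which do not exist in lnd yet): Apply only sets the fields whose options are passed, so calling it from two different sites would leave earlier fields intact.

// Illustrative only; OptionalForceCloseInfo and the option constructors
// are the ones proposed above, not existing lnd APIs.
var info OptionalForceCloseInfo

// First call site, e.g. the rpcserver marking a user initiated close.
info.Apply(UserInitiated())

// A later call site adding the HTLC action map; the userInitiated field
// set above is left untouched because no option overwrites it.
info.Apply(HtlcActions(map[string][]HTLC{
    "SomeAction": nil, // placeholder action name
}))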
type OptionalForceCloseField func(*OptionalForceCloseInfo)
func UserInitiated() OptionalForceCloseField { return func(f *OptionalForceCloseInfo) { f.userInitiated = true } }
func LinkFailureError(linkFailureError string) OptionalForceCloseField { return func(f *OptionalForceCloseInfo) { f.linkFailureError = linkFailureError } }
func HtlcActions(htlcActions map[string][]HTLC) OptionalForceCloseField { return func(f *OptionalForceCloseInfo) { f.htlcActions = htlcActions } }
func BroadCastHeight(broadcastHeight uint64) OptionalForceCloseField { return func(f *OptionalForceCloseInfo) { f.broadcastHeight = broadcastHeight } }
briefcase.go
func (b *boltArbitratorLog) LogLocalForceCloseInfo( info ...channeldb.OptionalForceCloseField) error {
return kvdb.Update(b.db, func(tx kvdb.RwTx) error { scopeBucket, err := tx.CreateTopLevelBucket(b.scopeKey[:]) if err != nil { return err }
// Check if we already have a value available. infoBytes := scopeBucket.Get(localForceCloseInfoKey) var forceCloseInfo channeldb.OptionalForceCloseInfo if infoBytes != nil { infoReader := bytes.NewReader(infoBytes) forceCloseInfo, err = channeldb.DeserializeLocalCloseInfo( infoReader) if err != nil { return err } } forceCloseInfo.Apply(info...) var buf bytes.Buffer if err := channeldb.SerializeLocalCloseInfo(&buf, &forceCloseInfo); err != nil { return err } return scopeBucket.Put(localForceCloseInfoKey, buf.Bytes())
}, func() {}) }
...
Then call this function with the appropriate Infos where appropriate: Example: channel_arbitrator.go
// If this is a chain trigger, then we'll go straight to the // next state, as we still need to broadcast the commitment // transaction. case chainTrigger: htlcMap := make(map[string][]channeldb.HTLC) for action, htlcs := range chainActions { htlcMap[action.String()] = htlcs } c.log.LogLocalForceCloseInfo( channeldb.HtlcActions(htlcMap), channeldb.BroadCastHeight(uint64(triggerHeight))) fallthrough
One of the things I was trying to pass by you in my previous comment is that we might be adding a function here as well to log the htlcs due to userTrigger.
case userTrigger: nextState = StateBroadcastCommit
I was trying to understand what you were saying about this current approach needing multiple DB transactions when we include htlcActions for userTrigger and broadcast height as part of the info. I was just pointing out the number of times we would be logging into the localInfo store:
- To log whether the force close is due to userInitiated, linkFailureError or chainAction.
- To log the htlcs for userTrigger or chainTrigger. (For chainAction these two steps could both be one, as they are both at the same site. We would know the force close is due to a chainAction at the same site where we would log the htlcs, unlike the others.)
- To log the broadcast height for the close tx.
All these happen at different sites and that is why we cannot log multiple parts of the info at the same time.
Ok, understood. So how do you want to make sure you do not overwrite infos you already saved into the logs in a prior state? Because the current approach flushes the LocalForceCloseInfo as a whole every time it is called.
I think you mentioned before that there are possibilities where you can log more than one info at a time, for example the chainActions + broadcastHeight (it's not exactly the height where the tx is published, but the point in time where we decided to switch away from the default value).
That's why I came up with this design:
c.log.LogLocalForceCloseInfo(
    channeldb.HtlcActions(htlcMap),
    channeldb.BroadCastHeight(uint64(triggerHeight)),
)
Only an idea from my side, feel free to follow your ideas.
Yeah I think the chainAction boolean makes sense besides the other ones.
Ok, understood. So how do you want to make sure you do not overwrite infos you already saved into the logs in a prior state? Because the current approach flushes the LocalForceCloseInfo as a whole every time it is called.
I think I would explore your idea of creating a different key for each info, then see how it goes.
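A minimal sketch of that per-key idea, assuming a hypothetical nested bucket name and helper (none of these identifiers exist in lnd today):

func (b *boltArbitratorLog) logForceCloseInfoField(key, value []byte) error {
    return kvdb.Update(b.db, func(tx kvdb.RwTx) error {
        scopeBucket, err := tx.CreateTopLevelBucket(b.scopeKey[:])
        if err != nil {
            return err
        }

        // Hypothetical nested bucket holding one entry per piece of
        // force close info, so later writes never clobber earlier ones.
        infoBucket, err := scopeBucket.CreateBucketIfNotExists(
            []byte("local-force-close-info"),
        )
        if err != nil {
            return err
        }

        return infoBucket.Put(key, value)
    }, func() {})
}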
I think you mentioned before that there are possibilities where you can log more than one info at a time, for example the chainActions + broadcastHeight (it's not exactly the height where the tx is published, but the point in time where we decided to switch away from the default value).
chainActions bool + HTLC map
That's why I came up with this design:
c.log.LogLocalForceCloseInfo( channeldb.HtlcActions(htlcMap), channeldb.BroadCastHeight(uint64(triggerHeight)))
Only an idea from my side, feel free to follow your ideas.
Yeah I think the chainAction boolean makes sense besides the other ones.
Thanks a lot, I think I will start working on the PR now and then get more feedback from you. I just wanted to be sure that I am not completely off track. Thanks a lot.
Hello @ziggie1984 this is ready for review
Would be nice to view the rpcserver logs on the CI to help debug itests. This particular failing test works locally.
Would be nice to view the rpcserver logs on the CI to help debug itests. This particular failing test works locally.
You can just download the zip-file for the logs and look at the particular tests which are failing and analyse the corresponding log-files.
Just see the details of one test => then click on Summary in the top left corner and scroll down to the zip files. Then pick the related test which is failing.
Hello @ziggie1984, I have fixed the failing tests, thanks a lot.
Nice follow-up work on this PR. I reviewed this PR and I think we somehow introduce too much complexity with this approach. In particular, this functional option pattern, type ForceCloseContractOption func(*ChannelArbitrator), is not the best pattern to tackle this problem (IMO). Normally this pattern is used to provide optional configuration to structs (see https://golang.cafe/blog/golang-functional-options-pattern.html). In this PR, however, we are only using it as a means of accessing the ArbitratorLog, and I think that's not the best way and makes it very inflexible, wdyt? I think the best way would be to make it more streamlined and work with arguments instead of the functional type, wdyt?
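As a generic illustration of that usual use of the pattern (toy code, not from lnd): the options exist purely to tweak a struct's configuration on top of defaults.

type dialConfig struct {
    timeout time.Duration
    retries int
}

type dialOption func(*dialConfig)

func WithTimeout(d time.Duration) dialOption {
    return func(c *dialConfig) { c.timeout = d }
}

func WithRetries(n int) dialOption {
    return func(c *dialConfig) { c.retries = n }
}

// newDialConfig starts from sensible defaults and lets callers override
// only what they care about, e.g. newDialConfig(WithTimeout(time.Minute)).
func newDialConfig(opts ...dialOption) *dialConfig {
    cfg := &dialConfig{timeout: 30 * time.Second, retries: 3}
    for _, opt := range opts {
        opt(cfg)
    }
    return cfg
}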
The struct of the FCInfo can mostly stay the same, something like:
type FCInfo struct {
    // Reason is the reason why the channel was force-closed.
    Reason FCReason

    // ChainActions holds the HTLC action map associated with the
    // local force close.
    ChainActions ChainActionMap

    // BroadcastHeight is the height at which the close transaction
    // associated with the force close was broadcast.
    BroadcastHeight uint32

    // AdditionalInfo holds additional information like the detailed link
    // error which caused the channel to force close.
    AdditionalInfo string
}
Then the interface which logs all these different types in a kv db, which can be a sub-interface of the ArbitratorLog:
type FCInfoLog interface {
    LogFCReason(r FCReason) error
    LogChainActions(c ChainActionMap) error
    LogBroadcastHeight(h uint32) error
    LogAdditionalInfo(s string) error
    FetchFCInfo() (FCInfo, error)
}

type FCReason uint8

const (
    User FCReason = iota
    Chain
    LinkError
)
And then we would just call these functions where applicable, for example:
func (c *ChainArbitrator) ForceCloseContract(chanPoint wire.OutPoint, r FCReason,
    additionalInfo string) (*wire.MsgTx, error) {

    c.log.LogFCReason(r)
    c.log.LogAdditionalInfo(additionalInfo)
In the stateStep function it would look like this:
case chainTrigger:
    c.log.LogFCReason(Chain)
    c.log.LogChainActions(chainActions)
    nextState = StateBroadcastCommit

case userTrigger:
    nextState = StateBroadcastCommit
When the channel is closed, just fetch the db information rather than adding a new element to the OpenChannel struct, which is never used except to populate the ChannelCloseSummary; therefore this information can just be fetched directly?
func (c *OpenChannel) CloseChannel(summary *ChannelCloseSummary,
    statuses ...ChannelStatus) error {

    ....
    fcInfo, err := FetchFCInfo()
    summary.FCInfo = fcInfo
Keep in mind this is just my opinion; I think it would be great to have a more experienced developer take a look at this and decide whether we are fine with the current design. If this direction is the way we want to go, I will review this PR in more detail (@saubyk).
Thanks @ziggie1984, the complexity you speak of was not just introduced in this follow-up PR; it is the same approach the original PR author used: https://github.com/lightningnetwork/lnd/pull/7787
The functional option you pointed at was applauded by @guggero in the previous review: https://github.com/lightningnetwork/lnd/pull/7787/commits/5b5e44d31168f409198c2afbaeef1b213f655871#r1317203180
The aim of this follow-up PR was just to store data in tlv format, as per this review comment: https://github.com/lightningnetwork/lnd/pull/7787#pullrequestreview-1613203780
But I went ahead and added info for the force close as suggested by you, though the original author highlighted that that could be addressed in follow-up PRs.
Nice follow-up work on this PR. I reviewed this PR and I think we somehow introduce too much complexity with this approach. In particular, this functional option pattern, type ForceCloseContractOption func(*ChannelArbitrator), is not the best pattern to tackle this problem (IMO). Normally this pattern is used to provide optional configuration to structs (see https://golang.cafe/blog/golang-functional-options-pattern.html). In this PR, however, we are only using it as a means of accessing the ArbitratorLog, and I think that's not the best way and makes it very inflexible, wdyt? I think the best way would be to make it more streamlined and work with arguments instead of the functional type, wdyt?
The struct of the FCInfo can mostly stay the same, something like:
type FCInfo struct { // The reason why an FC was force-closed. Reason FCReason // ChainActions holds the HTLC action chain map associated with the // local force close. ChainActions ChainActionMap // BroadcastHeight is the height in which the close transaction // associated force close is broadcasted. BroadcastHeight uint32 // AdditionalInfo holds additional information like the detailed link // error which caused the channel to force close. AdditionalInfo string }
Then the interface which logs all this different types in a kv db which can be a subinterface of the ArbitratorLog:
type FCInfoLog interface { LogFCReason(r FCReason) error LogChainActions(c ChainActionMap) error LogBroadcastHeight(h uint32) error LogAdditionalInfo(s string) error FetchFCInfo() (FCInfo, error) }
It is unclear to me how this interface would be useful.
type FCReason uint8 const ( User = iota Chain LinkError )
This was my initial approach, but because of the various link failure errors, I decided to just represent the reason with a string instead of going the route of logging additional infos, because it is not additional info per se; all the link failure errors are still reasons, so why not represent them as such instead of just as "link failure error".
And then we would just call this function where applicable for example:
func (c *ChainArbitrator) ForceCloseContract(chanPoint wire.OutPoint, r FCReason, additionalInfo string) (*wire.MsgTx, error) {
c.log.LogFCReason() c.log.LogAdditionalInfo()
In the stateStep function it would look like this:
case chainTrigger: c.log.updateFCReason(Chain) c.log.CommitActions(c) nextState = StateBroadcastCommit case userTrigger: nextState = StateBroadcastCommit
When the channel is closed, just fetch the db information rather than adding a new element to the `OpenChannel` struct, which is never used except to populate the ChannelCloseSummary; therefore this information can just be fetched directly?
Okay, I added an extra field to waiting close because we talked about making the local force close info available for waiting close channels: https://github.com/lightningnetwork/lnd/pull/8072#issuecomment-1761470662. Waiting close channels are still stored in the same bucket as open channels; while fetching them we need a place to store the info before converting to lnrpc format, just like the other fields in the OpenChannel struct.
func (c *OpenChannel) CloseChannel(summary *ChannelCloseSummary, statuses ...ChannelStatus) error { .... fcInfo, err := FetchFCInfo() summary.FCInfo = fcInfo
Keep in mind this is just my opinion; I think it would be great to have a more experienced developer take a look at this and decide whether we are fine with the current design. If this direction is the way we want to go, I will review this PR in more detail (@saubyk).
First and foremost, thanks for the hard work you put into the PR and especially for fixing all the itests, which can be very tiresome when they pass locally 💪. In terms of how we want to proceed with this PR, from my point of view I definitely think it makes sense to split this PR into smaller chunks, making it way easier to oversee all the potential side effects which can be introduced when, for example, changing the OpenChannel struct, which should be done very carefully because it touches a lot of areas, so being cautious makes sense. In addition, all the work you put on top of the initial PR can be reused, especially the tlv logic you introduced, so this work will be valuable.
Just a small example I stumbled upon when playing around with this PR and the side effects of changing something in the OpenChannel struct; it must be done very carefully.
Currently the setup always prints force close insights information although the channels have never been force-closed.
2023-11-28 12:20:22.918 [WRN] CHDB: unable to fetch local force insights: no local force close insight found
2023-11-28 12:20:22.919 [WRN] CHDB: unable to fetch local force insights: no local force close insight found
Moreover, I think we can agree that it is currently very difficult to reuse data types, for example the type ChainActionMap, which is an important question to solve before introducing almost identical types. A lot of boltdb logic currently lives in the briefcase.go file, so we need to find the best way to come up with a nice design without making the codebase even more complex.
So I think answering the following questions should bring us to a better design overall:
- Decide whether we want to first refactor and pull everything related to the boltdb out of the briefcase.go logic into a new file in the channeldb dir; this will make it easy to reuse types for the upcoming change and make the switch to the sql db smooth.
- Which information do we want to persist? I think the LinkError (the detailed error string), the reason for the force close (LinkError, UserInitiated or ChainInitiated) and the ChainAction map are three different things to look at. In my opinion it makes sense to add a LinkError field into the OpenChannel struct, because we currently do not always FC when there is an unrecoverable LinkError, so users would just see an inactive channel and would need to look through their logs to actually find out that a LinkError occurred. The reason for an FC can be separated from the link error and can just be a simple enum type with currently 3 options (LinkError, UserInitiated or ChainInitiated). And the persisting of the ChainActions can be separated as well. This could guide how we want to split this PR, maybe?
- Maybe first show the FCInfo data only for already closed channels (not the waiting close ones) to reduce complexity.
- Add this information also to the waiting_close_channel part as a final point.
Happy to hear other developers opinions :), keep up the good work 🤝 @Chinwendu20
First and foremost, I applaud your attempt to review this PR. Kindly confirm that you are not playing around with an old version of this PR, as per your example of a flaw in modifying the OpenChannel struct. I have already addressed the issue; there is currently no part of this PR that logs that, but point me to it if I somehow missed it.
Even if that were the case, it is not a pointer that the OpenChannel struct should not be modified, but rather that the particular situation of logging a "local insight not found" should be handled properly.
I also do not think that I am duplicating the ChainAction map type in this PR.
I think you can open a separate issue if you have suggestions on refactoring briefcase.go. I personally do not think it is needed for this particular PR, as per the reasons that you stated, but that discussion would be much more appropriate in a separate issue.
On your point 2: I missed the part where it is a better advantage to state that an FC was due to a link failure error specifically, as opposed to just stating the particular link failure error itself, as this PR does now. Also, do you have issues with how this PR is split?
On points 3 and 4: what complexity are you referring to?
I do not understand your point 4.
Nice attempt at the review @ziggie1984, but I think going through the original PR and the review on it, and then going through my previous comments, would help with some of your concerns.
assigned to @ProofOfKeags for additional approach ack here
Alright. Let's zoom out for a second and remind ourselves why this is important.
All of this info is being captured for the express purpose of debugging unexpected force-closures. Unexpected force-closures can happen for two reasons:
- The remote peer force closed the channel and we saw it show up on chain.
- We automatically force closed the channel due to a bug in our own logic and we want to capture a lot of that context to figure out wtf went wrong.
Note that the "user initiated" case is fairly irrelevant. Yes we will want to capture that just because it is good to know but it isn't the primary function of this work. In both of the other cases we want to know what our role in the force close is. Doing this for the remote close case is probably intractable without some system for detailed protocol traces and even then there are still reasons the other party might try and force close. This leaves the case of the LinkFailureError. This is the most important case. Any design approach should optimize for making sure we captured as detailed of information we can in this case.
~~Keeping this in mind, I'm in full agreement with @ziggie1984 that we should be capturing the ChainActionMap for the LinkError case. Secondly, because this is simply to give ourselves as much evidence as possible, we need not prioritize atomicity of all of the information. We'd rather get something than nothing, even if that something is incomplete. So we can break up the different "logs" if it helps ensure we get more information more of the time.~~
~~To weigh in on @ziggie1984 's 4 questions:~~
~~1. I think that extracting out all of the DB stuff from the briefcase.go and putting it into channeldb is correct.~~
~~2. I think that those 3 pieces of information are great, with the greatest emphasis being placed on the LinkError case.~~
~~3. Yeah this would be a good first step. The goal is to get a good autopsy, we don't necessarily need that autopsy fast, but we do need it to be comprehensive.~~
~~4. Once we have the first 3 things, I say go for it.~~
~~Concept ACK, but I think the Approach needs to be reworked a bit to move the types that require db persistence into the channeldb package.~~
Please disregard. I misread a lot of this PR and need to go back and look through it more carefully.
Alright. Let's zoom out for a second and remind ourselves why this is important.
All of this info is being captured for the express purpose of debugging unexpected force-closures. Unexpected force-closures can happen for two reasons:
- The remote peer force closed the channel and we saw it show up on chain.
- We automatically force closed the channel due to a bug in our own logic and we want to capture a lot of that context to figure out wtf went wrong.
Note that the "user initiated" case is fairly irrelevant. Yes we will want to capture that just because it is good to know but it isn't the primary function of this work. In both of the other cases we want to know what our role in the force close is. Doing this for the remote close case is probably intractable without some system for detailed protocol traces and even then there are still reasons the other party might try and force close. This leaves the case of the LinkFailureError. This is the most important case. Any design approach should optimize for making sure we captured as detailed of information we can in this case.
Keeping this in mind, I'm in full agreement with @ziggie1984 that we should be capturing the ChainActionMap for the LinkError case.
Thanks, but this PR already does that; it captures the ChainActionMap for all cases.
Secondly, because this is simply to give ourselves as much evidence as possible, we need not prioritize atomicity of all of the information. We'd rather get something than nothing, even if that something is incomplete. So we can break up the different "logs" if it helps ensure we get more information more of the time.
Can you please explain this a bit more? What I understand is that you want us to break down all the info that we are getting (I do not see why that should be a priority), but this PR does that already. The reason for initiating the force close, the chain actions, and the broadcast height are all logged separately.
To weigh in on @ziggie1984 's 4 questions:
- I think that extracting out all of the DB stuff from the briefcase.go and putting it into channeldb is correct.
Do you mean this DB stuff: https://github.com/lightningnetwork/lnd/blob/3906a44f17abfb4b7ffb3963de6515accec89017/contractcourt/briefcase.go#L1174-L1207
- I think that those 3 pieces of information are great, with the greatest emphasis being placed on the LinkError case.
Okay, there is no concern to address here then.
- Yeah this would be a good first step. The goal is to get a good autopsy, we don't necessarily need that autopsy fast, but we do need it to be comprehensive.
- Once we have the first 3 things, I say go for it.
Concept ACK, but I think the Approach needs to be reworked a bit to move the types that require db persistence into the channeldb package.
Thanks, but this PR already does that; it captures the ChainActionMap for all cases.
My mistake. I got mixed up catching up on the conversation vs the code side. It appears you are correct.
Can you please explain this a bit more? What I understand is that you want us to break down all the info that we are getting (I do not see why that should be a priority), but this PR does that already. The reason for initiating the force close, the chain actions, and the broadcast height are all logged separately.
This is more a statement for clarity, not a specific action request item. I was asked to come in and talk about the overall approach and I am just trying to say with this statement that if we log these different aspects independently, that's fine.
It appears that I need to take a closer look at this change because I can see you are right from the code and your comment @Chinwendu20. I got mixed up between the conversation and the most up-to-date code.