Potential RaceCondition in CS-LPC and CS-LPP in WriteApprovalCallbacks
When a Feature for a LocalEntity has multiple WriteApprovalCallbacks (e.g. LPC+LPP) and a command that requires write approval is sent for one of these features, depending on the order in which the WriteApprovalCallbacks are executed a nil pointer dereference can occur causing a segmentation fault.
In general the segmentation fault occurs when the ApproveOrDenyWrite method is called on a ship-go FeatureLocalInterface, the passed *Message is not complete (i.e. contains only a valid MsgCounter) and "completes" the number of required approvals such that processWrite is then called with an incomplete *Message.
One potential fix for this issue would be splitting the public ApproveOrDenyProductionLimit messages into a public and private part where the private part directly takes a *shipapi.Message which it can pass to ApproveOrDenyWrite and can be called from the WriteApprovalCallback. Another potential fix would be to adapt ApproveOrDenyWrite in ship-go to cache the "most valid" message and pass that through to processWrite instead of passing the last message.
If we decide on the first option, we should definitely document the API contract for ApproveOrDenyWrite that it only accepts "valid" messages and may crash when passed incomplete Messages.
Backtrace
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x30 pc=0x8ca891]
goroutine 181 [running]:
github.com/enbility/spine-go/spine.(*FeatureLocal).processWrite(0xc000000300, 0xc000632008)
/home/simon/projects/intern/02_eebus/eebus-servers/vendor/github.com/enbility/spine-go/spine/feature_local.go:674 +0xb1
github.com/enbility/spine-go/spine.(*FeatureLocal).ApproveOrDenyWrite(0xc000000300, 0xc000632008, {0x0, 0x0})
/home/simon/projects/intern/02_eebus/eebus-servers/vendor/github.com/enbility/spine-go/spine/feature_local.go:236 +0x3e5
github.com/enbility/eebus-go/usecases/cs/lpp.(*LPP).ApproveOrDenyProductionLimit(0xc0002b2cc0, 0x128, 0x1, {0x0, 0x0})
/home/simon/projects/intern/02_eebus/eebus-servers/vendor/github.com/enbility/eebus-go/usecases/cs/lpp/public.go:151 +0x246
created by github.com/enbility/eebus-go/usecases/cs/lpp.(*LPP).loadControlWriteCB in goroutine 122
/home/simon/projects/intern/02_eebus/eebus-servers/vendor/github.com/enbility/eebus-go/usecases/cs/lpp/usecase.go:148 +0x325
Illustrative Example
- A device registers the usecases cs/lpc and cs/lpp which both register their loadControlWriteCBs.
- The device receives an LPC command and the WriteApprovalCallbacks for both use cases are entered in separate goroutines "simultaneously" (LPC, LPP)
- The LPC handler proceeds through line 140 and calls the EventCB registered on the LPC use-case.
- The LPC EventCB calls ApproveOrDenyConsumptionLimit which passes through to line 159 and calls ApproveOrDenyWrite
- ApproveOrDenyWrite runs through until line 215 where since it's the first handler out of two handlers total it sets
r.writeApprovalReceived[*msg.RequestHeader.MsgCounter] = 1and returns (1 < 2 line 222) - The LPP handler then proceeds through to line 148 because limitId != item.LimitId for all entries in the LoadControlLimitData and calls ApproveOrDenyProductionLimit with the MsgCounter (and only the MsgCounter of the message)
- ApproveOrDenyProductionLimit continues until line 143 where it enters the !ok branch because there are no pending LPP limits.
- ApproveOrDenyProductionLimit then creates a new spineapi.Message which only has RequestHeader.MsgCounter set (all other fields are nil) and calls ApproveOrDenyWrite
- ApproveOrDenyWrite continues until line 215 sets
r.writeApprovalReceived[*msg.RequestHeader.MsgCounter] = 1 + 1passes through the check on line 222 (2 >= 2) and continues until line 236 to call processWrite - processWrite calls executeWrite which calls
msg.Cmd.Data()which returnserrors.New("Data not found in Cmd")because Cmd is zero-initialized and all its fields are nil. - executeWrite then returns
model.NewErrorType(model.ErrorNumberTypeCommandNotSupported, err.Error())and execution continues in processWrite on line 674 with a call tomsg.FeatureRemote.Device().Sender().ResultError(msg.RequestHeader, r.Address(), err)which proceeds to segfault on calling the Device() method of the FeatureRemote interface which was zero-initialized to nil.
This segfault also occurs when the LPP handler completes first for an LPP command (then segfaulting during the LPC-handler). The segfault does not occur when the LPC handler completes second for an LPC command or the LPP handler completes second for an LPP command as then the actual message is retrieved from the pendingLimits map and passed to ApproveOrDenyWrite.
The simplest way to reproduce this is to replace lpp/usecase.go line 148 with
go func() {
time.Sleep(1 * time.Second)
e.ApproveOrDenyProductionLimit(*msg.RequestHeader.MsgCounter, true, "")
}()
And then send an LPC command while both LPC and LPP are active. This will guarantee that the LPP-Handler is called after the LPC-Handler for all commands (which will segfault for LPC commands).
I've dropped an implementation of the first option in #105
Thank you @sthelen-enqs for this tremendous bug report and the PR! This looks awesome.
I'll try to look into it as soon as possible.
@sthelen-enqs, can you provide the affected eebus-go and ship revisions? I don't see this behavior. If you like we can also have a debug session to check the differences between our implementations. ;)
Greetings from Ulm
I could reproduce the issue with at least the current versions (github.com/enbility/eebus-go v0.6.1, github.com/enbility/ship-go v0.5.3, github.com/enbility/spine-go v0.6.2). I can try creating a minimal reproducer for the issue
I threw up a branch with a minimal reproducer for the segmentation fault here: https://github.com/ConsolinnoEnergy/eebus-go/tree/test/issue-104-race-condition-reproducer on top of cmd/hems. To reproduce the crash you can run this (slightly) modified hems runner twice on the same machine (or the same network) and both nodes should crash ~2s after the nodes connect to each other (the LPC-client can also be swapped out with any other compliant LPC-client).
The crashing command output should then look something like this:
% go run cmd/hems/main.go 2345 $(<scratch/client/ski) scratch/server/cert.pem scratch/server/key.pem
2024-08-19 09:53:34 INFO Local SKI: 7fb83ba42d6b4bdf66c76bb7e4191c01389a2e3b
2024/08/19 09:53:36 OnLPCEvent: cs-lpc-UseCaseSupportUpdate
2024/08/19 09:53:36 OnLPCEvent: cs-lpc-DataUpdateHeartbeat
2024/08/19 09:53:37 OnLPCEvent: cs-lpc-WriteApprovalRequired
Approving LPC write with msgCounter 64 and limit 5000 W
2024/08/19 09:53:37 Sending limit of 5000W to d:_n:Demo_HEMS-123456789:[1]:
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x30 pc=0x857391]
goroutine 209 [running]:
github.com/enbility/spine-go/spine.(*FeatureLocal).processWrite(0xc000216240, 0xc00062d608)
/home/simon/projects/intern/02_eebus/eebus-go/vendor/github.com/enbility/spine-go/spine/feature_local.go:674 +0xb1
github.com/enbility/spine-go/spine.(*FeatureLocal).ApproveOrDenyWrite(0xc000216240, 0xc00062d608, {0x0, 0x0})
/home/simon/projects/intern/02_eebus/eebus-go/vendor/github.com/enbility/spine-go/spine/feature_local.go:236 +0x3e5
github.com/enbility/eebus-go/usecases/cs/lpp.(*LPP).ApproveOrDenyProductionLimit(0xc000254b70, 0x40, 0x1, {0x0, 0x0})
/home/simon/projects/intern/02_eebus/eebus-go/usecases/cs/lpp/public.go:151 +0x246
github.com/enbility/eebus-go/usecases/cs/lpp.(*LPP).loadControlWriteCB.func1()
/home/simon/projects/intern/02_eebus/eebus-go/usecases/cs/lpp/usecase.go:151 +0x4c
created by github.com/enbility/eebus-go/usecases/cs/lpp.(*LPP).loadControlWriteCB in goroutine 208
/home/simon/projects/intern/02_eebus/eebus-go/usecases/cs/lpp/usecase.go:149 +0x318
exit status 2
go run cmd/hems/main.go 5432 $(<scratch/server/ski) scratch/client/cert.pem scratch/client/key.pem
2024-08-19 09:53:34 INFO Local SKI: 84fee6b0109a940e6ae24ff18ee0c8bfa6810514
2024/08/19 09:53:36 OnLPCEvent: cs-lpc-UseCaseSupportUpdate
2024/08/19 09:53:37 Sending limit of 5000W to d:_n:Demo_HEMS-123456789:[1]:
2024/08/19 09:53:37 OnLPCEvent: cs-lpc-WriteApprovalRequired
Approving LPC write with msgCounter 63 and limit 5000 W
2024/08/19 09:53:38 OnLPCEvent: cs-lpc-DataUpdateHeartbeat
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x30 pc=0x857391]
goroutine 292 [running]:
github.com/enbility/spine-go/spine.(*FeatureLocal).processWrite(0xc000216240, 0xc0003d2588)
/home/simon/projects/intern/02_eebus/eebus-go/vendor/github.com/enbility/spine-go/spine/feature_local.go:674 +0xb1
github.com/enbility/spine-go/spine.(*FeatureLocal).ApproveOrDenyWrite(0xc000216240, 0xc0003d2588, {0x0, 0x0})
/home/simon/projects/intern/02_eebus/eebus-go/vendor/github.com/enbility/spine-go/spine/feature_local.go:236 +0x3e5
github.com/enbility/eebus-go/usecases/cs/lpp.(*LPP).ApproveOrDenyProductionLimit(0xc000254b70, 0x3f, 0x1, {0x0, 0x0})
/home/simon/projects/intern/02_eebus/eebus-go/usecases/cs/lpp/public.go:151 +0x246
github.com/enbility/eebus-go/usecases/cs/lpp.(*LPP).loadControlWriteCB.func1()
/home/simon/projects/intern/02_eebus/eebus-go/usecases/cs/lpp/usecase.go:151 +0x4c
created by github.com/enbility/eebus-go/usecases/cs/lpp.(*LPP).loadControlWriteCB in goroutine 291
/home/simon/projects/intern/02_eebus/eebus-go/usecases/cs/lpp/usecase.go:149 +0x318
exit status 2
hey,
we were able to reproduce this for LPP as well, not however for LPC so far
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x30 pc=0x49c6c0]
goroutine 359 [running]:
github.com/enbility/spine-go/spine.(*FeatureLocal).processWrite(0x4000136480, 0x400015a008)
github.com/enbility/[email protected]/spine/feature_local.go:674 +0xa0
github.com/enbility/spine-go/spine.(*FeatureLocal).ApproveOrDenyWrite(0x4000136480, 0x400015a008, {0x0, 0x0})
github.com/enbility/[email protected]/spine/feature_local.go:236 +0x478
github.com/enbility/eebus-go/usecases/cs/lpc.(*LPC).ApproveOrDenyConsumptionLimit(0x400017bb60, 0xc3f5, 0x1, {0x0, 0x0})
github.com/enbility/[email protected]/usecases/cs/lpc/public.go:151 +0x260
created by github.com/enbility/eebus-go/usecases/cs/lpc.(*LPC).loadControlWriteCB in goroutine 314
github.com/enbility/[email protected]/usecases/cs/lpc/usecase.go:148 +0x344
EDIT: The error appears only after LPC has been triggered. it's super easy to reproduce
Thanks @DAMEK86. Where you also able to test the suggested PR ?
ups, I tried it a few minutes ago with the latest keo ctrlbox implementation using
replace github.com/enbility/eebus-go => github.com/ConsolinnoEnergy/eebus-go v0.0.0-20240815114518-5c7fc4dc655b.
No crash with the changes so far -> nice work @sthelen-enqs!
Thanks for all this @sthelen-enqs ! Much appreciated