eebus-go icon indicating copy to clipboard operation
eebus-go copied to clipboard

Potential RaceCondition in CS-LPC and CS-LPP in WriteApprovalCallbacks

Open sthelen-enqs opened this issue 1 year ago • 6 comments

When a Feature for a LocalEntity has multiple WriteApprovalCallbacks (e.g. LPC+LPP) and a command that requires write approval is sent for one of these features, depending on the order in which the WriteApprovalCallbacks are executed a nil pointer dereference can occur causing a segmentation fault.

In general the segmentation fault occurs when the ApproveOrDenyWrite method is called on a ship-go FeatureLocalInterface, the passed *Message is not complete (i.e. contains only a valid MsgCounter) and "completes" the number of required approvals such that processWrite is then called with an incomplete *Message.

One potential fix for this issue would be splitting the public ApproveOrDenyProductionLimit messages into a public and private part where the private part directly takes a *shipapi.Message which it can pass to ApproveOrDenyWrite and can be called from the WriteApprovalCallback. Another potential fix would be to adapt ApproveOrDenyWrite in ship-go to cache the "most valid" message and pass that through to processWrite instead of passing the last message.

If we decide on the first option, we should definitely document the API contract for ApproveOrDenyWrite that it only accepts "valid" messages and may crash when passed incomplete Messages.

Backtrace

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x30 pc=0x8ca891]

goroutine 181 [running]:
github.com/enbility/spine-go/spine.(*FeatureLocal).processWrite(0xc000000300, 0xc000632008)
        /home/simon/projects/intern/02_eebus/eebus-servers/vendor/github.com/enbility/spine-go/spine/feature_local.go:674 +0xb1
github.com/enbility/spine-go/spine.(*FeatureLocal).ApproveOrDenyWrite(0xc000000300, 0xc000632008, {0x0, 0x0})
        /home/simon/projects/intern/02_eebus/eebus-servers/vendor/github.com/enbility/spine-go/spine/feature_local.go:236 +0x3e5
github.com/enbility/eebus-go/usecases/cs/lpp.(*LPP).ApproveOrDenyProductionLimit(0xc0002b2cc0, 0x128, 0x1, {0x0, 0x0})
        /home/simon/projects/intern/02_eebus/eebus-servers/vendor/github.com/enbility/eebus-go/usecases/cs/lpp/public.go:151 +0x246
created by github.com/enbility/eebus-go/usecases/cs/lpp.(*LPP).loadControlWriteCB in goroutine 122
        /home/simon/projects/intern/02_eebus/eebus-servers/vendor/github.com/enbility/eebus-go/usecases/cs/lpp/usecase.go:148 +0x325

Illustrative Example

  1. A device registers the usecases cs/lpc and cs/lpp which both register their loadControlWriteCBs.
  2. The device receives an LPC command and the WriteApprovalCallbacks for both use cases are entered in separate goroutines "simultaneously" (LPC, LPP)
  3. The LPC handler proceeds through line 140 and calls the EventCB registered on the LPC use-case.
  4. The LPC EventCB calls ApproveOrDenyConsumptionLimit which passes through to line 159 and calls ApproveOrDenyWrite
  5. ApproveOrDenyWrite runs through until line 215 where since it's the first handler out of two handlers total it sets r.writeApprovalReceived[*msg.RequestHeader.MsgCounter] = 1 and returns (1 < 2 line 222)
  6. The LPP handler then proceeds through to line 148 because limitId != item.LimitId for all entries in the LoadControlLimitData and calls ApproveOrDenyProductionLimit with the MsgCounter (and only the MsgCounter of the message)
  7. ApproveOrDenyProductionLimit continues until line 143 where it enters the !ok branch because there are no pending LPP limits.
  8. ApproveOrDenyProductionLimit then creates a new spineapi.Message which only has RequestHeader.MsgCounter set (all other fields are nil) and calls ApproveOrDenyWrite
  9. ApproveOrDenyWrite continues until line 215 sets r.writeApprovalReceived[*msg.RequestHeader.MsgCounter] = 1 + 1 passes through the check on line 222 (2 >= 2) and continues until line 236 to call processWrite
  10. processWrite calls executeWrite which calls msg.Cmd.Data() which returns errors.New("Data not found in Cmd") because Cmd is zero-initialized and all its fields are nil.
  11. executeWrite then returns model.NewErrorType(model.ErrorNumberTypeCommandNotSupported, err.Error()) and execution continues in processWrite on line 674 with a call to msg.FeatureRemote.Device().Sender().ResultError(msg.RequestHeader, r.Address(), err) which proceeds to segfault on calling the Device() method of the FeatureRemote interface which was zero-initialized to nil.

This segfault also occurs when the LPP handler completes first for an LPP command (then segfaulting during the LPC-handler). The segfault does not occur when the LPC handler completes second for an LPC command or the LPP handler completes second for an LPP command as then the actual message is retrieved from the pendingLimits map and passed to ApproveOrDenyWrite.

sthelen-enqs avatar Aug 15 '24 11:08 sthelen-enqs

The simplest way to reproduce this is to replace lpp/usecase.go line 148 with

	go func() {
		time.Sleep(1 * time.Second)
		e.ApproveOrDenyProductionLimit(*msg.RequestHeader.MsgCounter, true, "")
	}()

And then send an LPC command while both LPC and LPP are active. This will guarantee that the LPP-Handler is called after the LPC-Handler for all commands (which will segfault for LPC commands).

sthelen-enqs avatar Aug 15 '24 11:08 sthelen-enqs

I've dropped an implementation of the first option in #105

sthelen-enqs avatar Aug 15 '24 11:08 sthelen-enqs

Thank you @sthelen-enqs for this tremendous bug report and the PR! This looks awesome.

I'll try to look into it as soon as possible.

DerAndereAndi avatar Aug 15 '24 17:08 DerAndereAndi

@sthelen-enqs, can you provide the affected eebus-go and ship revisions? I don't see this behavior. If you like we can also have a debug session to check the differences between our implementations. ;)

Greetings from Ulm

DAMEK86 avatar Aug 15 '24 19:08 DAMEK86

I could reproduce the issue with at least the current versions (github.com/enbility/eebus-go v0.6.1, github.com/enbility/ship-go v0.5.3, github.com/enbility/spine-go v0.6.2). I can try creating a minimal reproducer for the issue

sthelen-enqs avatar Aug 15 '24 20:08 sthelen-enqs

I threw up a branch with a minimal reproducer for the segmentation fault here: https://github.com/ConsolinnoEnergy/eebus-go/tree/test/issue-104-race-condition-reproducer on top of cmd/hems. To reproduce the crash you can run this (slightly) modified hems runner twice on the same machine (or the same network) and both nodes should crash ~2s after the nodes connect to each other (the LPC-client can also be swapped out with any other compliant LPC-client).

The crashing command output should then look something like this:

% go run cmd/hems/main.go 2345 $(<scratch/client/ski) scratch/server/cert.pem scratch/server/key.pem
2024-08-19 09:53:34 INFO  Local SKI:  7fb83ba42d6b4bdf66c76bb7e4191c01389a2e3b
2024/08/19 09:53:36 OnLPCEvent: cs-lpc-UseCaseSupportUpdate
2024/08/19 09:53:36 OnLPCEvent: cs-lpc-DataUpdateHeartbeat
2024/08/19 09:53:37 OnLPCEvent: cs-lpc-WriteApprovalRequired
Approving LPC write with msgCounter 64 and limit 5000 W
2024/08/19 09:53:37 Sending limit of 5000W to d:_n:Demo_HEMS-123456789:[1]:
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x30 pc=0x857391]

goroutine 209 [running]:
github.com/enbility/spine-go/spine.(*FeatureLocal).processWrite(0xc000216240, 0xc00062d608)
        /home/simon/projects/intern/02_eebus/eebus-go/vendor/github.com/enbility/spine-go/spine/feature_local.go:674 +0xb1
github.com/enbility/spine-go/spine.(*FeatureLocal).ApproveOrDenyWrite(0xc000216240, 0xc00062d608, {0x0, 0x0})
        /home/simon/projects/intern/02_eebus/eebus-go/vendor/github.com/enbility/spine-go/spine/feature_local.go:236 +0x3e5
github.com/enbility/eebus-go/usecases/cs/lpp.(*LPP).ApproveOrDenyProductionLimit(0xc000254b70, 0x40, 0x1, {0x0, 0x0})
        /home/simon/projects/intern/02_eebus/eebus-go/usecases/cs/lpp/public.go:151 +0x246
github.com/enbility/eebus-go/usecases/cs/lpp.(*LPP).loadControlWriteCB.func1()
        /home/simon/projects/intern/02_eebus/eebus-go/usecases/cs/lpp/usecase.go:151 +0x4c
created by github.com/enbility/eebus-go/usecases/cs/lpp.(*LPP).loadControlWriteCB in goroutine 208
        /home/simon/projects/intern/02_eebus/eebus-go/usecases/cs/lpp/usecase.go:149 +0x318
exit status 2
go run cmd/hems/main.go 5432 $(<scratch/server/ski) scratch/client/cert.pem scratch/client/key.pem
2024-08-19 09:53:34 INFO  Local SKI:  84fee6b0109a940e6ae24ff18ee0c8bfa6810514
2024/08/19 09:53:36 OnLPCEvent: cs-lpc-UseCaseSupportUpdate
2024/08/19 09:53:37 Sending limit of 5000W to d:_n:Demo_HEMS-123456789:[1]:
2024/08/19 09:53:37 OnLPCEvent: cs-lpc-WriteApprovalRequired
Approving LPC write with msgCounter 63 and limit 5000 W
2024/08/19 09:53:38 OnLPCEvent: cs-lpc-DataUpdateHeartbeat
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x30 pc=0x857391]

goroutine 292 [running]:
github.com/enbility/spine-go/spine.(*FeatureLocal).processWrite(0xc000216240, 0xc0003d2588)
        /home/simon/projects/intern/02_eebus/eebus-go/vendor/github.com/enbility/spine-go/spine/feature_local.go:674 +0xb1
github.com/enbility/spine-go/spine.(*FeatureLocal).ApproveOrDenyWrite(0xc000216240, 0xc0003d2588, {0x0, 0x0})
        /home/simon/projects/intern/02_eebus/eebus-go/vendor/github.com/enbility/spine-go/spine/feature_local.go:236 +0x3e5
github.com/enbility/eebus-go/usecases/cs/lpp.(*LPP).ApproveOrDenyProductionLimit(0xc000254b70, 0x3f, 0x1, {0x0, 0x0})
        /home/simon/projects/intern/02_eebus/eebus-go/usecases/cs/lpp/public.go:151 +0x246
github.com/enbility/eebus-go/usecases/cs/lpp.(*LPP).loadControlWriteCB.func1()
        /home/simon/projects/intern/02_eebus/eebus-go/usecases/cs/lpp/usecase.go:151 +0x4c
created by github.com/enbility/eebus-go/usecases/cs/lpp.(*LPP).loadControlWriteCB in goroutine 291
        /home/simon/projects/intern/02_eebus/eebus-go/usecases/cs/lpp/usecase.go:149 +0x318
exit status 2

sthelen-enqs avatar Aug 19 '24 07:08 sthelen-enqs

hey,

we were able to reproduce this for LPP as well, not however for LPC so far

 panic: runtime error: invalid memory address or nil pointer dereference
 [signal SIGSEGV: segmentation violation code=0x1 addr=0x30 pc=0x49c6c0]
  
 goroutine 359 [running]:
 github.com/enbility/spine-go/spine.(*FeatureLocal).processWrite(0x4000136480, 0x400015a008)
         github.com/enbility/[email protected]/spine/feature_local.go:674 +0xa0
 github.com/enbility/spine-go/spine.(*FeatureLocal).ApproveOrDenyWrite(0x4000136480, 0x400015a008, {0x0, 0x0})
         github.com/enbility/[email protected]/spine/feature_local.go:236 +0x478
 github.com/enbility/eebus-go/usecases/cs/lpc.(*LPC).ApproveOrDenyConsumptionLimit(0x400017bb60, 0xc3f5, 0x1, {0x0, 0x0})
         github.com/enbility/[email protected]/usecases/cs/lpc/public.go:151 +0x260
 created by github.com/enbility/eebus-go/usecases/cs/lpc.(*LPC).loadControlWriteCB in goroutine 314
         github.com/enbility/[email protected]/usecases/cs/lpc/usecase.go:148 +0x344

EDIT: The error appears only after LPC has been triggered. it's super easy to reproduce

DAMEK86 avatar Aug 21 '24 06:08 DAMEK86

Thanks @DAMEK86. Where you also able to test the suggested PR ?

DerAndereAndi avatar Aug 27 '24 11:08 DerAndereAndi

ups, I tried it a few minutes ago with the latest keo ctrlbox implementation using replace github.com/enbility/eebus-go => github.com/ConsolinnoEnergy/eebus-go v0.0.0-20240815114518-5c7fc4dc655b.

No crash with the changes so far -> nice work @sthelen-enqs!

DAMEK86 avatar Aug 27 '24 12:08 DAMEK86

Thanks for all this @sthelen-enqs ! Much appreciated

DerAndereAndi avatar Aug 27 '24 13:08 DerAndereAndi