services/horizon: Race condition in ingestion system
During Horizon integration tests, we found a race condition in the ingestion system (#4987). It currently occurs only in tests, so we fixed the failing tests by adding a wait. We want to deliberate whether we should address the race condition itself and, if we decide to fix it, explore potential solutions.
Race condition analysis:
The race condition occurs because the wg sync.WaitGroup in the ingestion system is accessed concurrently by two different goroutines.
In the runStateMachine function, s.wg.Add(1) is used to add a task to the WaitGroup.
In the Shutdown function, s.wg.Wait() is used to wait for all tasks in the WaitGroup to complete.
While the Wait() and Done() methods of the WaitGroup are safe to call concurrently, Add() is not safe to call concurrently with Wait() when the counter is zero: the sync package documentation requires that such an Add() call happen before Wait(). The reason this issue hasn't been seen earlier is that wg.Add() is executed only once, at the very start of the ingestion system, before entering the state machine loop, and in all practical scenarios the Shutdown function is not called before that has happened. Once the state machine starts, the only access to the WaitGroup is through Done() and Wait(), both of which are thread-safe.
However, in the test we start Horizon and immediately shut it down, resulting in concurrent access to the WaitGroup and causing the race condition.
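As one possible direction for the discussion, here is a minimal sketch of the pattern with the Add(1) call moved out of the state machine goroutine and onto the caller's goroutine, so that it is ordered before any Shutdown issued afterwards. This is not Horizon's actual code: the system type, newSystem, and Start are hypothetical names invented for illustration, and the state machine loop is replaced by a placeholder that simply waits for cancellation.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// system is a hypothetical stand-in for the ingestion system; only the
// WaitGroup handling mirrors the analysis above.
type system struct {
	ctx    context.Context
	cancel context.CancelFunc
	wg     sync.WaitGroup
}

// newSystem is a hypothetical constructor used only by this sketch.
func newSystem() *system {
	ctx, cancel := context.WithCancel(context.Background())
	return &system{ctx: ctx, cancel: cancel}
}

// Start (hypothetical) registers the state machine with the WaitGroup on
// the caller's goroutine and only then launches it, so Add(1) is ordered
// before a Shutdown call that the same caller makes afterwards.
func (s *system) Start() {
	s.wg.Add(1)
	go s.runStateMachine()
}

// runStateMachine does not call Add itself in this sketch. In the racy
// pattern, Add(1) would sit here, inside the new goroutine, where it can
// run concurrently with Wait.
func (s *system) runStateMachine() {
	defer s.wg.Done()
	<-s.ctx.Done() // placeholder for the real state machine loop
}

// Shutdown cancels the context and waits for the state machine to exit.
func (s *system) Shutdown() {
	s.cancel()
	s.wg.Wait()
}

func main() {
	s := newSystem()
	s.Start()
	s.Shutdown() // immediate shutdown, as in the integration test
	fmt.Println("shutdown complete")
}
```

The sketch only helps if Shutdown is ordered after Start, for example by being called from the same goroutine. Whether the real entry points can be restructured this way is an open question for this issue. Running a variant of the sketch under the race detector (go run -race) is one way to compare the two placements of Add(1).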