
services/horizon: Race condition in ingestion system

Open urvisavla opened this issue 1 year ago • 0 comments

During Horizon integration tests, we found a race condition in the ingestion system (#4987). Currently, it occurs only in tests, so we fixed the failing tests by adding a wait. We want to deliberate whether we should address the race condition itself and, if we decide to fix it, explore potential solutions.

Race condition analysis: The race condition occurs because the wg sync.WaitGroup in the ingestion system is accessed concurrently by two different goroutines.

In the runStateMachine function, s.wg.Add(1) is used to add a task to the WaitGroup. In the Shutdown function, s.wg.Wait() is used to wait for all tasks in the WaitGroup to complete.
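The shape described above can be reduced to a minimal sketch. Only wg, runStateMachine, and Shutdown come from this issue; the system type, newSystem, and the quit, started, and finished fields are hypothetical names used for illustration and are not Horizon's actual code. The started channel lets a caller wait until Add(1) has run, in the spirit of the wait that was added to the tests; without that receive, Add(1) and Wait() may run concurrently.

```go
package main

import (
	"fmt"
	"sync"
)

// system is a hypothetical stand-in for the ingestion system.
type system struct {
	wg       sync.WaitGroup
	quit     chan struct{} // hypothetical: tells the loop to stop
	started  chan struct{} // hypothetical: closed once Add(1) has run
	finished bool          // hypothetical: set just before the task ends
}

func newSystem() *system {
	return &system{
		quit:    make(chan struct{}),
		started: make(chan struct{}),
	}
}

// runStateMachine registers itself with the WaitGroup from inside the
// goroutine that runs it, which is the pattern described in the issue.
func (s *system) runStateMachine() {
	s.wg.Add(1)
	defer s.wg.Done()
	close(s.started) // Add(1) is now visible to anyone who receives from started

	<-s.quit // stands in for the state machine loop
	s.finished = true
}

// Shutdown asks the state machine to stop and waits for it to finish.
func (s *system) Shutdown() {
	close(s.quit)
	s.wg.Wait()
}

func main() {
	s := newSystem()
	go s.runStateMachine()

	// Workaround: wait until Add(1) has happened before shutting down.
	// Removing this receive reintroduces the concurrent Add/Wait.
	<-s.started

	s.Shutdown()
	fmt.Println("finished:", s.finished)
}
```

If the receive from started is removed, running the program with the race detector enabled (go run -race) may report the concurrent access, depending on scheduling.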

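While the Wait() and Done() methods of the WaitGroup are safe to call concurrently, an Add() that runs concurrently with Wait() is not: the sync package documentation requires that Add() calls with a positive delta made while the counter is zero happen before Wait(). The reason this issue hasn't been seen earlier is that wg.Add() is executed only once, before entering the state machine loop. Once the state machine starts, the only access to the WaitGroup is through Done() and Wait(), both of which are thread-safe.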

The write to the WaitGroup happens only once, at the very start of the ingestion system, and in all practical scenarios the Shutdown function is not called before that write has executed. In the test, however, we start Horizon and immediately shut it down, which results in concurrent access to the WaitGroup and causes the race condition.
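One commonly used way to rule out this ordering problem is to call Add(1) before the goroutine is spawned, so that the registration is complete by the time any later Shutdown can call Wait(). The sketch below illustrates that idea only; it is not a proposed patch for Horizon, and whether it fits depends on how the ingestion system is actually started. The system type, newSystem, Start, and the quit and finished fields are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// system is a hypothetical stand-in for the ingestion system.
type system struct {
	wg       sync.WaitGroup
	quit     chan struct{} // hypothetical: tells the loop to stop
	finished bool          // hypothetical: set just before the task ends
}

func newSystem() *system {
	return &system{quit: make(chan struct{})}
}

// Start (hypothetical) registers the task synchronously and only then
// spawns the goroutine, so Add(1) is ordered before any Wait() that is
// called after Start returns.
func (s *system) Start() {
	s.wg.Add(1)
	go s.runStateMachine()
}

// runStateMachine no longer calls Add; it only signals completion.
func (s *system) runStateMachine() {
	defer s.wg.Done()

	<-s.quit // stands in for the state machine loop
	s.finished = true
}

// Shutdown asks the state machine to stop and waits for it to finish.
func (s *system) Shutdown() {
	close(s.quit)
	s.wg.Wait()
}

func main() {
	s := newSystem()
	s.Start()
	s.Shutdown() // safe even immediately after Start
	fmt.Println("finished:", s.finished)
}
```

With this ordering, starting and immediately shutting down, as the test does, needs no extra wait, provided Shutdown is only called after Start has returned.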

urvisavla avatar Aug 01 '23 17:08 urvisavla