exanic-software icon indicating copy to clipboard operation
exanic-software copied to clipboard

Potential data race in the lock free circular buffer implementation

Open ujos opened this issue 3 years ago • 0 comments

I'm reading the implementation of the exanic_receive_frame function. Looks like there is a potential bug in the way how lock-free queue is implemented.

The implementation does not handle the case when consumer thread (thread that calls exanic_receive_frame function) reads the chunk and at the same time producer thread (FPGA) stores the new portion of data after new round of circle (generation is increased).

It has to be a very slow application to reproduce this issue.

Following is a sequence of actions that cause the bug:

  1. Consumer: checks the generation of the Chunk. It is equal to expected (Gen == 1)
  2. Consumer: reads first 16 bytes of the chunk
  3. Producer: made a lap, increased the geneartion and now writes exactly to the same chunk as Consumer reads
  4. Producer: updates the generation of the chunk (Gen = 2)
  5. Producer: writes 120 bytes to the chunk (Gen 2)
  6. Consumer reads the rest 104 bytes (120 (chunk size) -16) of the data published as Gen 2 data. So that the first 16 bytes are Gen1 bytes while the rest 104 bytes are Gen2 bytes (different Ethernet frame)

The lock free circular buffer usually is implemented in the following way:

Producer:

  1. Put PrevChunkSeqNum+1 to the generation of the chunk
  2. Put memory barrier - to ensure that PrevChunkSeqNum+1 is written before we write the data
  3. Write data
  4. Put memory barrier - to ensure that data is written before PrevChunkSeqNum+2 is set
  5. Writes PrevChunkSeqNum+2 to the generation
  6. PrevChunkSeqNum += 2

Consumer:

  1. Reads ChunkSeqNum. If value is odd, goto 1;
  2. If ChunkSeqNum value is less than expected, goto 1
  3. Put memory barrier - to ensure that we read ChunkSeqNum before data is read
  4. Read data
  5. Put memory barrier - to ensure we read data before the final ChunkSeqNum check
  6. Read ChunkSeqNum and compare to what we read at step 1. If it is changed, means that chunk is updated while we read the data; thus goto 1 and start from scratch

ujos avatar May 17 '21 10:05 ujos