sonic-swss-common icon indicating copy to clipboard operation
sonic-swss-common copied to clipboard

Reduce unnecessary producerStateTable publish notifications

Open jipanyang opened this issue 6 years ago • 5 comments

Signed-off-by: Jipan Yang [email protected]

The typical use case that this PR is trying to deal with is related to route flapping.

Assuming such a scenario: <1> 4K routes learned on an interface.

<2> interface down, 4k routes withdraw. <3> fpmsyncd inserts 4k route keys into producerStateTabk keyset and generates 4k redis publish notifications

<4> Interface goes up

<4.A> 
Interface may go up and learn the 4k routes back  before orchagent has chance to pick up any notification and process the keyset.  That is fine, change in https://github.com/Azure/sonic-swss-common/pull/257 handled this case, no new redis publish notification will be created since the route keys already exist in producerStateTabk keyset and are to be processed. 

<4.B> 
Interface may go up and learn the 4k routes back after orchagent has processed a few notifications but not all the redis notifications (say, 1k notification processed),  with the first notification,  consumer in orchagent batch popped all the 4k route keys from producerStateTabk keyset,  (orchagent has set popBatchSize as 8192),  for the other 999 notifications processed, orchagent just call consumer_state_table_pops.lua 999 times without seeing any real data.
4k - 1k = 3k more redis notifications pending in redis queue now.

<5> Interface goes down

<5.A>  If <4.A> was hit during previous interface up, good, no new redis publish notification will be generated

<5.B>  If <4.B> was hit during previous interface up. 4K more redis notifications put into the redis queue.

<6> Interface goes up, loop back to <4>

The exact scenarios vary, but from <4> to <6>, number of pending notifications keeps increasing.

The issue is that orchagent may be abled to handle all the route changes with the batch processing (popBatchSize = 8192), redis publish notification was generated for each key and orchagent has to respond to each of the notifications with current select framework though most of them bring no value.

The changes in this PR has two parts:

<1> Only generate redis publish notification when absolutely necessary. In prodcueStateTable, generate notification for the first key in an empty keyset. In consumerStateTable, consume all the data in keyset upon redis notification. If by any chance, the number of keys in keyset is larger than popBatchSize, generate one notification to itself to come back again.

<2> consumerStateTable.pop() processing pop() method is not used by orchagent. but some unit test cases only. User of consumerStateTable.pop() process one key for each notification. So the work around is we explicitly create m_buffer.size() of redis publish notifications for select() frame to pickup all the buffered keys.

jipanyang avatar Feb 01 '19 23:02 jipanyang

This is the implementation done together with @zhenggen-xu

jipanyang avatar Feb 01 '19 23:02 jipanyang

@jipanyang , can you tell us which one you prefer?

lguohan avatar Feb 02 '19 00:02 lguohan

@lguohan This one probabally will be able to handle the most extreme redis usage scenario, but the change requires more careful review.

I need to check and fix the pyext part, forgot to run this test locally.

jipanyang avatar Feb 02 '19 00:02 jipanyang

Could you explain the problem and the high level design in this PR? That may help us understand your code.

qiluo-msft avatar Feb 05 '19 22:02 qiluo-msft

@qiluo-msft problem description and high level design updated.

jipanyang avatar Feb 12 '19 01:02 jipanyang