
RFC: Consideration of "End of Sample" marking

Open earies opened this issue 3 months ago • 6 comments

As a follow-on to https://github.com/openconfig/reference/pull/231, which clarifies the behavior around SAMPLE subscriptions being "sample-only" and not interleaving other "actions" (e.g. delete), this raises another point: how should clients reconcile data for downstream systems when objects are removed between intervals?

Drawing parallels to prior sampling methods (e.g. a client-initiated SNMP get/get-next or get-next sequence - aka a walk - IPFIX, etc.), a client either receives some signal from the endpoint in a PDU (an EOM type), an error, or in some cases no indicator at all that the sample window is complete, making client-side reconciliation variable and sometimes based on trust or time boundaries.

Prior to raising a PR for any spec/IDL updates, this issue is a request for comments from both the implementation and operator communities on setting an indicator (e.g. sync_response) to mark the completion of each SAMPLE interval.

Options being:

  • Leverage sync_response for this purpose
  • Introduce a new message field

This can apply to:

  • SAMPLE mode subscriptions
  • When heartbeat_interval is triggered in ON_CHANGE or TARGET_DEFINED mode subscriptions

With an indication of when a SAMPLE interval is complete, a client can better identify data that no longer exists on the endpoint and perform the appropriate actions downstream.
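To make the reconciliation concrete, here is a minimal sketch (in Python, with hypothetical names - gNMI defines no such per-interval marker today) of how a client cache might use an end-of-sample indicator to detect removed objects:

```python
# Hypothetical sketch: reconciling a client-side cache with an
# end-of-sample marker. SampleCache and on_end_of_sample are
# illustrative names, not part of gNMI.

class SampleCache:
    """Tracks leaf values per path and drops any path that was absent
    from the most recently completed SAMPLE interval."""

    def __init__(self):
        self.values = {}   # path -> last received value
        self.seen = set()  # paths updated since the last marker

    def on_update(self, path, value):
        self.values[path] = value
        self.seen.add(path)

    def on_end_of_sample(self):
        # Anything not refreshed during this interval is assumed to no
        # longer exist on the endpoint; return it so the caller can
        # propagate deletes downstream.
        stale = set(self.values) - self.seen
        for path in stale:
            del self.values[path]
        self.seen.clear()
        return stale
```

With such a marker, the client's delete decision becomes deterministic per interval rather than resting on trust or time boundaries.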

earies avatar Oct 02 '25 23:10 earies

@rgwilton @jsterne @ncorran @dplore to comment (and tag others)

earies avatar Oct 03 '25 00:10 earies

I’m not convinced that introducing an explicit “End of Sample” indicator is the right solution here since ensuring non-overlapping sample windows can be complex and lead to unnecessary buffering on both server and client sides.

The core problem seems to be: How can a client tell whether previously received data is still valid, or if some of it has gone stale?

A few patterns already exist that can help address this:

  • Pair SAMPLE with ON_CHANGE (or ON_CHANGE + heartbeat interval): This allows clients to receive delete notifications. While not always possible (paths not supporting ON_CHANGE, or when subscribing to counters), it’s often effective. For list-based data, clients can subscribe to list keys in ON_CHANGE mode to detect deletions that invalidate child objects.

  • Use a cache with TTL: Clients can maintain a TTL slightly longer than the sample interval, automatically expiring values that no longer appear in subsequent samples. Prometheus follows this model: metrics that don’t show up in a new scrape are marked as stale after a grace period.

  • Observe repeated values across samples: When the same leaf is received in consecutive SAMPLE intervals, it suggests that the full dataset for that interval has been delivered. This doesn't provide a strict "end of sample" boundary and requires waiting until the next sample, but it is still useful.

  • POLL mode or repeated ONCE subscriptions: If we draw parallels to prior approaches, I would argue that gNMI already provides something similar in POLL mode. A client initiates a Subscribe RPC with POLL mode, receives a full sample and a sync_response, and can then issue subsequent Poll requests at regular intervals. Each Poll is bounded by a sync_response, effectively giving the client the “end of sample” semantics without needing a new indicator. I recognize that maybe POLL is not widely implemented, but the same behavior can be achieved with repeated ONCE subscriptions: The client has to create a new stream at each interval (slightly more expensive than POLL), and each request is completed with a sync_response.
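The TTL pattern in the second bullet can be sketched roughly as follows (illustrative Python, with a TTL set slightly longer than the sample interval, in the spirit of Prometheus staleness handling; names are not from any real collector):

```python
# Minimal sketch of the TTL-based staleness pattern described above.
# Entries not refreshed within ttl seconds are dropped, so values that
# stop appearing in subsequent samples expire automatically.

import time

class TTLCache:
    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl          # seconds; slightly > sample_interval
        self.clock = clock      # injectable clock for testing
        self.entries = {}       # path -> (value, last_seen)

    def update(self, path, value):
        self.entries[path] = (value, self.clock())

    def live(self):
        # Keep only entries refreshed within the TTL window; everything
        # else is considered stale and evicted.
        now = self.clock()
        self.entries = {p: (v, t) for p, (v, t) in self.entries.items()
                        if now - t <= self.ttl}
        return {p: v for p, (v, _) in self.entries.items()}
```

The trade-off is latency: a deleted object lingers in the cache for up to one TTL before it disappears, rather than being removed at a crisp interval boundary.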

karimra avatar Oct 03 '25 04:10 karimra

@earies @dplore @aashaikh I propose that sending a periodic sync_response is the best way to mark the end of a sample. It does no harm and introduces no backward-compatibility issues. Since we already use this method for Poll requests, why can't we use it for Subscribe requests?

ashu-ciena avatar Oct 08 '25 14:10 ashu-ciena

> @earies @dplore @aashaikh I propose sending periodic sync_response is the best way to mark the end of sample. It does not harm in any way and does not introduce any backward compatibility issues. When we use this method in Poll request, why can't we use it in Subscribe request?

One issue I see in reusing sync_response, and I believe one of the concerns @karimra raises, is that this boolean is the only field emitted in its SubscribeResponse. If a SubscribeRequest contains a list of Subscription messages with varying sample_interval values, there is no way to associate which datasets are "complete" - there needs to be a means to tie this marker back to the originating Subscription.

Today, if a list of Subscription messages carries different intervals, the initial reap occurs at ~T=0, so the sync_response=true sent once the entirety of the requested datasets is complete poses no problem - but subsequent datasets then immediately begin to drift out of alignment.

@karimra raises very good points here, and the intent of this issue is to spark exactly this discussion. The first question is "is this an issue?" and/or how it can be circumvented. The alternatives mentioned above are then what I would consider guidance, since implementors or consumers merely reading the specification often cannot glean this level of detail until diving into the full e2e pipeline.

POLL is another topic I had queued for a separate discussion, as you rightly point out its implementation status.

If we deem that a marker/association should not be added to the protocol itself, then I do think some implementation/collector guidance is in order to accompany the spec.

earies avatar Oct 09 '25 01:10 earies

@ElodinLaarz for some context around this issue, can you comment on how our collector handles SAMPLE mode streaming and why we are not dependent on an "end of sample" marker.

dplore avatar Oct 29 '25 00:10 dplore

It took me a while to understand how an end-of-collection marker would help. I suppose the intent is: "after you receive two end-of-collection markers, you can delete any notifications that you didn't receive in the most recent collection." But this is actually quite complicated, and it probably requires temporarily caching all updates from each individual collection as well.


As for what we do, we periodically generate a delete for the SAMPLE subscription container "less often than the collection rate."

e.g. if we subscribe to openconfig/interfaces/interface/state with a thirty-second cadence, we generate a delete for openconfig/interfaces/interface/state with a timestamp of now - c * 30 seconds, for some integer c, every c * 30 seconds.

This clears the cache of stale notifications with timestamps earlier than now - c * 30 seconds, but it has a problem: if values are sent with a hardware timestamp that rarely or never changes, then we actually delete "real"/up-to-date data from the cache until the next collection completes. So a smaller c does a better job of keeping the cache clear of stale notifications, while a larger c decreases the number of intervals over which up-to-date data is missing.
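The periodic purge described above might look roughly like this (an illustrative Python sketch, not the actual collector code; function and variable names are hypothetical):

```python
# Sketch of the periodic-delete approach: every c * interval seconds,
# drop cached notifications whose timestamp is older than
# now - c * interval.

def purge_stale(cache, now, interval, c):
    """cache maps path -> timestamp of the last notification.
    Removes and returns paths older than the cutoff. Note the trade-off
    from the comment above: entries carrying a rarely-changing hardware
    timestamp are purged too, even if the underlying data is current,
    and stay missing until the next collection completes."""
    cutoff = now - c * interval
    stale = [p for p, ts in cache.items() if ts < cutoff]
    for p in stale:
        del cache[p]
    return stale
```

Here c tunes the trade-off directly: the cutoff tightens as c shrinks (less staleness) and loosens as c grows (fewer false deletions of hardware-timestamped data).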

ElodinLaarz avatar Nov 03 '25 19:11 ElodinLaarz