The end of processing a message is captured before the message is acked in the queueing system, causing the processing time and critical time audit headers to be incorrect
Since the processing time and critical time headers on audit messages are captured when the pipeline completes, they don't take into account the time it takes to:
- Push outgoing messages to the transport
- Complete the transport transaction (storage can be enlisted in it, so storage operation times can be included here as well)
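
To make the problem concrete, here is a rough sketch of how a consumer of audit messages (such as ServiceInsight) derives these values from the headers. The header keys are the standard NServiceBus ones; the wire date format is an assumption and may differ between versions:

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

static class AuditTimings
{
    // Assumed NServiceBus wire format for date headers; verify per version.
    const string WireFormat = "yyyy-MM-dd HH:mm:ss:ffffff Z";

    static DateTime Parse(string value) =>
        DateTime.ParseExact(value, WireFormat, CultureInfo.InvariantCulture,
            DateTimeStyles.AdjustToUniversal);

    public static void Compute(IReadOnlyDictionary<string, string> headers)
    {
        var timeSent = Parse(headers["NServiceBus.TimeSent"]);
        var started = Parse(headers["NServiceBus.ProcessingStarted"]);
        var ended = Parse(headers["NServiceBus.ProcessingEnded"]);

        // Both results are underestimates: "ended" is stamped when the
        // pipeline completes, before outgoing messages are dispatched and
        // before the transport transaction is committed / the message acked.
        var processingTime = ended - started;
        var criticalTime = ended - timeSent;

        Console.WriteLine($"Processing time: {processingTime}, critical time: {criticalTime}");
    }
}
```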
Unfortunately, there is no way to fix this unless we start to transfer audit data out of band of the audited message, so I think deprecating those headers and guiding users towards https://github.com/Particular/NServiceBus.Metrics is a better idea.
This flaw impacts ServiceInsight since it shows processing time and critical time based on those headers.
Forked off https://github.com/Particular/NServiceBus/issues/3494. Labeled as core v6
This issue has no task force and no plan of attack. Is there anything more to do here other than wait for the associated PRs to be reviewed and merged by the appropriate maintainers?
After triage today we decided to remove the v6 launch label since this is not a must-fix for the release. Description updated.
Reopening; I had put "fix #thisissue" in the pull request text by mistake
@andreasohlund can we implement that as a minor release? Is it really Impact: M?
Yes, we can do it in a minor (the pull request needs to be reverted to have the perfcounters on by default)
But this means we'd go live with no reliable way of monitoring processing and critical time. Are we happy with that?
@Particular/tf-v6-launch-plat-dev: @mikeminutillo and I talked; can we consider not including these counters in v6 and adding them in a minor release or even a separate package?
> can we consider not including these counters in v6 and adding them in a minor release or even a separate package?
Are you saying to ship v6 without the CriticalTime and SLA counters?
How about shipping with not so accurate counters?
> How about shipping with not so accurate counters?
That is a possibility. The snag (which I guess we could potentially live with) is that the accuracy of the perfcounters was an argument for living with the inaccurate values we stamp on the audit messages:
https://github.com/Particular/NServiceBus/issues/3494#issuecomment-187632515 (see bullet point 2)
So in essence we no longer have any accurate counters for critical time in v6. Are we OK with this?
I know critical time and SLA are part of the ADSD course, but do we know how many are actually using this value to do anything meaningful with it?
> Are you saying to ship v6 without the CriticalTime and SLA counters?
Yes. This is exactly what we were asking / suggesting. If they are not accurate, are they even useful? I was hoping you would tell me why not including them is a bad idea. :smile:
@danielmarbach @boblangley - I've dealt with a lot of desk / support cases, and when dealing with issues regarding performance, I always ask the customers to check these values. Comes in handy for troubleshooting.
> How about shipping with not so accurate counters?
How inaccurate are they? Is it better to tell a lie that's close to the truth or to remain silent in this case? Is it worth getting interested parties together for a triage call to decide what to do about this one?
@mikeminutillo - regarding your question, currently this is what we have in our doco here: http://docs.particular.net/nservicebus/operations/performance-counters#nservicebus-performance-counters-critical-time
> Versions 6 and above: The value recorded in the TimeSent header is the time when the call to send the message is executed, not the actual time when the message was dispatched to the transport infrastructure. Since the outgoing messages in handlers are sent as a batched operation, depending on how long the message handler takes to complete, the actual dispatch may happen later than the time recorded in the TimeSent header. For operations outside of handlers the recorded sent time is accurate.
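
To illustrate the passage above: with v6's batched dispatch, sends from a handler only hit the wire after the handler completes. A minimal sketch (the message types are hypothetical; `RequireImmediateDispatch` is the documented opt-out):

```csharp
using System;
using System.Threading.Tasks;
using NServiceBus;

// Hypothetical message types for illustration only.
public class OrderAccepted : ICommand { }
public class ChargeCreditCard : ICommand { }
public class ShipOrder : ICommand { }

public class OrderAcceptedHandler : IHandleMessages<OrderAccepted>
{
    public async Task Handle(OrderAccepted message, IMessageHandlerContext context)
    {
        // TimeSent is stamped at this call...
        await context.Send(new ChargeCreditCard());

        // ...but with batched dispatch the message only leaves the endpoint
        // after Handle completes, so slow work here widens the gap.
        await Task.Delay(TimeSpan.FromSeconds(5));

        // Opting out of batching for a single message narrows the gap:
        var options = new SendOptions();
        options.RequireImmediateDispatch();
        await context.Send(new ShipOrder(), options);
    }
}
```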
Talked to @danielmarbach and @boblangley and we decided to not include this in v6 since:
- The inaccuracy is only really significant for broker transports where latency is an issue (ASB and ASQ)
- On Azure we don't have a good way to install the counters anyway
- We can fix this in a minor
I will raise a docs pull to clarify this inaccuracy (TBD - link)
Removing the consider for v6 label
We currently have a customer with stability issues. I wanted to suggest setting up performance monitoring and keeping track of the "Processing time" metric. The problem is probably caused by RavenDB's session.SaveChanges() slowly increasing in duration over time, and I wanted to prove this, as "Processing time" should reflect that build-up. After investigating, it seems that "Processing time" is based on timing the IIncomingPhysicalMessageContext stage. This excludes critical processing durations:
- The incoming message retrieval duration
- Message dispatching duration, which is already identified in this issue
- Outbox save duration
- Incoming message transport commit / acknowledge duration
Due to this I, unfortunately, cannot use this metric.
I also very much question the usefulness of "Processing time" in its current state if all of this is excluded. Of the slow operations above, only the saga instance retrieval duration is currently included.
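
For what it's worth, a closer approximation can be had today with a custom behavior wrapping the outermost pipeline stage. A sketch (the registration line is illustrative); note it still cannot observe the transport commit/acknowledge, which happens outside the pipeline:

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;
using NServiceBus;
using NServiceBus.Pipeline;

// Times the whole transport receive stage, which covers more than the
// built-in "Processing time" (e.g. outbox save and batched dispatch,
// depending on version), but still not the transport commit/ack.
class ReceiveTimingBehavior : Behavior<ITransportReceiveContext>
{
    public override async Task Invoke(ITransportReceiveContext context, Func<Task> next)
    {
        var stopwatch = Stopwatch.StartNew();
        try
        {
            await next(); // runs the rest of the pipeline
        }
        finally
        {
            stopwatch.Stop();
            Console.WriteLine($"Message {context.Message.MessageId} handled in {stopwatch.Elapsed}");
        }
    }
}

// Registration during endpoint setup (illustrative):
// endpointConfiguration.Pipeline.Register(new ReceiveTimingBehavior(),
//     "Times the whole transport receive stage");
```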
Since the performance counters are driven by the metrics package I have raised https://github.com/Particular/NServiceBus.Metrics/issues/174 to track the fix there and refocused this issue on the message headers that we include in audited messages.
Unfortunately, the headers can't be fixed until we change the core to use a model where audit data is transmitted out of band of the processed message, since we currently have to capture the time as we're emitting the audit message, which is too soon.
Given that we can guide users to the metrics package to integrate with ServiceControl or their monitoring system of choice, I think we should consider deprecating the processing time and critical time headers.
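
For reference, a minimal sketch of the suggested direction, assuming the NServiceBus.Metrics and NServiceBus.Metrics.ServiceControl APIs (method names should be checked against the package docs):

```csharp
using System;
using NServiceBus;

class EndpointSetup
{
    public static void Configure(EndpointConfiguration endpointConfiguration)
    {
        // Enable the metrics package so processing/critical time come from
        // measured probes rather than the inaccurate audit headers.
        var metrics = endpointConfiguration.EnableMetrics();

        // Report to the ServiceControl monitoring instance; the queue name
        // and interval here are placeholders.
        metrics.SendMetricDataToServiceControl(
            "Particular.Monitoring",
            TimeSpan.FromSeconds(10));
    }
}
```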