The end of processing a message is captured before the message is acked in the queueing system, causing the processing time and critical time audit headers to be incorrect
Since the processing time and critical time headers on audit messages are captured when the pipeline completes, they don't take into account the time it takes to:
- Push outgoing messages to the transport
- Complete the transport transaction (storage can be enlisted in it, so storage operation times can be included here as well)
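
To make the problem concrete, here is a rough sketch of how a consumer of audit messages (such as ServiceInsight) derives these values from the headers. The header keys are the standard NServiceBus ones; the wire date format is an assumption and may differ between versions:

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

static class AuditTimings
{
    // Assumed NServiceBus wire format for date headers; verify per version.
    const string WireFormat = "yyyy-MM-dd HH:mm:ss:ffffff Z";

    static DateTime Parse(string value) =>
        DateTime.ParseExact(value, WireFormat, CultureInfo.InvariantCulture,
            DateTimeStyles.AdjustToUniversal);

    public static void Compute(IReadOnlyDictionary<string, string> headers)
    {
        var timeSent = Parse(headers["NServiceBus.TimeSent"]);
        var started = Parse(headers["NServiceBus.ProcessingStarted"]);
        var ended = Parse(headers["NServiceBus.ProcessingEnded"]);

        // Both results are underestimates: "ended" is stamped when the
        // pipeline completes, before outgoing messages are dispatched and
        // before the transport transaction is committed / the message acked.
        var processingTime = ended - started;
        var criticalTime = ended - timeSent;

        Console.WriteLine($"Processing time: {processingTime}, critical time: {criticalTime}");
    }
}
```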
Unfortunately, there is no way to fix this unless we start to transfer audit data out of band of the audited message, so I think deprecating those headers and guiding users towards https://github.com/Particular/NServiceBus.Metrics is a better idea.
This flaw impacts ServiceInsight since it shows processing time and critical time based on those headers.
Forked off https://github.com/Particular/NServiceBus/issues/3494. Labeled as core v6
This issue has no task force and no plan of attack. Is there anything more to do here other than wait for the associated PRs to be reviewed and merged by the appropriate maintainers?
After triage today we decided to remove the v6 launch label since this is not a must-fix for the release. Description updated.
Reopening; I had put "fix #thisissue" in the pull request text by mistake
@andreasohlund can we implement that as a minor release? Is it really Impact: M?
Yes, we can do it in a minor (the pull request needs to be reverted to have the perfcounters on by default)
But this means we'd go live with no reliable way of monitoring processing and critical time. Are we happy with that?
@Particular/tf-v6-launch-plat-dev: @mikeminutillo and I talked; can we consider not including these counters in v6 and adding them in a minor release or even a separate package?
> can we consider not including these counters in v6 and adding them in a minor release or even a separate package?
Are you saying to ship v6 without the CriticalTime and SLA counters?
How about shipping with not so accurate counters?
> How about shipping with not so accurate counters?
That is a possibility. The snag (which I guess we could potentially live with) is that the accuracy of the perfcounters was an argument for living with the inaccurate values we stamp on the audit messages:
https://github.com/Particular/NServiceBus/issues/3494#issuecomment-187632515 (see bullet point 2)
So in essence we no longer have any accurate counters for critical time in v6. Are we OK with this?
I know critical time and SLA are part of the ADSD course, but do we know how many are actually using this value to do anything meaningful with it?
> Are you saying to ship v6 without the CriticalTime and SLA counters?
Yes. This is exactly what we were asking / suggesting. If they are not accurate, are they even useful? I was hoping you would tell me why not including them is a bad idea. :smile:
@danielmarbach @boblangley - I've dealt with a lot of desk / support cases, and when dealing with issues regarding performance, I always ask the customers to check these values. Comes in handy for troubleshooting.
> How about shipping with not so accurate counters?
How inaccurate are they? Is it better to tell a lie that's close to the truth or to remain silent in this case? Is it worth getting interested parties together for a triage call to decide what to do about this one?
@mikeminutillo - regarding your question, currently this is what we have in our doco here: http://docs.particular.net/nservicebus/operations/performance-counters#nservicebus-performance-counters-critical-time
> Versions 6 and above: The value recorded in the TimeSent header is the time when the call to send the message is executed, not the actual time when the message was dispatched to the transport infrastructure. Since the outgoing messages in handlers are sent as a batched operation, depending on how long the message handler takes to complete, the actual dispatch may happen later than the time recorded in the TimeSent header. For operations outside of handlers the recorded sent time is accurate.
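
To illustrate the passage above: with v6's batched dispatch, sends from a handler only hit the wire after the handler completes. A minimal sketch (the message types are hypothetical; `RequireImmediateDispatch` is the documented opt-out):

```csharp
using System;
using System.Threading.Tasks;
using NServiceBus;

// Hypothetical message types for illustration only.
public class OrderAccepted : ICommand { }
public class ChargeCreditCard : ICommand { }
public class ShipOrder : ICommand { }

public class OrderAcceptedHandler : IHandleMessages<OrderAccepted>
{
    public async Task Handle(OrderAccepted message, IMessageHandlerContext context)
    {
        // TimeSent is stamped at this call...
        await context.Send(new ChargeCreditCard());

        // ...but with batched dispatch the message only leaves the endpoint
        // after Handle completes, so slow work here widens the gap.
        await Task.Delay(TimeSpan.FromSeconds(5));

        // Opting out of batching for a single message narrows the gap:
        var options = new SendOptions();
        options.RequireImmediateDispatch();
        await context.Send(new ShipOrder(), options);
    }
}
```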
Talked to @danielmarbach and @boblangley and we decided to not include this in v6 since:
- The inaccuracy is only really significant for broker transports where latency is an issue (ASB and ASQ)
- On Azure we don't have a good way to install the counters anyway
- We can fix this in a minor
I will raise a docs pull to clarify this inaccuracy (TBD - link)
Removing the consider for v6 label
We currently have a customer with stability issues. I wanted to suggest setting up performance monitoring and keeping track of the "Processing time" metric. The problem is probably caused by RavenDB's session.SaveChanges() slowly increasing in duration over time, and I wanted to prove this, as "Processing time" should reflect that build-up. After investigating, it seems that "Processing time" is based on timing the IIncomingPhysicalMessageContext stage. This excludes critical processing durations:
- The incoming message retrieval duration
- Message dispatching duration, which is already identified in this issue
- Outbox save duration
- Incoming message transport commit / acknowledge duration
Due to this I, unfortunately, cannot use this metric.
I also very much question the usefulness of "Processing time" in its current state if all of this is excluded. Of the slow operations above, only the saga instance retrieval duration is currently included.
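
For what it's worth, a closer approximation can be had today with a custom behavior wrapping the outermost pipeline stage. A sketch (the registration line is illustrative); note it still cannot observe the transport commit/acknowledge, which happens outside the pipeline:

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;
using NServiceBus;
using NServiceBus.Pipeline;

// Times the whole transport receive stage, which covers more than the
// built-in "Processing time" (e.g. outbox save and batched dispatch,
// depending on version), but still not the transport commit/ack.
class ReceiveTimingBehavior : Behavior<ITransportReceiveContext>
{
    public override async Task Invoke(ITransportReceiveContext context, Func<Task> next)
    {
        var stopwatch = Stopwatch.StartNew();
        try
        {
            await next(); // runs the rest of the pipeline
        }
        finally
        {
            stopwatch.Stop();
            Console.WriteLine($"Message {context.Message.MessageId} handled in {stopwatch.Elapsed}");
        }
    }
}

// Registration during endpoint setup (illustrative):
// endpointConfiguration.Pipeline.Register(new ReceiveTimingBehavior(),
//     "Times the whole transport receive stage");
```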
Since the performance counters are driven by the metrics package I have raised https://github.com/Particular/NServiceBus.Metrics/issues/174 to track the fix there and refocused this issue on the message headers that we include in audited messages.
Unfortunately, the headers can't be fixed until we change the core to use a model where audit data is transmitted out of band of the processed message, since we currently have to capture the time as we're emitting the audit message, which is too soon.
Given that we can guide users to the metrics package to integrate with ServiceControl or their monitoring system of choice, I think we should consider deprecating the processing time and critical time headers.
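
For reference, a minimal sketch of the suggested direction, assuming the NServiceBus.Metrics and NServiceBus.Metrics.ServiceControl APIs (method names should be checked against the package docs):

```csharp
using System;
using NServiceBus;

class EndpointSetup
{
    public static void Configure(EndpointConfiguration endpointConfiguration)
    {
        // Enable the metrics package so processing/critical time come from
        // measured probes rather than the inaccurate audit headers.
        var metrics = endpointConfiguration.EnableMetrics();

        // Report to the ServiceControl monitoring instance; the queue name
        // and interval here are placeholders.
        metrics.SendMetricDataToServiceControl(
            "Particular.Monitoring",
            TimeSpan.FromSeconds(10));
    }
}
```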