
PHOENIX-7485: Add metrics for time taken during mutation plan creation and execution

Open sanjeet006py opened this issue 10 months ago • 14 comments

Summary of changes:

  • Added metrics to capture the time taken during mutation plan creation and execution for upserts and deletes.
  • Because mutation plan creation and execution for a single upsert/delete can complete in under 1 millisecond, the metrics are captured in nanoseconds. With PHOENIX-7484 it was observed that even mutation plan creation and execution take less than a millisecond per call. If a regression pushed them to 1 millisecond per call, the per-call cost would look negligible in milliseconds, but at the scale of 2M rows being upserted/deleted it is huge. Tracking at finer granularity makes it possible to catch sub-millisecond regressions that manifest as large regressions for end users.
  • Added support for computing elapsed time in nanoseconds in EnvironmentEdgeManager and EnvironmentEdge.
  • Added metrics to track the time taken by an executeMutation call in nanoseconds. Metrics already exist that track the time taken by upserts and deletes within executeMutation, but at millisecond granularity; changing those to nanosecond granularity would be a breaking, non-backward-compatible change. A sketch of the measurement approach follows this list.
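The sketch below illustrates the kind of measurement the change is aiming for: an injectable, EnvironmentEdge-style time source that exposes a nanosecond reading alongside the millisecond one. The names and shape are illustrative assumptions for this post, not the actual Phoenix API.

```java
// Minimal sketch of nanosecond elapsed-time measurement behind an injectable
// clock, in the spirit of EnvironmentEdge/EnvironmentEdgeManager.
// All names here are illustrative, not the real Phoenix classes.
public class NanoClockSketch {

    /** Injectable time source so tests can substitute a fake clock. */
    interface TimeSource {
        long currentTimeMillis();
        long nanoTime();
    }

    static final TimeSource SYSTEM_CLOCK = new TimeSource() {
        @Override public long currentTimeMillis() { return System.currentTimeMillis(); }
        @Override public long nanoTime() { return System.nanoTime(); }
    };

    public static void main(String[] args) {
        long startNs = SYSTEM_CLOCK.nanoTime();
        // ... mutation plan creation and execution would run here ...
        long elapsedNs = SYSTEM_CLOCK.nanoTime() - startNs;
        System.out.println("elapsed ns: " + elapsedNs);
    }
}
```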

sanjeet006py avatar Jan 23 '25 09:01 sanjeet006py

@virajjasani @tkhurana @jpisaac can you please review the PR. Thanks

sanjeet006py avatar Jan 27 '25 16:01 sanjeet006py

@sanjeet006py Using nanosecond resolution for measuring calls which themselves take less than 1 ms adds considerable measurement overhead. AFAIK each call to System.nanoTime() itself takes ~25-30 ns, so with 2 calls you are doubling that overhead. Now if the operation itself takes well under 1 ms, let us say 600 ns, you are adding roughly a 10% measurement overhead (2 × ~30 ns on 600 ns of work).

https://www.javaadvent.com/2019/12/measuring-time-from-java-to-kernel-and-back.html
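For context on the overhead numbers above, a rough, JVM- and hardware-dependent way to ballpark the per-call cost of System.nanoTime() is sketched below; a real microbenchmark harness such as JMH would give far more trustworthy figures, so treat this only as an illustration of the argument.

```java
// Very rough ballpark of System.nanoTime() call cost; numbers vary by JVM,
// OS clock source, and hardware. Use JMH for anything rigorous.
public class NanoTimeOverheadSketch {
    public static void main(String[] args) {
        final int iterations = 10_000_000;
        long sink = 0; // keep the calls observable so they are not optimized away
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sink += System.nanoTime();
        }
        long totalNs = System.nanoTime() - start;
        System.out.println("approx ns per nanoTime() call: " + (totalNs / iterations)
            + " (sink=" + (sink != 0) + ")");
    }
}
```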

tkhurana avatar Jan 27 '25 17:01 tkhurana

@sanjeet006py Using nanosecond resolution for measuring calls which themselves take less than 1 ms adds considerable measurement overhead. AFAIK each call to System.nanoTime() itself takes ~25-30 ns, so with 2 calls you are doubling that overhead. Now if the operation itself takes well under 1 ms, let us say 600 ns, you are adding roughly a 10% measurement overhead (2 × ~30 ns on 600 ns of work).

https://www.javaadvent.com/2019/12/measuring-time-from-java-to-kernel-and-back.html

Thanks, it was a great article. I got two key takeaways from the article:

  • Minimize the number of calls to measure time.
  • Measuring time in millis and nanos has roughly the same overhead.

My understanding was that by measuring only mutation plan creation and execution, I was already minimizing the number of calls made to track time. I have seen that mutation plan creation and execution take on the order of 5-15 microseconds per call (hence the nanosecond clock). Should we just track the time taken by a single executeMutation call and not track at finer granularity? Though for the executeMutation call we should still track at nanosecond granularity.

sanjeet006py avatar Jan 27 '25 19:01 sanjeet006py

Have you looked at the time taken by the execution of the mutationPlan for an upsert select on a table which has a large number of rows, and for client-side deletes which first require rows to be scanned on the server to determine which rows are to be deleted? I am pretty sure their execution times will be much larger.

tkhurana avatar Jan 27 '25 23:01 tkhurana

Have you looked at the time taken by the execution of the mutationPlan for an upsert select on a table which has a large number of rows, and for client-side deletes which first require rows to be scanned on the server to determine which rows are to be deleted? I am pretty sure their execution times will be much larger.

I haven't benchmarked this use case explicitly, but I agree that execution of the mutation plan will take longer there, since the rows first need to be scanned to prepare the row mutations. I benchmarked upsert values, and that is pretty fast for a single call (5-15 microseconds), which is why I decided to measure elapsed time in nanoseconds.

sanjeet006py avatar Jan 28 '25 04:01 sanjeet006py

@tkhurana based on the response here, please let me know if there is any concern with the overall change in this PR. Thanks

sanjeet006py avatar Jan 29 '25 14:01 sanjeet006py

Should we just track the time taken by a single executeMutation call and not track at finer granularity? Though for the executeMutation call we should still track at nanosecond granularity.

IMO we don't need to track at finer granularity. Just track the time taken by a single executeMutation call. You can use nanosecond granularity for it.
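A minimal sketch of that single-measurement approach, assuming a simple accumulating counter (the names below are hypothetical, not the actual Phoenix mutation-metrics plumbing):

```java
import java.util.concurrent.atomic.AtomicLong;

// Time one executeMutation-style call with a single pair of nanoTime readings
// and accumulate the result at nanosecond granularity. Hypothetical names;
// the real metric registration in Phoenix looks different.
public class ExecuteMutationTimingSketch {
    private static final AtomicLong MUTATION_TIME_NS = new AtomicLong();

    static void timedExecuteMutation(Runnable executeMutation) {
        long startNs = System.nanoTime();
        try {
            executeMutation.run();
        } finally {
            MUTATION_TIME_NS.addAndGet(System.nanoTime() - startNs);
        }
    }

    public static void main(String[] args) {
        timedExecuteMutation(() -> { /* upsert/delete work would happen here */ });
        System.out.println("total executeMutation ns: " + MUTATION_TIME_NS.get());
    }
}
```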

tkhurana avatar Feb 06 '25 17:02 tkhurana

IMO we don't need to track at finer granularity. Just track the time taken by a single executeMutation call. You can use nanosecond granularity for it.

@tkhurana Recently, we saw a huge amount of time being spent in the executeMutation call, and the reason turned out to be an extra 1 ms coming from mutation plan creation. So, if we don't track and publish at this granularity, then during debugging we won't know which area to look into, i.e. mutation plan creation, execution, or something else. That is why I wanted to track at finer granularity. WDYT?

sanjeet006py avatar Feb 07 '25 14:02 sanjeet006py

@sanjeet006py The majority of the time, the time to create the mutationPlan is so insignificant that this metric won't be useful. The only time you are interested in this metric is when the time to create the plan is on the order of milliseconds. Maybe you could simply log those cases.

tkhurana avatar Feb 10 '25 15:02 tkhurana

@sanjeet006py The majority of the time, the time to create the mutationPlan is so insignificant that this metric won't be useful. The only time you are interested in this metric is when the time to create the plan is on the order of milliseconds. Maybe you could simply log those cases.

Sure, that sounds good. But I would still need to track the time taken by mutation plan creation, and then log it only when it crosses some threshold like 0.5 ms. I hope that is fine?

sanjeet006py avatar Feb 10 '25 16:02 sanjeet006py

Sure, that sounds good. But I would still need to track the time taken by mutation plan creation, and then log it only when it crosses some threshold like 0.5 ms. I hope that is fine?

Just pick a threshold value that doesn't make the log too noisy since this is a heavily used code path.

tkhurana avatar Feb 10 '25 18:02 tkhurana

IMO we don't need to track at finer granularity. Just track the time taken by a single executeMutation call. You can use nanosecond granularity for it.

@tkhurana Recently, we saw a huge amount of time being spent in the executeMutation call, and the reason turned out to be an extra 1 ms coming from mutation plan creation. So, if we don't track and publish at this granularity, then during debugging we won't know which area to look into, i.e. mutation plan creation, execution, or something else. That is why I wanted to track at finer granularity. WDYT?

An extra 1 ms accounted for a huge time taken? How huge was the huge time?

Did the 1 ms include any metadata RPCs, and if so, should the metric be captured specifically for calls to SYSTEM.CATALOG or meta or something similar? In this way, we need not spend cycles measuring local work that is expected to be fast.

Is this metric/log only going to be useful in the cases where we send RPCs, or do we think that we really are spending a lot of time locally in planning, without any network calls?

Any JVM pause for GC or otherwise could likely last longer than 0.5 milliseconds, so the log message, if that's the choice, shouldn't be misleading that it is some kind of error or egregious performance scenario.

I suppose we do want to know metrics around how many RPCs are required to serve the request, especially when those include additional RPCs like system RPCs, which may not always be required. But those are more difficult to instrument, and so we are choosing to instrument mutation planning, only because it's a top-level "span" and we don't need to plumb the instrumentation all the way down through the code?

d-c-manning avatar Feb 19 '25 00:02 d-c-manning

An extra 1 ms accounted for a huge time taken? How huge was the huge time?

An extra 1 ms was spent in every executeUpdate() call, and a user was trying to insert 2M rows, so it ended up taking ~26 mins, whereas without that extra 1 ms per executeUpdate() the 2M rows can be inserted within 24-25 secs. And the intention was ultimately to insert 40M rows, which would have shown a regression measured in hours. The extra 1 ms was being spent getting table metadata from SYSCAT.

Did the 1 ms include any metadata RPCs, and if so, should the metric be captured specifically for calls to SYSTEM.CATALOG or meta or something similar? In this way, we need not spend cycles measuring local work that is expected to be fast.

The extra 1 ms did come from the time spent in a call to SYSCAT to get table metadata, and we can add a metric that specifically tells us whether a call to the SYSCAT RS was made or not. But in my opinion a regression can be introduced in these paths at any time: this time the regression was an extra metadata call, but next time it could be something else. So I think we should capture the time taken by each stage to ensure that every stage is behaving as expected.

Is this metric/log only going to be useful in the cases where we send RPCs, or do we think that we really are spending a lot of time locally in planning, without any network calls?

I believe this metric can help us highlight any future regressions introduced in the mutation planning path; this time, the first task was just to confirm that a regression had been introduced at all. Currently, because of the lack of metrics for the mutation planning stages, we can't even tell whether a regression has been introduced.

Any JVM pause for GC or otherwise could likely last longer than 0.5 milliseconds, so the log message, if that's the choice, shouldn't be misleading that it is some kind of error or egregious performance scenario.

Yeah, I plan to log a generic warning message that mutation planning took x time compared to some expected threshold y, and I also plan to make that threshold configurable on the client side via a config.
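A minimal sketch of that threshold-based warning, assuming a client-side system property for the threshold (the property name, default, and logging below are illustrative, not an actual Phoenix config key):

```java
// Log a warning only when mutation plan creation exceeds a configurable
// nanosecond threshold. Property name and default are illustrative.
public class SlowPlanLoggingSketch {
    // e.g. -Dmutation.plan.slow.threshold.ns=500000 for a 0.5 ms threshold
    private static final long SLOW_PLAN_THRESHOLD_NS =
        Long.getLong("mutation.plan.slow.threshold.ns", 500_000L);

    static void compilePlanWithLogging(Runnable compilePlan) {
        long startNs = System.nanoTime();
        compilePlan.run();
        long elapsedNs = System.nanoTime() - startNs;
        if (elapsedNs > SLOW_PLAN_THRESHOLD_NS) {
            // A real implementation would go through the project logger (SLF4J).
            System.err.println("Mutation plan creation took " + elapsedNs
                + " ns, above the expected threshold of " + SLOW_PLAN_THRESHOLD_NS + " ns");
        }
    }

    public static void main(String[] args) {
        compilePlanWithLogging(() -> { /* plan compilation would happen here */ });
    }
}
```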

I suppose we do want to know metrics around how many RPCs are required to serve the request, especially when those include additional RPCs like system RPCs, which may not always be required. But those are more difficult to instrument, and so we are choosing to instrument mutation planning, only because it's a top-level "span" and we don't need to plumb the instrumentation all the way down through the code?

Actually, I didn't think of adding a metric to record whether an RPC call is made to SYSCAT to get metadata, as my aim was to have a way to tell in the future that a regression has been introduced in the mutation planning stage, since it is a very hot method call. Having such a metric recording whether an RPC call was made to the SYSCAT RS would help with quick root-causing: once we know that a regression has been introduced in mutation planning, we can check that metric to quickly root-cause it. WDYT?

sanjeet006py avatar Feb 24 '25 09:02 sanjeet006py

Makes sense to me. But then if this is a heavily used code path, isn't a metric a better choice than a log? It sounds like you would want a threshold that is < 1 ms, but that threshold can be exceeded on a JVM pause or on any RPC call. It will be noisy, but it sounds like if it's much greater than 1 ms, then it won't be useful to you. I understand there was some regression that had more metadata RPC calls than expected... but it will still be normal for the occasional plan to require a metadata RPC. And if so, that wouldn't necessarily be a problem that deserves a log? Would the metric be sufficient then? Maybe I misunderstand the value we get from the log vs. the metric. CC @tkhurana

d-c-manning avatar Feb 25 '25 06:02 d-c-manning