
HPCC-31353 Report the slowest 5 activities in the roxie complete line

Open ghalliday opened this issue 11 months ago • 5 comments

Type of change:

  • [ ] This change is a bug fix (non-breaking change which fixes an issue).
  • [x] This change is a new feature (non-breaking change which adds functionality).
  • [ ] This change improves the code (refactor or other change that does not change the functionality)
  • [ ] This change fixes warnings (the fix does not alter the functionality or the generated code)
  • [ ] This change is a breaking change (fix or feature that will cause existing behavior to change).
  • [ ] This change alters the query API (existing queries will have to be recompiled)

Checklist:

  • [x] My code follows the code style of this project.
    • [x] My code does not create any new warnings from compiler, build system, or lint.
  • [x] The commit message is properly formatted and free of typos.
    • [x] The commit message title makes sense in a changelog, by itself.
    • [x] The commit is signed.
  • [ ] My change requires a change to the documentation.
    • [ ] I have updated the documentation accordingly, or...
    • [ ] I have created a JIRA ticket to update the documentation.
    • [ ] Any new interfaces or exported functions are appropriately commented.
  • [x] I have read the CONTRIBUTORS document.
  • [x] The change has been fully tested:
    • [ ] I have added tests to cover my changes.
    • [ ] All new and existing tests passed.
    • [ ] I have checked that this change does not introduce memory leaks.
    • [ ] I have used Valgrind or similar tools to check for potential issues.
  • [ ] I have given due consideration to all of the following potential concerns:
    • [ ] Scalability
    • [ ] Performance
    • [ ] Security
    • [ ] Thread-safety
    • [ ] Cloud-compatibility
    • [ ] Premature optimization
    • [ ] Existing deployed queries will not be broken
    • [ ] This change fixes the problem, not just the symptom
    • [ ] The target branch of this pull request is appropriate for such a change.
  • [ ] There are no similar instances of the same problem that should be addressed
    • [ ] I have addressed them here
    • [ ] I have raised JIRA issues to address them separately
  • [ ] This is a user interface / front-end modification
    • [ ] I have tested my changes in multiple modern browsers
    • [ ] The component(s) render as expected

Smoketest:

  • [ ] Send notifications about my Pull Request position in Smoketest queue.
  • [ ] Test my draft Pull Request.

Testing:

ghalliday avatar Feb 27 '24 13:02 ghalliday

https://track.hpccsystems.com/browse/HPCC-31353 Jira updated

github-actions[bot] avatar Feb 27 '24 13:02 github-actions[bot]

Pushed for discussion (although I think it could be merged). I don't particularly like the way that an extra parameter is needed to be able to record the ids, but a more general approach would be less efficient, and I suspect the stats merging should be re-examined. Questions: Should the number of activities be optional/configurable? (Probably relatively easy.) Should I keep the number of activities? It would save a few compares, but complicate the code.

See jira for sample output.

ghalliday avatar Feb 27 '24 13:02 ghalliday

My main concern is the performance impact of gathering this information. mergeStats may be called a lot, especially in a child query scenario. Is the information going to be useful enough, often enough?

I am not sure I would recommend allowing configuration of the number, as picking a higher number will have a larger negative impact on performance.

richardkchapman avatar Feb 28 '24 10:02 richardkchapman

Conclusions from discussion:

  • It would be better to only do this if the query was above a certain threshold/SLA
  • Even better would be to generate a stats workunit for a query which exceeded the SLA - with a limit of only doing it once every minute/5 minutes.
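
For readers following the thread, the second conclusion (capture stats only when the SLA is exceeded, and at most once per interval) could be sketched roughly as below. This is only an illustration under stated assumptions: `StatsCaptureLimiter`, `shouldCapture`, and the idea of passing the current time in as a parameter are all hypothetical names and choices, not anything taken from the HPCC-Platform code.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>

// Hypothetical sketch: decides whether a slow query should trigger a stats capture.
// None of these names come from the HPCC-Platform sources.
class StatsCaptureLimiter
{
public:
    explicit StatsCaptureLimiter(std::chrono::milliseconds minInterval)
        : intervalMs(minInterval.count()) {}

    // Returns true only if the query exceeded its SLA and no capture has been
    // granted within the last minInterval. 'now' is assumed to be non-negative
    // and is passed in so the logic can be exercised without a real clock.
    bool shouldCapture(std::chrono::milliseconds elapsed,
                       std::chrono::milliseconds sla,
                       std::chrono::milliseconds now)
    {
        if (elapsed <= sla)
            return false; // within the SLA: nothing to record
        const std::int64_t nowMs = now.count();
        std::int64_t last = lastCaptureMs.load();
        // Retry while this thread is still eligible; a failed exchange reloads
        // 'last', so another thread winning the race ends the loop.
        while (last == neverCaptured || nowMs - last >= intervalMs)
        {
            if (lastCaptureMs.compare_exchange_weak(last, nowMs))
                return true;
        }
        return false;
    }

private:
    static constexpr std::int64_t neverCaptured = -1;
    const std::int64_t intervalMs;
    std::atomic<std::int64_t> lastCaptureMs{neverCaptured};
};
```

With a one-minute interval, the first query over its SLA would be captured and any further slow queries in the following 60 seconds would be skipped, which bounds the overhead regardless of how many queries are slow.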

ghalliday avatar Feb 28 '24 11:02 ghalliday

@richardkchapman I have added some timing tests. It has an impact of ~1ns for every activity that isn't in the top 5 and about 10ns for each activity that is, so an impact of ~50us for a very complex query. That is almost certainly lower than the impact of aggregating the rest of the stats. It is only called when the final results are aggregated, so child queries etc. will not impact it.
After further reflection I think it is worthwhile because many roxie queries are not soapcall bound, and this provides some useful debugging information when there is a problem - full stats are much better when rerunning the query.

Thoughts/opinions? @mckellyln renamed slow to slowest, rebased and squashed.
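
To illustrate why the cost can differ so much between activities that are in the top 5 and those that are not, here is a minimal sketch of one way to keep a small sorted "slowest N" list. It assumes a fixed-size array with a single compare as the rejection test; `SlowestTracker`, `SlowActivity`, and `note` are hypothetical names, and the actual change in this PR may be structured quite differently.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical sketch, not the HPCC-Platform implementation.
struct SlowActivity
{
    unsigned activityId = 0;
    std::uint64_t elapsedNs = 0;
};

template <std::size_t N>
class SlowestTracker
{
    static_assert(N > 0, "must track at least one activity");
public:
    // Record one activity's elapsed time.
    void note(unsigned activityId, std::uint64_t elapsedNs)
    {
        // Fast path: one compare rejects anything that is not slower
        // than the current N-th slowest entry.
        if (elapsedNs <= slowest[N - 1].elapsedNs)
            return;
        // Slow path: shift faster entries down, keeping descending order.
        std::size_t pos = N - 1;
        while (pos > 0 && slowest[pos - 1].elapsedNs < elapsedNs)
        {
            slowest[pos] = slowest[pos - 1];
            --pos;
        }
        slowest[pos].activityId = activityId;
        slowest[pos].elapsedNs = elapsedNs;
    }

    // Entry 0 is the slowest; an elapsedNs of zero marks an unused slot.
    const SlowActivity & get(std::size_t i) const
    {
        assert(i < N);
        return slowest[i];
    }

private:
    std::array<SlowActivity, N> slowest{};
};
```

In this sketch, most activities take only the fast path, and a larger N makes the slow path longer, which is consistent with the concern raised earlier about making the number configurable.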

ghalliday avatar May 08 '24 14:05 ghalliday

@mckellyln would this be better to only report if the slowest activity was above a certain threshold (e.g., 10ms), or is it always useful?

ghalliday avatar May 15 '24 16:05 ghalliday

@ghalliday yes - I think it is a good idea to skip this if the slowest activity was less than some configurable threshold (10 ms default).

mckellyln avatar May 15 '24 17:05 mckellyln

Added an extra guard condition, ignoring all activities < 10ms. (Compare ignoring case.)
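
As a rough illustration of such a guard, the sketch below drops any activity under the threshold when building the reported text. The function name `formatSlowest`, the `id=Nms` output format, and the default constant are all assumptions made for this example; the real output format is the one shown in the Jira ticket.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical sketch; names and output format are illustrative only.
struct ActivityTime
{
    unsigned activityId;
    std::uint64_t elapsedNs;
};

constexpr std::uint64_t defaultThresholdNs = 10ULL * 1000 * 1000; // 10ms

// Builds a comma-separated summary, ignoring activities below the threshold.
// Returns an empty string when nothing is slow enough to report.
std::string formatSlowest(const std::vector<ActivityTime> & slowest,
                          std::uint64_t thresholdNs = defaultThresholdNs)
{
    std::string text;
    for (const ActivityTime & cur : slowest)
    {
        if (cur.elapsedNs < thresholdNs)
            continue; // guard: too fast to be worth reporting
        if (!text.empty())
            text += ',';
        text += std::to_string(cur.activityId);
        text += '=';
        text += std::to_string(cur.elapsedNs / 1000000); // whole milliseconds
        text += "ms";
    }
    return text;
}
```

An empty result lets the caller omit the field from the complete line entirely when no activity reaches the threshold.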

ghalliday avatar May 28 '24 14:05 ghalliday