HPCC-Platform icon indicating copy to clipboard operation
HPCC-Platform copied to clipboard

HPCC-31381 Forward trace summary values to Open Telemetry spans

Open timothyklemm opened this issue 10 months ago • 7 comments

Forward all trace summary values recorded using the CTxSummary class to the current thread's active trace span. Cumulative timers are forwarded at the end of each transaction, while all other values are forwarded immediately. While the trace summary avoids duplicated names, duplicate names will be forwarded.

Support Open Telemetry naming conventions by accepting a span attribute name for all values other than cumulative timers. This name is optional. The summary name is forwarded by default. All names are converted to snake case prior to forwarding.

Support Open Telemetry expected units of measurement by accepting a summary value scaling indicator for all integral values. This indicator is optional. No scaling is applied by default, with the exception of cumulative timer and timestamp values that are presumed to be milliseconds.

Encapsulate forwarding logic for portability to alternate ESP implementations.

Signed-off-by: Tim Klemm [email protected]

Type of change:

  • [ ] This change is a bug fix (non-breaking change which fixes an issue).
  • [x] This change is a new feature (non-breaking change which adds functionality).
  • [ ] This change improves the code (refactor or other change that does not change the functionality)
  • [ ] This change fixes warnings (the fix does not alter the functionality or the generated code)
  • [ ] This change is a breaking change (fix or feature that will cause existing behavior to change).
  • [ ] This change alters the query API (existing queries will have to be recompiled)

Checklist:

  • [x] My code follows the code style of this project.
    • [ ] My code does not create any new warnings from compiler, build system, or lint.
  • [x] The commit message is properly formatted and free of typos.
    • [ ] The commit message title makes sense in a changelog, by itself.
    • [ ] The commit is signed.
  • [ ] My change requires a change to the documentation.
    • [ ] I have updated the documentation accordingly, or...
    • [ ] I have created a JIRA ticket to update the documentation.
    • [ ] Any new interfaces or exported functions are appropriately commented.
  • [x] I have read the CONTRIBUTORS document.
  • [x] The change has been fully tested:
    • [ ] I have added tests to cover my changes.
    • [ ] All new and existing tests passed.
    • [ ] I have checked that this change does not introduce memory leaks.
    • [ ] I have used Valgrind or similar tools to check for potential issues.
  • [x] I have given due consideration to all of the following potential concerns:
    • [ ] Scalability
    • [ ] Performance
    • [ ] Security
    • [ ] Thread-safety
    • [ ] Cloud-compatibility
    • [ ] Premature optimization
    • [ ] Existing deployed queries will not be broken
    • [ ] This change fixes the problem, not just the symptom
    • [ ] The target branch of this pull request is appropriate for such a change.
  • [x] There are no similar instances of the same problem that should be addressed
    • [ ] I have addressed them here
    • [ ] I have raised JIRA issues to address them separately
  • [ ] This is a user interface / front-end modification
    • [ ] I have tested my changes in multiple modern browsers
    • [ ] The component(s) render as expected

Smoketest:

  • [ ] Send notifications about my Pull Request position in Smoketest queue.
  • [ ] Test my draft Pull Request.

Testing:

timothyklemm avatar Apr 19 '24 13:04 timothyklemm

Jira Issue: https://hpccsystems.atlassian.net//browse/HPCC-31381

Jirabot Action Result: Workflow Transition: Merge Pending Updated PR

github-actions[bot] avatar Apr 19 '24 13:04 github-actions[bot]

@timothyklemm I don't know why, but no ECL code can run on the platform build from this PR. I can't find any error or crash message in log files or core file. It seems the whole platform stalled somewhere. All components are up and running.

AttilaVamos avatar Apr 22 '24 16:04 AttilaVamos

@AttilaVamos I noticed. I also noticed that other PRs aren't failing as consistently. I don't have an explanation yet, which is why I haven't asked for any reviews.

timothyklemm avatar Apr 22 '24 16:04 timothyklemm

@timothyklemm It seems fine now, thanks.

AttilaVamos avatar Apr 22 '24 18:04 AttilaVamos

I missed the threaded active span functionality, which can improve this implementation. Will submit an updated version.

timothyklemm avatar Apr 23 '24 12:04 timothyklemm

I think one of the main concerns has been the TxSummary lifespan is different than the active span, are we ensuring all TxSummary attributes are being forwarded/set on the current span while it is active?

With the exception of cumulative timer values, values added to the trace summary are forwarded to a span before the summary stores the value for its own use. Timers can't logically be forwarded before the end of the transaction, thus targeting the root server span, because the internal timer object is exposed to the ESP for updates at any time during the transaction.

This solution does not cache attributes prior to the existence of an active span. Both the ESP context and the HTTP server add values to the trace summary before creating the root server span, and these values are not forwarded. But the ESP context will reset the values it adds in its constructor before logging the summary, so those values will be forwarded then.

timothyklemm avatar Apr 25 '24 18:04 timothyklemm

On my radar - I am planning to take a look next week.

ghalliday avatar Apr 26 '24 13:04 ghalliday

@ghalliday this has been waiting for feedback after my last update. Now the feedback has me questioning if I understand the problem I'm attempting to solve.

timothyklemm avatar Jun 17 '24 12:06 timothyklemm

Will no longer attempt to reuse instrumentation. TxSummary will remain unchanged for now. Values we want in spans will be added directly to the appropriate span.

timothyklemm avatar Jun 27 '24 14:06 timothyklemm