apm-agent-python icon indicating copy to clipboard operation
apm-agent-python copied to clipboard

Make OT binding support explicit timestamps.

Open AndreiPashkin opened this issue 5 years ago • 15 comments

What does this pull request do?

OpenTracing API allows to explicitly set start and end timestamps of spans but the implementation in this project does not allow that. That patch fixes it.

Closes #1126

AndreiPashkin avatar May 08 '20 20:05 AndreiPashkin

💚 CLA has been signed

:green_heart: Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

Build stats

  • Start Time: 2021-10-13T16:27:38.163+0000

  • Duration: 31 min 37 sec

  • Commit: d16b3cc19033da48037cb31b1ef44d5decb690de

Test stats :test_tube:

Test Results
Failed 0
Passed 10007
Skipped 8977
Total 18984

:green_heart: Flaky test report

Tests succeeded.

:robot: GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

apmmachine avatar May 08 '20 20:05 apmmachine

jenkins run the tests please

basepi avatar May 08 '20 21:05 basepi

Also, @AndreiPashkin, please sign the CLA when you get a chance.

basepi avatar May 08 '20 21:05 basepi

Also, @AndreiPashkin, please sign the CLA when you get a chance.

Hmm, but I've signed it, here is the PDF: https://www.dropbox.com/s/oufsg1mc17hlqvy/Please_DocuSign_Individual_Contributor_Licen%20%281%29.pdf?dl=0

Why doesn't the check turn green?

I'm also puzzled - how can I run tests so that I can use PDB for debugging locally? Could you please advise? I'd like to add one test case.

AndreiPashkin avatar May 09 '20 05:05 AndreiPashkin

Looks like CLA went green, maybe it was delayed.

I'm also puzzled - how can I run tests so that I can use PDB for debugging locally?

I don't tend to use PDB, but running the tests should be as simple as installing the requirements file that is most relevant (at minimum, reqs-base.txt), and then run py.test to run all the tests. I'm guessing PDB can attach to pytest runs without issue -- most of my debugging is either via logging or via the VSCode debugger.

basepi avatar May 11 '20 17:05 basepi

Let me actually write a test.

AndreiPashkin avatar May 13 '20 10:05 AndreiPashkin

@basepi, @beniwohli, I've observed some strange when tested my changes.

Here is my test-script:

import logging
import sys
import time


from elasticapm import Client
from elasticapm.contrib.opentracing import Tracer


client = Client({
    'SERVICE_NAME': 'test-with_changes',
    'SERVER_URL': 'http://localhost:8888'
})
tracer = Tracer(client_instance=client)

logging.basicConfig(stream=sys.stderr, level=logging.DEBUG)

START = time.time()

span_1 = tracer.start_span('parent', start_time=START)
span_2 = tracer.start_span('child', child_of=span_1, start_time=START + 2)
span_2.finish(finish_time=START + 10)
span_1.finish(finish_time=START + 15)

When I install the latest release with pip install -I elastic-apm and execute it - I see such a picture: image Both parent and child are visible but they have incorrect timestamps recorded (and making it recorded correctly is the purpose of this PR).

When I install my changes (pip install -I https://github.com/AndreiPashkin/apm-agent-python/archive/explicit-timestamps-opentracing.zip) and re-run my script (changing the service name to better observe the differences) - I see the following: image

  1. Only the parent is seen.
  2. Timestamps have gigantic values.

I've also captured payload that was sent to the Elastic APM instance in both cases.

Without changes:

{"metadata": {"service": {"name": "test-without_changes", "environment": null, "version": null, "agent": {"name": "python", "version": "5.6.0"}, "language": {"name": "python", "version": "3.7.7"}, "runtime": {"name": "CPython", "version": "3.7.7"}}, "process": {"pid": 30721, "ppid": 2071, "argv": ["test_elastic.py"], "title": null}, "system": {"hostname": "andrei-yoga514", "architecture": "x86_64", "platform": "linux"}}}
{"span": {"id": "da6e65e8787355b7", "transaction_id": "f5c2f6d6740a0cd4", "trace_id": "4366c9a4a3cbc6c54abbf84e10d4d8a5", "parent_id": "f5c2f6d6740a0cd4", "name": "child", "type": "code", "subtype": "custom", "action": null, "sync": true, "timestamp": 1589642748818084, "duration": 0.07035498856566846}}
{"transaction": {"id": "f5c2f6d6740a0cd4", "trace_id": "4366c9a4a3cbc6c54abbf84e10d4d8a5", "name": "parent", "type": "custom", "duration": 0.24842699349392205, "result": "", "timestamp": 1589642748818004, "sampled": true, "span_count": {"started": 1, "dropped": 0}, "context": {"tags": {}}}}

With changes:

{"metadata": {"service": {"name": "test-with_changes", "environment": null, "version": null, "agent": {"name": "python", "version": "5.6.0"}, "language": {"name": "python", "version": "3.7.7"}, "runtime": {"name": "CPython", "version": "3.7.7"}}, "process": {"pid": 31801, "ppid": 2071, "argv": ["test_elastic.py"], "title": null}, "system": {"hostname": "andrei-yoga514", "architecture": "x86_64", "platform": "linux"}}}
{"span": {"id": "a1a2d1d77c17595e", "transaction_id": "0af7966685c1e27a", "trace_id": "7adccfb5c60bad87b80e290334e6ee53", "parent_id": "0af7966685c1e27a", "name": "child", "type": "code", "subtype": "custom", "action": null, "sync": true, "timestamp": 3179173112604596, "duration": 8000.0, "stacktrace": [{"abs_path": "test_elastic.py", "filename": "test_elastic.py", "module": "__main__", "function": "<module>", "lineno": 21, "library_frame": false}]}}
{"transaction": {"id": "0af7966685c1e27a", "trace_id": "7adccfb5c60bad87b80e290334e6ee53", "name": "parent", "type": "custom", "duration": 1589530226905.7947, "result": "", "timestamp": 1589642898698801, "sampled": true, "span_count": {"started": 1, "dropped": 0}, "context": {"tags": {}}}}

Does anyone have any ideas about what could be wrong?

AndreiPashkin avatar May 16 '20 15:05 AndreiPashkin

@beniwohli, can you also please explain this code: http://github.com/elastic/apm-agent-python/blob/master/elasticapm/traces.py#L140-L140

Why is start provided both self.timestamp and self.start_time are assigned to it but if start is not provided self.timestamp is assigned epoch timestamp and self.start_time is assigned result of time.perf_time() which is usually equals to time.monotonic()?

AndreiPashkin avatar May 16 '20 16:05 AndreiPashkin

@beniwohli

it's not bulletproof as far as I can tell (e.g. the system clock could jump right after you measured the delta, but before applying it to the non-monotonic timestamp).

But if system clock is jumping then the agent is going to send skewed data in any case - so there is no bullet-proofness against that in current master already and my PR does not change that. TimestampConversionUtil simply converts timestamps between clocks without adding or removing anything in regards to protection against clock jumps.

As such, I'm a bit wary to introduce this into a major code path of the agent. However, I'd be open to it if you moved the usage of TimestampConversionUtil to the opentracing bridge.

Should I add another private argument for system-clock timestamp and pass it from the OT bridge along with the monotonic timestamp?

AndreiPashkin avatar Jun 06 '20 09:06 AndreiPashkin

@beniwohli Could you please reply on my prev. comment?

AndreiPashkin avatar Aug 26 '20 09:08 AndreiPashkin

:green_heart: Flaky test report

Tests succeeded.

Expand to view the summary

Test stats :test_tube:

Test Results
Failed 0
Passed 9189
Skipped 6837
Total 16026

apmmachine avatar Oct 27 '20 17:10 apmmachine

@basepi, @beniwohli, could you please look into it?

AndreiPashkin avatar Mar 17 '21 11:03 AndreiPashkin

@beniwohli As far as I can tell, the new TimestampConversionUtil is only used within the OT bridge, and even then only if a start time is defined, so I don't think this is actually going to introduce that additional complexity across core transaction/span duration measurements.

For my sanity, @AndreiPashkin, can you give me the tl;dr of why that class is necessary? Are we just trying to avoid the unlikely, but possible chance that we could measure the difference between the clocks at the point of a daylight saving change, and thus convert with an hour mistake in the duration? Are there other examples of jumps in non-monotonic clocks that we're trying to detect/prevent?

basepi avatar Jun 10 '21 17:06 basepi

Hey @AndreiPashkin -- I'd still like to get this in. I ended up writing a similar (but a bit simpler) timestamp conversion util for some AWS lambda work I was doing. It was just merged in #1366. Can you merge from upstream/master and switch to that new util? We can then get this merged.

Thanks!

basepi avatar Oct 29 '21 14:10 basepi

Closing this due to lack of reply.

basepi avatar Aug 25 '22 18:08 basepi