feat(instrumentation): implement `require-in-the-middle` singleton
Which problem is this PR solving?
This PR changes the behavior of instrumentation plugins to create a single `require` patch, instead of one per plugin. This should be a significant performance improvement for projects using many instrumentation plugins.
Fixes https://github.com/open-telemetry/opentelemetry-js/issues/3029.
Short description of the changes
This PR creates a `require-in-the-middle` singleton that looks for registered `onRequire` functions for a given module and executes them.
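Conceptually, the singleton works like the following sketch (simplified: an exact-name map instead of the `ModuleNameTrie` the PR adds, and the class and method names here are illustrative rather than the PR's exact API):

```typescript
import * as RequireInTheMiddle from 'require-in-the-middle';

type OnRequireFn = (exports: unknown, name: string, basedir?: string) => unknown;

class RequireInTheMiddleSingletonSketch {
  private static _instance?: RequireInTheMiddleSingletonSketch;
  private readonly _handlers = new Map<string, OnRequireFn[]>();

  private constructor() {
    // Patch `require` exactly once; `null` hooks every module, and
    // `internals: true` also surfaces module-internal files.
    RequireInTheMiddle(null, { internals: true }, (exports, name, basedir) => {
      let patched = exports;
      // Run every onRequire registered for the module being required.
      for (const onRequire of this._handlers.get(name) ?? []) {
        patched = onRequire(patched, name, basedir) ?? patched;
      }
      return patched;
    });
  }

  static getInstance(): RequireInTheMiddleSingletonSketch {
    return (this._instance ??= new RequireInTheMiddleSingletonSketch());
  }

  register(moduleName: string, onRequire: OnRequireFn): void {
    const handlers = this._handlers.get(moduleName) ?? [];
    handlers.push(onRequire);
    this._handlers.set(moduleName, handlers);
  }
}
```

An instrumentation would then register its `onRequire` with the shared singleton when it is enabled, instead of creating its own `require-in-the-middle` hook.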
Type of change
- [X] New feature (non-breaking change which adds functionality)
How Has This Been Tested?
This PR is mostly covered by existing tests in `InstrumentationBase.test.ts`. It adds unit tests for the logic in the `_shouldHook` function.
Checklist:
- [X] Followed the style guidelines of this project
- [X] Unit tests have been added
- [ ] Documentation has been updated
The committers listed above are authorized under a signed CLA.
- :white_check_mark: login: mhassan1 / name: Marc Hassan (d5e4bf2ef5fde6bb53e6ee0be0ef40687a10591d)
Awesome, this sounds promising, thank you for the PR :))
Codecov Report
Merging #3161 (ed6e3ef) into main (0d4c71f) will increase coverage by 0.02%. The diff coverage is 96.82%.
Additional details and impacted files
```diff
@@            Coverage Diff             @@
##             main    #3161      +/-   ##
==========================================
+ Coverage   93.44%   93.47%   +0.02%
==========================================
  Files         241      243       +2
  Lines        7260     7321      +61
  Branches     1507     1517      +10
==========================================
+ Hits         6784     6843      +59
- Misses        476      478       +2
```
| Impacted Files | Coverage Δ | |
|---|---|---|
| ...n/src/platform/node/RequireInTheMiddleSingleton.ts | 92.85% <92.85%> (ø) | |
| ...nstrumentation/src/platform/node/ModuleNameTrie.ts | 100.00% <100.00%> (ø) | |
| ...strumentation/src/platform/node/instrumentation.ts | 64.13% <100.00%> (+0.39%) | :arrow_up: |
How can we progress this PR? I am really interested in speeding up the startup of my projects; right now it feels like it takes a few minutes.
I am also interested in getting this PR across the finish line. We are receiving more complaints now that we have more and more instrumentations in auto-instrumentations packages (example: https://github.com/open-telemetry/opentelemetry-js/discussions/3229#discussioncomment-3578446)
@mhassan1 curious what is required for you to bring this PR into a state ready for review? I have been getting increasing questions about startup cost and this would be a huge win for us.
I've moved it out of Draft. I see some failing checks that I will address now.
Awesome @mhassan1 looking forward trying it out :)
I've cleaned up the PR and added more tests. This is ready for review.
Unit tests are failing. I will look into it.
With this change we are making a hard promise never to change the interface of this because of the global singleton. IMO we should upgrade this to 1.0 in order to make this promise explicit. @open-telemetry/javascript-maintainers wdyt
@mhassan1 please refrain from modifying history or force-pushing. It makes it very difficult to track changes.
Unit tests for `@opentelemetry/instrumentation-http` are failing in the pipeline and on my machine. I am looking into it. It seems to be a timing issue between when the `require-in-the-middle` singleton is initialized and when `require('http')` is first called.
This PR seems to introduce a change in behavior that may not be acceptable without a fix (and this is the reason why the `@opentelemetry/instrumentation-http` tests are failing).
Assume this scenario:
```js
require('http')
new HttpInstrumentation()
require('http')
```
In the `main` code, RITM gets set up for `http` after `new HttpInstrumentation()`, so it will get patched on the next call to `require`.

In this PR's code, the singleton RITM will get set up before the first `require`, and our `onRequire` function will not find any matches for `http`, since `new HttpInstrumentation()` hasn't been called yet; RITM will cache the result of our `onRequire`, and the second `require` will not result in our `onRequire` being called, so `http` will never be patched.
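As a sketch, annotating that scenario step by step (assuming the shared hook is installed as soon as `@opentelemetry/instrumentation-http` is imported, as in this PR's eager singleton):

```typescript
import { HttpInstrumentation } from '@opentelemetry/instrumentation-http';

const http1 = require('http'); // hook runs, finds no onRequire registered for 'http',
                               // and RITM caches the unpatched exports

new HttpInstrumentation();     // registers an onRequire for 'http' with the singleton

const http2 = require('http'); // served from the RITM cache: our onRequire never
                               // runs again, so 'http' is never patched
```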
Is this change in behavior acceptable? It seems like it would be too big of a change to tell users "you must instantiate instrumentation before you `require` a package, otherwise it will never get instrumented," since that seems to not be true today.
We already tell users instrumentations must be instantiated first. Your example with `http` only works because it is a system module. If you require a non-system module before initializing the instrumentation, then initialize, then require it again, you will have a cached module that is not instrumented. Might need to test to confirm that, but I'm pretty sure we even warn if you instantiate an instrumentation for a module that is already in the require cache.
Maybe a reasonable solution would be to instantiate RITM only when the first instance of `InstrumentationBase` is created.
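A sketch of that lazy-initialization idea (the names are illustrative, not the PR's actual API):

```typescript
class RequireInTheMiddleSingletonSketch {
  private static _instance?: RequireInTheMiddleSingletonSketch;

  // Nothing is hooked when this module is loaded; the global `require` patch is
  // installed only the first time an instrumentation asks for the singleton,
  // i.e. when the first InstrumentationBase instance is constructed.
  static getInstance(): RequireInTheMiddleSingletonSketch {
    if (!this._instance) {
      // The constructor (as in the earlier sketch) installs the single RITM hook.
      this._instance = new RequireInTheMiddleSingletonSketch();
    }
    return this._instance;
  }
}
```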
With that change in place, I see another problem: `@opentelemetry/instrumentation-http` test suites run sequentially in the same process, and each one does `new HttpInstrumentation()`. If the first test suite ends with `instrumentation.disable()`, the module patch will be removed (as expected), then the second suite's `instrumentation.enable()` will have no effect because the `onRequire` function will never run again.

What do you think is the right way to resolve that? Is it expected behavior for the user to create multiple instances of `new HttpInstrumentation()`, some enabled and some disabled?
Disabling instrumentations is mostly meant as a testing feature, but it needs to work. What is happening in the new version that breaks it? Shouldn't we be able to modify the `disable` method to remove it from the hook cache if needed?
We could figure out how to remove it from the RITM cache, although we would need to reach inside its internal cache to do that. Also, RITM caches by filename, which it doesn't provide to us (link), so we would need to figure out what `filename` was based on `moduleName` and `basedir` so we can remove it from the cache. We would also need to remove all files from the RITM cache related to that instrumentation, so we would need to keep track of all those files.
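For illustration, recovering that cache key might look roughly like this, under the assumption that RITM keys its cache by the resolved filename (the helper is hypothetical):

```typescript
// Hypothetical helper: compute the filename RITM would have used for a module
// that was required as `moduleName` from `basedir`.
function resolveHookedFilename(moduleName: string, basedir?: string): string {
  return basedir
    ? require.resolve(moduleName, { paths: [basedir] })
    : require.resolve(moduleName);
}

// e.g. resolveHookedFilename('koa', '/app/node_modules/some-pkg')
//   -> '/app/node_modules/koa/lib/application.js' (illustrative path)
```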
The difference between `main` and this branch is that `main` creates a fresh RITM (with a fresh cache) each time `new HttpInstrumentation()` is called, whereas this branch keeps a single RITM cache for all to share.
To be clear, `disable` does what we expect (it removes the instrumentation patch); this is the scenario that does not work:
```js
const instrumentation1 = new HttpInstrumentation()
instrumentation1.enable()
require('http') // patch added
instrumentation1.disable() // patch removed
instrumentation1.enable() // patch added again because of https://github.com/open-telemetry/opentelemetry-js/blob/c1ef97de/experimental/packages/opentelemetry-instrumentation/src/platform/node/instrumentation.ts#L144
instrumentation1.disable() // patch removed
const instrumentation2 = new HttpInstrumentation()
instrumentation2.enable() // patch not added again
```
I believe this scenario violates the "instrumentation must be applied before the first `require`" principle, so maybe we need to modify the tests.
Ah, I see what you mean. I think it's quite likely that many of the instrumentations in contrib will have broken tests then, and who knows about third-party-hosted instrumentations. This might just be something we have to grit our teeth and do once, because the performance benefit on startup is worth the pain of updating all the tests. @Rauno56 and @blumamir do most of the contrib maintenance, so I'm curious about their opinions here.
This is a hack that gets the tests passing locally:
```typescript
public disable(): void {
  // ...
  if (process.env.npm_lifecycle_script?.startsWith('nyc ts-mocha')) {
    // Clear the shared RITM cache so subsequent require() calls are hooked again.
    // @ts-expect-error TS2339: Property 'cache' does not exist on type 'Hooked'.
    RequireInTheMiddleSingleton.getGlobalInstance()._hook?.cache.clear();
  }
}
```
This works by clearing the entire RITM cache on every `disable`. It accomplishes the goal because clearing the entire cache will somehow lead to `require('http')` being called again.
I also have a POC of selectively clearing the RITM cache on `disable` of a particular plugin. This doesn't actually solve the testing issue, since `require('http')` is never called again, so the patch is never re-applied.
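Roughly, that selective clearing could look like the following (a sketch assuming the instrumentation records the filenames it hooked and that the cache is the same Map-like object the hack above clears; names and the import path are illustrative):

```typescript
import { RequireInTheMiddleSingleton } from '../src/platform/node/RequireInTheMiddleSingleton';

// Called from disable(): remove only this instrumentation's entries from the
// shared RITM cache, using filenames recorded while onRequire was running.
function clearHookCacheFor(hookedFilenames: Iterable<string>): void {
  for (const filename of hookedFilenames) {
    // @ts-expect-error `cache` is not part of RITM's public typings
    RequireInTheMiddleSingleton.getGlobalInstance()._hook?.cache?.delete(filename);
  }
}
```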
> This is a hack that gets the tests passing locally:

I think before doing that we should move away from using require-in-the-middle or request an API for this via a PR there.
Not sure if it has any influence here, but please note also that it's perfectly fine that several independent instrumentations want to hook the same module. It's just luck that we have `HttpInstrumentation` and not `HttpServerInstrumentation` and `HttpClientInstrumentation`.
This PR supports multiple instrumentations of the same modules, as long as they are all enabled before the first `require` that gets hooked by RITM. The problem we are having now is that our test suites don't adhere to that rule because they all run in the same process.
One potential solution would be to clean up the RITM cache between test suites, which is what that hack does. Another potential solution would be to run each test suite in a fresh process. Another potential solution would be to modify the test suites themselves to share a single set of instrumentations, instead of creating new ones in each suite.
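For the first option, a hedged sketch of a Mocha root hook that resets the shared cache between suites (the import path is illustrative, and the `getGlobalInstance()` / `_hook` access mirrors the hack above rather than any public API):

```typescript
// test/ritm-reset.ts (hypothetical), registered with `mocha --require ./test/ritm-reset.ts`
import { RequireInTheMiddleSingleton } from '../src/platform/node/RequireInTheMiddleSingleton';

export const mochaHooks = {
  afterAll(): void {
    // Drop all cached onRequire results so the next suite's require() calls
    // run through the hook again instead of being served from the RITM cache.
    // @ts-expect-error `cache` is not part of RITM's public typings
    RequireInTheMiddleSingleton.getGlobalInstance()._hook?.cache?.clear();
  },
};
```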
I've just pushed some improvements, including a not-so-hacky way to get the tests to pass without any test changes needed
I have run all tests locally in https://github.com/open-telemetry/opentelemetry-js-contrib against this branch of `@opentelemetry/instrumentation`, and they all pass, except `@opentelemetry/instrumentation-aws-lambda`; that one depends (link) on RITM's behavior related to absolute paths (link), which no longer works when we call `RequireInTheMiddle(null, ...)`. It looks like @dyladan was involved in the PR that enabled absolute path support: https://github.com/elastic/require-in-the-middle/pull/45.
Regarding performance improvement, our large server (~10,000 unique `require` calls) has startup times that look like this:
- without instrumentation: ~20 sec
- with 9 instrumentation plugins (`main`): ~50 sec
- with 9 instrumentation plugins (this branch): ~20 sec
I was indeed involved in that PR. It was required in order to instrument aws lambda using a layer.
Thanks for the research. Sounds like a huge performance improvement. Did you happen to step through to see if it is possible to work around the issue in the AWS Lambda instrumentation, or did you just run the test? If we're intercepting all require calls, shouldn't we be able to match on an exact path the same way RITM does?
I don't think there's anything that `@opentelemetry/instrumentation-aws-lambda` can do as a workaround, since RITM will never call its `onRequire` function for an absolute `require` when `null` is passed to RITM. We will need to pass something other than `null` to RITM, or we will need a patch to RITM to support absolute `require`s when `null` is passed.
I also noticed that RITM passes `name` and `basedir` a little differently in this scenario, so we will need to update our logic in `RequireInTheMiddleSingleton` to handle it (in the current branch's logic, we don't use `basedir`).
Personally, I think it's worth revisiting Lambda in the future and merging this PR, as it will benefit everyone that's not using AWS Lambda. I can imagine AWS/Amazon can have a closer look at it :)
Yes, I am biased, as I can't use AWS in the cases where I am using OpenTelemetry, but I do have slow starts (it feels like it takes minutes) when enabling instrumentation ☺️
I dismissed my review until I have time to review the updated logic. Don't want this to get merged accidentally before that based on my outdated ✅