feat(instrumentation): implement `require-in-the-middle` singleton
Which problem is this PR solving?
This PR changes the behavior of instrumentation plugins to create a single `require` patch, instead of one per plugin. This should be a significant performance improvement for projects using many instrumentation plugins.
Fixes https://github.com/open-telemetry/opentelemetry-js/issues/3029.
Short description of the changes
This PR creates a `require-in-the-middle` singleton that looks for registered `onRequire` functions for a given module and executes them.
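Conceptually, the singleton works like the following sketch (simplified: an exact-name map instead of the `ModuleNameTrie` the PR adds, and the class and method names here are illustrative rather than the PR's exact API):

```typescript
import * as RequireInTheMiddle from 'require-in-the-middle';

type OnRequireFn = (exports: unknown, name: string, basedir?: string) => unknown;

class RequireInTheMiddleSingletonSketch {
  private static _instance?: RequireInTheMiddleSingletonSketch;
  private readonly _handlers = new Map<string, OnRequireFn[]>();

  private constructor() {
    // Patch `require` exactly once; `null` hooks every module, and
    // `internals: true` also surfaces module-internal files.
    RequireInTheMiddle(null, { internals: true }, (exports, name, basedir) => {
      let patched = exports;
      // Run every onRequire registered for the module being required.
      for (const onRequire of this._handlers.get(name) ?? []) {
        patched = onRequire(patched, name, basedir) ?? patched;
      }
      return patched;
    });
  }

  static getInstance(): RequireInTheMiddleSingletonSketch {
    return (this._instance ??= new RequireInTheMiddleSingletonSketch());
  }

  register(moduleName: string, onRequire: OnRequireFn): void {
    const handlers = this._handlers.get(moduleName) ?? [];
    handlers.push(onRequire);
    this._handlers.set(moduleName, handlers);
  }
}
```

An instrumentation would then register its `onRequire` with the shared singleton when it is enabled, instead of creating its own `require-in-the-middle` hook.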
Type of change
- [X] New feature (non-breaking change which adds functionality)
How Has This Been Tested?
This PR is mostly covered by existing tests in `InstrumentationBase.test.ts`. It adds unit tests for the logic in the `_shouldHook` function.
Checklist:
- [X] Followed the style guidelines of this project
- [X] Unit tests have been added
- [ ] Documentation has been updated
The committers listed above are authorized under a signed CLA.
- :white_check_mark: login: mhassan1 / name: Marc Hassan (d5e4bf2ef5fde6bb53e6ee0be0ef40687a10591d)
Awesome, this sounds promising, thank you for the PR :))
Codecov Report
Merging #3161 (ed6e3ef) into main (0d4c71f) will increase coverage by 0.02%. The diff coverage is 96.82%.
Additional details and impacted files
```diff
@@            Coverage Diff             @@
##             main    #3161      +/-   ##
==========================================
+ Coverage   93.44%   93.47%   +0.02%
==========================================
  Files         241      243       +2
  Lines        7260     7321      +61
  Branches     1507     1517      +10
==========================================
+ Hits         6784     6843      +59
- Misses        476      478       +2
```
| Impacted Files | Coverage Δ | |
|---|---|---|
| ...n/src/platform/node/RequireInTheMiddleSingleton.ts | 92.85% <92.85%> (ø) | |
| ...nstrumentation/src/platform/node/ModuleNameTrie.ts | 100.00% <100.00%> (ø) | |
| ...strumentation/src/platform/node/instrumentation.ts | 64.13% <100.00%> (+0.39%) | :arrow_up: |
How can we progress this PR? I am really interested in speeding up the startup of my projects; right now it feels like it takes a few minutes.
I am also interested in getting this PR across the finish line. We are receiving more complaints now that we have more and more instrumentations in auto-instrumentations packages (example: https://github.com/open-telemetry/opentelemetry-js/discussions/3229#discussioncomment-3578446)
@mhassan1 curious what is required for you to bring this PR into a state ready for review? I have been getting increasing questions about startup cost and this would be a huge win for us.
I've moved it out of Draft. I see some failing checks that I will address now.
Awesome @mhassan1 looking forward trying it out :)
I've cleaned up the PR and added more tests. This is ready for review.
Unit tests are failing. I will look into it.
With this change we are making a hard promise never to change the interface of this because of the global singleton. IMO we should upgrade this to 1.0 in order to make this promise explicit. @open-telemetry/javascript-maintainers wdyt
@mhassan1 please refrain from modifying history or force-pushing. It makes it very difficult to track changes.
Unit tests for `@opentelemetry/instrumentation-http` are failing in the pipeline and on my machine. I am looking into it. It seems to be a timing issue between when the `require-in-the-middle` singleton is initialized and when `require('http')` is first called.
This PR seems to introduce a change in behavior that may not be acceptable without a fix (and this is the reason why the `@opentelemetry/instrumentation-http` tests are failing).
Assume this scenario:
```js
require('http')
new HttpInstrumentation()
require('http')
```
In the `main` code, RITM gets set up for `http` after `new HttpInstrumentation()`, so it will get patched on the next call to `require`.

In this PR's code, the singleton RITM will get set up before the first `require`, and our `onRequire` function will not find any matches for `http`, since `new HttpInstrumentation()` hasn't been called yet; RITM will cache the result of our `onRequire`, and the second `require` will not result in our `onRequire` being called, so `http` will never be patched.
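As a sketch, annotating that scenario step by step (assuming the shared hook is installed as soon as `@opentelemetry/instrumentation-http` is imported, as in this PR's eager singleton):

```typescript
import { HttpInstrumentation } from '@opentelemetry/instrumentation-http';

const http1 = require('http'); // hook runs, finds no onRequire registered for 'http',
                               // and RITM caches the unpatched exports

new HttpInstrumentation();     // registers an onRequire for 'http' with the singleton

const http2 = require('http'); // served from the RITM cache: our onRequire never
                               // runs again, so 'http' is never patched
```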
Is this change in behavior acceptable? It seems like it would be too big of a change to tell users "you must instantiate instrumentation before you `require` a package, otherwise it will never get instrumented," since that seems to not be true today.
We already tell users instrumentations must be instantiated first. Your example with `http` only works because it is a system module. If you require a non-system module before initializing the instrumentation, then initialize, then require it again, you will have a cached module that is not instrumented. Might need to test to confirm that, but I'm pretty sure we even warn if you instantiate an instrumentation for a module that is already in the require cache.
Maybe a reasonable solution would be to instantiate RITM only when the first instance of `InstrumentationBase` is created.
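A sketch of that lazy-initialization idea (the names are illustrative, not the PR's actual API):

```typescript
class RequireInTheMiddleSingletonSketch {
  private static _instance?: RequireInTheMiddleSingletonSketch;

  // Nothing is hooked when this module is loaded; the global `require` patch is
  // installed only the first time an instrumentation asks for the singleton,
  // i.e. when the first InstrumentationBase instance is constructed.
  static getInstance(): RequireInTheMiddleSingletonSketch {
    if (!this._instance) {
      // The constructor (as in the earlier sketch) installs the single RITM hook.
      this._instance = new RequireInTheMiddleSingletonSketch();
    }
    return this._instance;
  }
}
```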
With that change in place, I see another problem: `@opentelemetry/instrumentation-http` test suites run sequentially in the same process, and each one does `new HttpInstrumentation()`. If the first test suite ends with `instrumentation.disable()`, the module patch will be removed (as expected), then the second suite's `instrumentation.enable()` will have no effect because the `onRequire` function will never run again.

What do you think is the right way to resolve that? Is it expected behavior for the user to create multiple instances of `new HttpInstrumentation()`, some enabled and some disabled?
Disabling instrumentations is mostly meant as a testing feature, but it needs to work. What is happening in the new version that breaks it? Shouldn't we be able to modify the `disable` method to remove it from the hook cache if needed?
We could figure out how to remove it from the RITM cache, although we would need to reach inside its internal cache to do that. Also, RITM caches by filename, which it doesn't provide to us (link), so we would need to figure out what `filename` was based on `moduleName` and `basedir` so we can remove it from the cache. We would also need to remove all files from the RITM cache related to that instrumentation, so we would need to keep track of all those files.
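For illustration, recovering that cache key might look roughly like this, under the assumption that RITM keys its cache by the resolved filename (the helper is hypothetical):

```typescript
// Hypothetical helper: compute the filename RITM would have used for a module
// that was required as `moduleName` from `basedir`.
function resolveHookedFilename(moduleName: string, basedir?: string): string {
  return basedir
    ? require.resolve(moduleName, { paths: [basedir] })
    : require.resolve(moduleName);
}

// e.g. resolveHookedFilename('koa', '/app/node_modules/some-pkg')
//   -> '/app/node_modules/koa/lib/application.js' (illustrative path)
```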
The difference between `main` and this branch is that `main` creates a fresh RITM (with a fresh cache) each time `new HttpInstrumentation()` is called, whereas this branch keeps a single RITM cache for all to share.
To be clear, `disable` does what we expect (it removes the instrumentation patch); this is the scenario that does not work:
```js
const instrumentation1 = new HttpInstrumentation()
instrumentation1.enable()
require('http') // patch added
instrumentation1.disable() // patch removed
instrumentation1.enable() // patch added again because of https://github.com/open-telemetry/opentelemetry-js/blob/c1ef97de/experimental/packages/opentelemetry-instrumentation/src/platform/node/instrumentation.ts#L144
instrumentation1.disable() // patch removed
const instrumentation2 = new HttpInstrumentation()
instrumentation2.enable() // patch not added again
```
I believe this scenario violates the "instrumentation must be applied before the first `require`" principle, so maybe we need to modify the tests.
Ah, I see what you mean. I think it's quite likely that many of the instrumentations in contrib will have broken tests then, and who knows about third-party-hosted instrumentations. This might just be something we have to grit our teeth and do once, because the performance benefit on startup is worth the pain of updating all the tests. @Rauno56 and @blumamir do most of the contrib maintenance, so I'm curious about their opinions here.
This is a hack that gets the tests passing locally:
```typescript
public disable(): void {
  // ...
  if (process.env.npm_lifecycle_script?.startsWith('nyc ts-mocha')) {
    // Clear the shared RITM cache so subsequent require() calls are hooked again.
    // @ts-expect-error TS2339: Property 'cache' does not exist on type 'Hooked'.
    RequireInTheMiddleSingleton.getGlobalInstance()._hook?.cache.clear();
  }
}
```
This works by clearing the entire RITM cache on every `disable`. It accomplishes the goal because clearing the entire cache will somehow lead to `require('http')` being called again.
I also have a POC of selectively clearing the RITM cache on `disable` of a particular plugin. This doesn't actually solve the testing issue, since `require('http')` is never called again, so the patch is never re-applied.
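Roughly, that selective clearing could look like the following (a sketch assuming the instrumentation records the filenames it hooked and that the cache is the same Map-like object the hack above clears; names and the import path are illustrative):

```typescript
import { RequireInTheMiddleSingleton } from '../src/platform/node/RequireInTheMiddleSingleton';

// Called from disable(): remove only this instrumentation's entries from the
// shared RITM cache, using filenames recorded while onRequire was running.
function clearHookCacheFor(hookedFilenames: Iterable<string>): void {
  for (const filename of hookedFilenames) {
    // @ts-expect-error `cache` is not part of RITM's public typings
    RequireInTheMiddleSingleton.getGlobalInstance()._hook?.cache?.delete(filename);
  }
}
```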
> This is a hack that gets the tests passing locally:

I think before doing that we should move away from using require-in-the-middle or request an API for this via a PR there.
Not sure if it has any influence here, but please note also that it's perfectly fine that several independent instrumentations want to hook the same module. It's just luck that we have `HttpInstrumentation` and not `HttpServerInstrumentation` and `HttpClientInstrumentation`.
This PR supports multiple instrumentations of the same modules, as long as they are all enabled before the first `require` that gets hooked by RITM. The problem we are having now is that our test suites don't adhere to that rule because they all run in the same process.
One potential solution would be to clean up the RITM cache between test suites, which is what that hack does. Another potential solution would be to run each test suite in a fresh process. Another potential solution would be to modify the test suites themselves to share a single set of instrumentations, instead of creating new ones in each suite.
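For the first option, a hedged sketch of a Mocha root hook that resets the shared cache between suites (the import path is illustrative, and the `getGlobalInstance()` / `_hook` access mirrors the hack above rather than any public API):

```typescript
// test/ritm-reset.ts (hypothetical), registered with `mocha --require ./test/ritm-reset.ts`
import { RequireInTheMiddleSingleton } from '../src/platform/node/RequireInTheMiddleSingleton';

export const mochaHooks = {
  afterAll(): void {
    // Drop all cached onRequire results so the next suite's require() calls
    // run through the hook again instead of being served from the RITM cache.
    // @ts-expect-error `cache` is not part of RITM's public typings
    RequireInTheMiddleSingleton.getGlobalInstance()._hook?.cache?.clear();
  },
};
```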
I've just pushed some improvements, including a not-so-hacky way to get the tests to pass without any test changes needed
I have run all tests locally in https://github.com/open-telemetry/opentelemetry-js-contrib against this branch of `@opentelemetry/instrumentation`, and they all pass, except `@opentelemetry/instrumentation-aws-lambda`; that one depends (link) on RITM's behavior related to absolute paths (link), which no longer works when we call `RequireInTheMiddle(null, ...)`. It looks like @dyladan was involved in the PR that enabled absolute path support: https://github.com/elastic/require-in-the-middle/pull/45.
Regarding performance improvement, our large server (~10,000 unique `require` calls) has startup times that look like this:
- without instrumentation: ~20 sec
- with 9 instrumentation plugins (`main`): ~50 sec
- with 9 instrumentation plugins (this branch): ~20 sec
I was indeed involved in that PR. It was required in order to instrument aws lambda using a layer.
Thanks for the research. Sounds like a huge performance improvement. Did you happen to step through to see if it is possible to work around the issue in the AWS Lambda instrumentation, or did you just run the test? If we're intercepting all require calls, shouldn't we be able to match on an exact path the same way RITM does?
I don't think there's anything that `@opentelemetry/instrumentation-aws-lambda` can do as a workaround, since RITM will never call its `onRequire` function for an absolute `require` when `null` is passed to RITM. We will need to pass something other than `null` to RITM, or we will need a patch to RITM to support absolute `require`s when `null` is passed.
I also noticed that RITM passes `name` and `basedir` a little differently in this scenario, so we will need to update our logic in `RequireInTheMiddleSingleton` to handle it (in the current branch's logic, we don't use `basedir`).
Personally, I think it's worth revisiting Lambda in the future and merging this PR, as it will benefit everyone that's not using AWS Lambda. I can imagine AWS/Amazon can have a closer look at it :)
Yes, I am biased, as I can't use AWS in the cases where I am using OpenTelemetry, but I do have slow starts (it feels like it takes minutes) when enabling instrumentation ☺️
I dismissed my review until I have time to review the updated logic. Don't want this to get merged accidentally before that based on my outdated ✅