prime-reportstream icon indicating copy to clipboard operation
prime-reportstream copied to clipboard

Refactor default httpclient as a singleton

Open thetaurean opened this issue 1 year ago • 14 comments

User Story

As a dev, I want to use a singleton for shared & reused objects so that the codebase remains readable and maintainable.

Description/Use Case

In #11928 we introduced an HTTPClient that is used throughout the project for making RESTful and other HTTP calls. That client can be specified by way of a parameter to various functions / class constructors in the project. But the parameter is only used for testing. Runtime code is passing in null for the client, which causes HTTPClientUtils to create a new default client for every request.

Instead, this client should be instantiated as a singleton and perhaps an inherited member or function of the BaseEngine.

Risks/Impacts/Considerations

  • Be sure to look at the duplicated code in RESTTransport.kt

Dev Notes

  • Retrieval or instantiation of the client should be mockable to facilitate simple testing

Acceptance Criteria

  • [ ] No more HTTPClient parameters everywhere
  • [ ] Singleton reusable HTTPClient for majority of use casees
  • [ ] Can still create one-time use HTTPClients for edge cases
    • Most (all?) cases can just configure the request and override singleton HTTPClient settings.
  • [ ] All clients are closed when done

thetaurean avatar Apr 04 '24 15:04 thetaurean

Hey team! Please add your planning poker estimate with Zenhub @brick-green @david-navapbc @jack-h-wang @jalbinson @JFisk42 @mkalish @thetaurean

arnejduranovic avatar Apr 15 '24 16:04 arnejduranovic

I've refactored HttpClientUtils.getDefaultHttpClient to return a singleton instead of a new instance. I've also refactored invoke() to include an "also" call to close the client after use.

The ticket speaks to RESTTrasport having duplicate code and indeed it does. The ACs don't speak specifically to migrating that code so I posted a thing on slack asking after it. If there's no objections I'm going to migrate that code to HttoClientUtils and ensure all calls to it are properly re-routed.

Lastly I still need to look into how the clients are being used. The current call to close() is only happening if the caller is using the invoke() method inside HttpClientUtilts or in RESTTransport

david-navapbc avatar Jul 02 '24 23:07 david-navapbc

I think I've run to ground all non-test references to HttpClientUtils. As far as I can tell literally none of them are being created with an httpClient object and thus are are all using the singleton.

Here is the rundown:

  • LookupTableCommands

    • no references found activating the constructor with anything other than a null value for HttpClient
    • all calls appear to use the singleton http client
  • OktaCommands

    • HttpClient is not passed as a constructor argument anywhere this is made
    • all calls appear to use the singleton http client
  • GAENTransport

    • HttpClient is not passed as a constructor argument anywhere this is made
    • all calls appear to use the singleton http client
  • SenderFilesCommand

    • all calls appear to use the singleton http client
  • EmailEngineFunction

    • all calls appear to use the singleton http client
  • SettingCommands

    • all calls appear to use the singleton http client
  • CommandUtilities

    • this one is an odd duck. I'm don't see any clients being created outside of the singleton.
  • HashicorpVaultCredentialService

    • all calls appear to use the singleton http client

as far as the ACs are concerned - I'm not sure what's meant by

No more HTTPClient parameters everywhere

The parameters I'm seeing are ones that specify things like url or request payload or credentials and are thus necessary.

david-navapbc avatar Jul 08 '24 00:07 david-navapbc

Screenshot 2024-07-08 at 7.38.45 PM.png

I don't know why but server2server smoke tests are failing in this branch only. Things are working properly in master. Somehow turning the httpclient in HttpClientUtils has either broken that test or broken the auth flow. I don't see how either could have possibly occurred but something weird is definitely happening and rn my money is on the test being dorked up vs the actual production code.

Everything else seems to be working. Once I run to ground why this is happening I'll put in a PR.

david-navapbc avatar Jul 09 '24 04:07 david-navapbc

not fixed yet. here's where things are at:

  • I verified that the same smoke tests are failing in the ci/cd pipeline so it's not a local issue
  • I've been unable to trace the logic to see what's going on or why
  • I pivoted to selectively undoing the changes I made to HttpClientUtils to see if I could find which of them caused the issue. There are three changes I made. I've eliminated two.
    • the [at]Volitile annotation was not the culprit
    • closing the client in the invoke method was not the culprit
    • something about the manner in which I set up the singleton is causing the issue. I will investigate further tomorrow.

david-navapbc avatar Jul 10 '24 06:07 david-navapbc

I have a working theory.

Inside the default client creation method is a check to see if the caller has supplied an auth argument. If they have, the auth argument is baked into the client as a default header used for all future requests.

Once we make the httpclient object a singleton, whatever the auth setting was at time of client creation is the auth setting from then on. The only way to add it after the fact is to create a new client.

see

https://api.ktor.io/ktor-client/ktor-client-core/io.ktor.client/-http-client/index.html

and

https://api.ktor.io/ktor-client/ktor-client-core/io.ktor.client/-http-client/config.html

I am noodling around with some things but if I'm right then the path forward is going to be to have the singleton object be sans auth header and that will live in HttpClientUtils. When the client specifies they want a client with auth we'll use the "config" call to generate a new client with that auth token.

I think the way we are presently using all of this will result in the auth-generated client to be closed because we're still using HttpClientUtils.invoke() which has the "also" hook at the end where the client is closed.

david-navapbc avatar Jul 11 '24 00:07 david-navapbc

Refactor complete. Functionally here's what's up

  • if the caller requests the default client sans auth (as in - doesn't pass an auth token) they get the httpClient singleton.
  • if the caller requests the default client with auth (as in - passes an auth token) then a new method is called that checks the provided auth token's hash value against one stored in the HttpClientUtils obj. There's no way I can find to inspect the httpClient object and extract that information, hence the need for a hashed and cached copy. If the hashes don't match, a new client is created with the new token and the cached hash is updated. The idea is to allow series of auth calls that use the same credential to use the same client. Only when a client needs to make a call with different credentials is a new client created.
  • the members and methods that handle this are written in a thread-safe manner. regardless of what kind of client (auth or sans auth) the caller requests, there's no way for subsequent callers to muck around with what the first caller got back.

All existing tests are passing. I need to write some new ones to cover the new behavior. Once that's complete it's PR time.

david-navapbc avatar Jul 14 '24 21:07 david-navapbc

made all updates as per reviewer request. awaiting final approval

david-navapbc avatar Jul 18 '24 02:07 david-navapbc

refactored things to not use use{...} and instead put the close() blocks back in. tl'dr - something about use{...} breaks our ability to reuse the client object.

see this for some breadcrumbs

https://stackoverflow.com/questions/65782244/kotlin-multiplatform-jobcancellationexception-parent-job-is-completed

ty @JFisk42 for being awesome as always and helping me t/s this

david-navapbc avatar Jul 24 '24 23:07 david-navapbc

the reason the original PR was passing, as was the latest iteration where "close" appeared to be working - is due to a bug. Props to @arnejduranovic for finding it.

the call }.also { httpClient?.close() }

should have been

}.also { (httpClient ?: getDefaultHttpClient()).close() }

because in the former, the attempt to close the httpClient, in the usual case of a null client being passed to the invocation methods, is being applied to the null httpClient argument and NOT the default httpclient provided by the HttpClientUtils class. Once you attempt to close THAT client things go boom.

upon further investigation I noted in the ktor manual the following:

Note that the close function prohibits creating new requests but doesn't terminate currently active ones. Resources will only be released after all client requests are completed.

If you need to use HttpClient for a single request, call the use function, which automatically calls close after executing the code block

https://ktor.io/docs/client-create-and-configure.html#close-client

I also followed the code to see what "use" and "close was actually doing. See the following in order:

  • https://github.com/ktorio/ktor/blob/86ca00c0a223e91a4d6001f291b8032ddfb79678/ktor-io/js/src/io/ktor/utils/io/core/Closeable.js.kt#L6
  • https://github.com/ktorio/ktor/blob/86ca00c0a223e91a4d6001f291b8032ddfb79678/ktor-io/common/src/io/ktor/utils/io/core/Closeable.kt#L8
  • https://github.com/ktorio/ktor/blob/86ca00c0a223e91a4d6001f291b8032ddfb79678/ktor-client/ktor-client-core/common/src/io/ktor/client/engine/HttpClientEngine.kt#L116
  • https://github.com/ktorio/ktor/blob/86ca00c0a223e91a4d6001f291b8032ddfb79678/ktor-client/ktor-client-core/common/src/io/ktor/client/engine/HttpClientEngineBase.kt#L18
  • https://github.com/ktorio/ktor/blob/86ca00c0a223e91a4d6001f291b8032ddfb79678/ktor-client/ktor-client-apache/jvm/src/io/ktor/client/engine/apache/ApacheEngine.kt#L22
  • https://github.com/apache/httpcomponents-client/blob/97247f0c8ce248377b1e4a85bba5cff7f11b0fd8/httpclient5/src/main/java/org/apache/hc/client5/http/impl/async/CloseableHttpAsyncClient.java#L57

**bottom line - I don't think it's possible to use this library to create a singleton obj. moreover - I don't think "close()" in the context of this client does what we originally thought i would do. I recommend we continue using the current use of this library **

david-navapbc avatar Jul 25 '24 23:07 david-navapbc

tl'dr

the closest we're going to come to a singleton client object that frees resources after each request is to change the ktor engine we're using from apache4 to apache5 - which is what's implemented.

the nitty gritty

There are two Apache engines available for ktor: Apache and Apache5

https://ktor.io/docs/client-engines.html#apache

The one we were using was Apache which is Apache HttpClient 4.x under the hood. Their docs state that the Apache4 client does NOT de-allocate resources between calls. Mostly this has to do with the ConnectionManager attached to the client that manages the threads and cache for requests. You need to call close() on the client (the Apache client, not ktor) in order for resources to be cleared.

https://hc.apache.org/httpcomponents-client-4.5.x/quickstart.html

https://stackoverflow.com/questions/43454514/proper-usage-of-apache-httpclient-and-when-to-close-it

The apache5 client can be made to automatically free resources if used in the correct manner

https://www.baeldung.com/apache-httpclient-vs-closeablehttpclient#1-automatic-resource-deallocation-httpclient-4x

https://hc.apache.org/httpcomponents-client-5.2.x/current/httpclient5/apidocs/org/apache/hc/client5/http/classic/HttpClient.html#execute-org.apache.hc.core5.http.ClassicHttpRequest-org.apache.hc.core5.http.protocol.HttpContext-org.apache.hc.core5.http.io.HttpClientResponseHandler-

which is what the ktor Apache5 engine appears to be doing.

https://github.com/ktorio/ktor/blob/main/ktor-client/ktor-client-apache5/jvm/src/io/ktor/client/engine/apache5/ApacheHttpRequest.kt#L38C18-L38C25

https://hc.apache.org/httpcomponents-client-5.2.x/current/httpclient5/apidocs/org/apache/hc/client5/http/impl/async/CloseableHttpAsyncClient.html

Lastly, the connection pool in question - as in the main thing we're concerned about resource wise with respect to "closing" is an instance of PoolingAsyncClientConnectionManager

https://github.com/apache/httpcomponents-client/blob/97247f0c8ce248377b1e4a85bba5cff7f11b0fd8/httpclient5/src/main/java/org/apache/hc/client5/http/impl/async/HttpAsyncClientBuilder.java#L793

https://hc.apache.org/httpcomponents-client-5.1.x/current/httpclient5/apidocs/org/apache/hc/client5/http/impl/nio/PoolingAsyncClientConnectionManager.html

To conclude - using Apache5 engine frees resources when we're done using them in a way that using Apache4 engine does not.

That's as close as we're going to get to an efficient use of ktor given that ktor's "close" will nerf the client and force us to create a new one.

david-navapbc avatar Jul 27 '24 21:07 david-navapbc

tl'dr

the closest we're going to come to a singleton client object that frees resources after each request is to change the ktor engine we're using from apache4 to apache5 - which is what's implemented.

the nitty gritty

There are two Apache engines available for ktor: Apache and Apache5

https://ktor.io/docs/client-engines.html#apache

david-navapbc avatar Jul 27 '24 21:07 david-navapbc

wireshark_output.zip

The branch no longer exists on remote so I recreated the work based off of current master.

I then ran server2serverauthtests by way of CLI

./gradlew :prime-router:primeCli --args "test --run server2serverauth"

And monitored traffic from wireshark (see attached)

There are not obvious pre-flight requests in the set (http OPTIONS or HEAD verbs). There do appear to be a few more records in the Apache5 set vs the Apache4. Some of which are calls to /devstoreaccount1/*.

All tests pass locally for both versions.

Continuing to investigate to see if I can figure out why there are more Apache5 engine requests than Apache4...

david-navapbc avatar Aug 16 '24 21:08 david-navapbc

i've proven that the delta in the two different runs - Apache5 client v Apache4 - is not to blame for whatever wonkiness was seen when we attempted to deploy to stage. I observed the same delta in number of captured network calls/responses between runs of master branch.

For example.

one run contained the following sequence:

"33369","12.651348","127.0.0.1","127.0.0.1","HTTP","473","GET /devstoreaccount1/reports?restype=container HTTP/1.1 "
"33371","12.655706","127.0.0.1","127.0.0.1","HTTP","546","HTTP/1.1 200 OK "
"33375","12.689142","127.0.0.1","127.0.0.1","HTTP","2295","PUT /devstoreaccount1/reports/receive%2Fignore.temporary_sender_auth_test%2F8f114285-0ae9-40e5-82d2-0129ec0a8789.csv HTTP/1.1 "
"33377","12.698170","127.0.0.1","127.0.0.1","HTTP","504","HTTP/1.1 201 Created "
"33716","12.832941","127.0.0.1","127.0.0.1","HTTP/JSON","61","HTTP/1.1 201 Created , JSON (application/json)"
"33723","12.836528","127.0.0.1","127.0.0.1","HTTP/JSON","2295","PUT /api/settings/organizations/ignore HTTP/1.1 , JSON (application/json)"

and a subsequent run contained a different sequence as follows -- NOTE THE ERROR

"67765","19.279633","127.0.0.1","127.0.0.1","HTTP","473","GET /devstoreaccount1/reports?restype=container HTTP/1.1 "
"67767","19.282086","127.0.0.1","127.0.0.1","HTTP/XML","629","HTTP/1.1 404 The specified container does not exist. "
"67769","19.320590","127.0.0.1","127.0.0.1","HTTP","473","PUT /devstoreaccount1/reports?restype=container HTTP/1.1 "
"67771","19.329176","127.0.0.1","127.0.0.1","HTTP","428","HTTP/1.1 201 Created "
"67775","19.352653","127.0.0.1","127.0.0.1","HTTP","2347","PUT /devstoreaccount1/reports/receive%2Fignore.temporary_sender_auth_test%2Ff88db672-c107-493f-a0c6-8ed4c9996e1b.csv HTTP/1.1 "
"67777","19.361615","127.0.0.1","127.0.0.1","HTTP","504","HTTP/1.1 201 Created "

I'm going to check in with the team to see if that error is expected. Regardless, I'm now confident the introduction of the Apache5 client did not cause the observed failure during deployment. I think instead we may be looking at some sort of intermittent condition.

david-navapbc avatar Aug 19 '24 19:08 david-navapbc

spent several hours today troubleshooting with @adegolier and @JFisk42 and some folks from devops. don't fully grok what's happening or why - but here's the data

the Apache5 engine is - somehow - the root cause of the problem. When replaced with Apache4 all of the code changes in the PR work for all tests and smoke tests

with the Apache5 engine applied to ktor this call

,"HTTP","345","DELETE /api/settings/organizations/ignore/senders/temporary_sender_auth_test HTTP/1.1 "

causes this response to the client

    ~/nava/prime-reportstream/prime-router  on   platform/dav…client/redux *24  ./prime test --env staging --key [key] --run server2serverAuth                                     ✔  took 2m 4s   at 11:25:49  
Running the following tests, POSTing to https://staging.prime.cdc.gov:
server2serverauth   Smoke Test                  Use server2server (two-legged) auth against various waters endpoints
Running test server2serverauth...
{"message":"HikariPool-1 - idleTimeout is close to or more than maxLifetime, disabling it.","thread":"main","timestamp":"2024-08-20T19:00:05.158Z","level":"WARN","logger":"com.zaxxer.hikari.HikariConfig"}
{"message":"HikariPool-1 - Starting...","thread":"main","timestamp":"2024-08-20T19:00:05.161Z","level":"INFO","logger":"com.zaxxer.hikari.HikariDataSource"}
********************************
Output for test server2serverauth...
Exception: com.github.ajalt.clikt.core.PrintMessage, Error on delete of ignore.temporary_sender_auth_test: response.status: 503 body: Unsupported method
: gov.cdc.prime.router.cli.SettingCommand.abort(SettingCommands.kt:511)
gov.cdc.prime.router.cli.SettingCommand.delete(SettingCommands.kt:182)
gov.cdc.prime.router.cli.tests.Server2ServerAuthTests.deleteSender(AuthTests.kt:563)
gov.cdc.prime.router.cli.tests.Server2ServerAuthTests.doEcAndRsaEcKeyTests(AuthTests.kt:809)
gov.cdc.prime.router.cli.tests.Server2ServerAuthTests.run(AuthTests.kt:574)
gov.cdc.prime.router.cli.tests.TestReportStream.runTests$runTest(TestReportStream.kt:257)
gov.cdc.prime.router.cli.tests.TestReportStream.access$runTests$runTest(TestReportStream.kt:60)
gov.cdc.prime.router.cli.tests.TestReportStream$runTests$2$1.invokeSuspend(TestReportStream.kt:281)
kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:104)
kotlinx.coroutines.EventLoopImplBase.processNextEvent(EventLoop.common.kt:277)
kotlinx.coroutines.BlockingCoroutine.joinBlocking(Builders.kt:95)
kotlinx.coroutines.BuildersKt__BuildersKt.runBlocking(Builders.kt:69)
kotlinx.coroutines.BuildersKt.runBlocking(Unknown Source)
kotlinx.coroutines.BuildersKt__BuildersKt.runBlocking$default(Builders.kt:48)
kotlinx.coroutines.BuildersKt.runBlocking$default(Unknown Source)
gov.cdc.prime.router.cli.tests.TestReportStream.runTests(TestReportStream.kt:280)
gov.cdc.prime.router.cli.tests.TestReportStream.run(TestReportStream.kt:228)
com.github.ajalt.clikt.parsers.Parser.finalizeAndRun(Parser.kt:348)
com.github.ajalt.clikt.parsers.Parser.parse(Parser.kt:218)
com.github.ajalt.clikt.parsers.Parser.parse(Parser.kt:245)
com.github.ajalt.clikt.parsers.Parser.parse(Parser.kt:42)
com.github.ajalt.clikt.core.CliktCommand.parse(CliktCommand.kt:457)
com.github.ajalt.clikt.core.CliktCommand.parse$default(CliktCommand.kt:454)
com.github.ajalt.clikt.core.CliktCommand.main(CliktCommand.kt:474)
com.github.ajalt.clikt.core.CliktCommand.main(CliktCommand.kt:481)
gov.cdc.prime.router.cli.MainKt.main(main.kt:304)
*** Tests FAILED:  server2serverauth ***

here's what's fun about that

  • note how the http status code of 503 does not match the status text of "unsupported method"

  • when we looked at the azure logs we could not find any hint of the call. The reason is the call never made it past Azure Front Door (the gateway between the outside world and azure functions).

  • when we looked at the logs for Front Door, we noticed that the status code it was giving back to the client was httpStatusCode_d: 408 and that the error text was ErrorInfo_s: UnspecifiedClientError -- meaning what FrontDoor was sending back to the client was NOT what the client ultimately reported back to us.

  • as noted in the ticket text there is another HttpClient in the project that most of the Auth tests use for DELETE operations instead of HttpClientUtils.

  • it's only this one DELETE operation - the one that uses HttpClientUtils - that's the problem. Other calls that use HttpClientUtils for GET or POST operations all work.

  • all of this works when run locally. It's only when run against the stage env that shenanigans happen.

My best guess is that something about how ktor is wrapping Apache5 is causing the response to be mangled and perhaps is the root cause of the problem. Regardless - while we've had to revert the Apache5 engine from ktor, the other changes in the PR at least create a single client instance - as is recommended by ktor docs - and thus are retained.

The PR as been posted and perhaps in future we'll revisit whatever is going on here that caused all the weird...

david-navapbc avatar Aug 20 '24 23:08 david-navapbc