opentelemetry-android Http/sURLConnection auto instrumentation

What's covered in this PR

As the Android SDK classes can't be instrumented with ByteBuddy Android Gradle Plugin, this PR instruments all calls to relevant URLConnection/HttpURLConnection/HttpsURLConnection APIs in android applications source code.

APIs Instrumented:

APIs that connect but do not read bytes:

connect()
getOutputStream()

APIs that connect and read bytes:

getContent()
getContent(Class[])
getContentEncoding()
getContentLength()
getContentType()
getDate()
getExpiration()
getLastModified()
getHeaderFieldDate(name, Default)
getHeaderFields()
getHeaderFieldKey(int)
getHeaderField(int)
getHeaderField(name)
getHeaderFieldInt(name, Default)
getResponseMessage()
getResponseCode()
getInputStream()
getErrorStream()
getContentLengthLong() ( added in API level 24)
getHeaderFieldLong() ( added in API level 24)

Approach Taken

Request header can only be added before connecting. So, trace context header should be added before any of the above API calls are made.
All above API calls are replaced with our own replacement methods which do the following:

Add connection to our ongoing URLConnections map if not added already. If it is the first time connecting, trace context is added as a header to the outgoing request
Original API call is made. If method can throw an exception, we wrap the method around a try/catch, so we can close the ongoing spans with exception reporting before re-throwing exception back to client.
We update the last seen timestamp for the connection. If API is amongst the ones that read bytes also, we mark this URLConnection as harvestable.

Harvestable meaning here - if harvestable and if not already reported and if left idle for more than CONNECTION_KEEP_ALIVE_INTERVAL ms, connection is reported by a connection killer thread. So, once connection is made and any data is read from the connection, we mark it as harvestable because we don't anticipate connection to be sitting idle for longer after this.

As noted above, We keep track of all ongoing URLConnections and three other properties for it - last seen time, harvestable (boolean), reported (boolean)

When are spans closed and reported:

If any exception is thrown anywhere, spans are closed and reported.
As a general behavior, either getInputStream or getErrorStream is always called and called in the end. We close spans and report when these two methods are called.
A connection reporter is scheduled which periodically check if any tracked connection, which is marked harvestable and not reported yet, was last seen more than CONNECTION_KEEP_ALIVE_INTERVAL ms before, then it closes the spans and reports the connection. [Currently, couldn't think of a way to proactively schedule this for clients, customers can manually call our API to schedule this runnable for now.]

Sample App to test this PR Added an android sample app with many test scenarios for this PR here.

If you would like to test this PR using the above sample app:

Pull this PR in your local repo.
Update the return value to some dummy value for the API you would like to test in HttpUrlReplacements.java
Update gradle.properties with relevant version number
Use ./gradlew publishToMavenLocal

In sample app:

Update the version of httpURLConnection agent and lib dependencies to the version number you just published
Build and run the app. Click the "ClickMe" button.
See output in logcat.

Oct 30 '23 23:10 surbhiia

Thanks @surbhiia I think it's a good start, although I noticed that there are some comments and todos in the wrapper class that seem like we should discuss further, probably in a call if needed.

I've also created this issue to prevent API level restrictions on this module. I think that if we can verify that no runtime issues would happen when the wrapper class is loaded in envs where some methods aren't available, then it should be fine to ignore API restrictions for these cases.

Nov 02 '23 10:11 LikeTheSalad

I've also created this issue to prevent API level restrictions on this module. I think that if we can verify that no runtime issues would happen when the wrapper class is loaded in envs where some methods aren't available, then it should be fine to ignore API restrictions for these cases.

Thanks @LikeTheSalad! I added those 2 APIs in the wrapper class and instrumented the sample app with that and ran the instrumented app on an API level 21 image and it worked perfectly! No issues at runtime!

Nov 02 '23 22:11 surbhiia

Oops, mistakenly closed, re-opening again!

Nov 02 '23 22:11 surbhiia

Thanks @surbhiia for the contribution! Please add a README.md for the instrumentation. At a minimum it should contain:

Status: Experimental
What a user needs to do to opt into this instrumentation
- Build changes
- Starting the harvester
A very brief description of what this does (bytecode instrumentation wrapping user calls to the http call sites)

Nov 14 '23 23:11 breedx-splk

As discussed, I've created #148 so that we can defer adding test to this (large!) PR. I'll assign it to you @surbhiia if you think you'd like to also take that on after this goes in. 🙏🏻 Thanks.

Nov 14 '23 23:11 breedx-splk

Thanks @breedx-splk! I've added a readme now! Do let me know if anything looks amiss.

Thanks for creating the issue. Do assign it to me. I'll add the tests in next PR. :)

Nov 15 '23 01:11 surbhiia

@surbhiia @LikeTheSalad We chatted about some edge cases, specifically related to use cases where the client just pumps data (like a POST) and doesn't bother to read the response. I have put together a test gist that does exactly this: https://gist.github.com/breedx-splk/246962f7d1c180f924a530630a966c9f

In my testing with that, I run the collector locally and then I used netcat (nc) to see the request payload. I then ran that main method with the java agent attached and observed the following:

No traces sent to the collector (debug logger enabled in the collector)
No input seen by nc.

This indicates to me that the upstream instrumentation is behaving correctly -- it doesn't create a span because the request/response has not taken place. I think the HttpURLConnection "apis" (if you can even call them that!) expect the user to read the response in some way -- it is very very lazy.

If you add int responseCode = conn.getResponseCode(); after the flush() for example, the request is actually sent.

With that in mind, I definitely don't think it's something we should worry about covering this early in the instrumentation development.

Nov 15 '23 21:11 breedx-splk

Thanks for creating the issue. Do assign it to me. I'll add the tests in next PR. :)

😎 @surbhiia You have to comment on the issue before I can assign. Just a simple "assign me please" will do. 👍🏻

Nov 15 '23 22:11 breedx-splk

@surbhiia @LikeTheSalad We chatted about some edge cases, specifically related to use cases where the client just pumps data (like a POST) and doesn't bother to read the response. I have put together a test gist that does exactly this: https://gist.github.com/breedx-splk/246962f7d1c180f924a530630a966c9f ...

@breedx-splk Thanks for testing this! I created this issue about it in that repo yesterday and wanted to test this.

Other than getInputStream/getErrorStream, getResponseCode and getResponseMessage are other two methods generally always called. Java instrumentation is closing spans for the latter two as well, as those methods inherently call getInputStream. We're instrumenting the call sites itself and we've not ended spans for the latter two right now. We can perhaps do that but that isn't a good idea as then we will close early in many cases and miss out on any exceptions occurring after that as usually first getResponseCode is checked and based on that either getInputStream/getErrorStream is called and we want to capture any IOExceptions there. Sort of a trade-off between waiting to close in order to capture any possible exceptions vs closing early in order to have a span ended for more possible scenarios. I still think we should go with the former as we can have the correct telemetry reported.

Right now in the readme we've mentioned scheduling of the harvester as a required configuration. We can perhaps make it optional and call out how it affects the behavior. And in future we can figure out how to schedule that too automatically if possible :)

Nov 15 '23 23:11 surbhiia

We can perhaps do that but that isn't a good idea as then we will close early in many cases and miss out on any exceptions occurring after that

is this any worse than the Java agent instrumentation? (I'm guessing we're probably closing the span early in these cases too?)

Nov 15 '23 23:11 trask

is this any worse than the Java agent instrumentation? (I'm guessing we're probably closing the span early in these cases too?)

@trask No, we're not closing spans early, instead waiting till the end and we're scheduling a thread to periodically look for idle connections and close spans.

Nov 16 '23 00:11 surbhiia

is this any worse than the Java agent instrumentation? (I'm guessing we're probably closing the span early in these cases too?)

I think we're all in agreement here -- we're doing the best we can with a problematic legacy classes. 😅 I don't think this is any worse than what is happening in the agent instrumentation. This is acknowledging that, under some circumstances, we may not correctly close an open context, so there is a periodic harvester for those edge cases.

Nov 16 '23 00:11 breedx-splk

my recommendation would be to match the Java agent instrumentation, which since instrumenting call sites here, would mean making a list of all the the HttpURLConnection methods that end up calling one of connect, getOutputStream, or getInputStream, and ending the span if any of those methods are called

this would eliminate the need for a harvester thread (which we don't have in the Java agent instrumentation)

Nov 17 '23 00:11 trask

end up calling one of connect, getOutputStream, or getInputStream

Actually ending spans at connect might not be a good idea as that's the first function being called from all APIs before anything can be read from/written to the server we're connecting to. Java agent ends only if any throwable occurs or if getInputStream is called. I think getOutputStream can also be added to that to cover the post scenario.

The only difference android agent has right now is that we're not ending spans on getResponseCode, getResponseMessage methods (which inherently call getInputStream). I'm more leaning towards keeping the harvester thread as a recommended configuration, calling out clearly in the readme what's the effect with/without it.

Nov 17 '23 01:11 surbhiia

I'm more leaning towards keeping the harvester thread as a recommended configuration

should we be adding a harvester thread to the Java agent instrumentation too?

Nov 17 '23 02:11 trask

should we be adding a harvester thread to the Java agent instrumentation too?

As HttpURLConnection doesn't have a strict final API that should be called always, java agent might be missing out on ending spans in all possible scenarios. Having a harvester thread would be helpful to collect all these spans too. It could be opt-in / recommended and customers can decide based on how their HttpURLConnection workflows are.

Nov 17 '23 18:11 surbhiia

java agent might be missing out on ending spans in all possible scenarios

I'm still looking for a repro with one of these scenarios. :)

I tried (in the above gist) to make that happen with the POST without reading the response, but it turns out that a span doesn't get created because the POST doesn't even happen unless the response is read. @surbhiia do you know of another possible scenario where this might could happen?

Nov 17 '23 21:11 breedx-splk

@surbhiia do you know of another possible scenario where this might could happen

@breedx-splk Can you please try something like below:

//Example 1: No exception occurs and getInputStream is not called
URL url = new URL("<Your URL>");
URLConnection urlConnection = url.openConnection();
Map<String, List> map = urlConnection.getHeaderFields();
// Or any other similar api’s other than getInputStream/getResponseCode/getResponseMessage that read from connection  like getContentType, getHeaderField(index)

Also, the example you mentioned is a reproduction of the issue right, as no span is created? POST should happen without reading anything, that's strange. Are you setting connection.setDoOutput(true); Also, I think it does a PUT, and to do POST you need to add this too - connection.setRequestMethod("POST"); This should be done before calling getOutputStream

Nov 17 '23 22:11 surbhiia

I tried (in the above gist) to make that happen with the POST without reading the response,

You mean you tried with android codebase or java-agent to reproduce this? (in android codebase we've not marked request to be harvestable (info.harvestable = true) if they just call getOutputStream but I think we should mark it harvestable even after getOutputStream. Just connect should not mark request harvestable. )

Nov 17 '23 23:11 surbhiia

I would love to see the Android and Java agent HttpURLConnection instrumentations produce the same telemetry (even though one is call-site based).

@surbhiia maybe you can remove the harvester thread from the initial Android instrumentation, and we can discuss pros and cons of adding that to both the Java agent and Android instrumentation in a follow-up?

Nov 21 '23 00:11 trask

@surbhiia @breedx-splk Looking at the UrlConnection implementation that is used on Android API 19+, I can confirm that the request body written to the output stream isn't sent until you try to read a response or call connect on it afterwards.

This is bit confusing because getOutputStream() already calls connect() first, but that first call only establishes the connection given there's no request body, and you would have to call it again to actually send the request (or it's called implicitly if you try to get something that requires the response).

Given that, I think the existing Java instrumentation is behaving correctly like Jason said - that a request was actually not sent, so there should be no automatic instrumentation - and I don't think you need the harvester thread.

Nov 21 '23 18:11 bidetofevil

Summarizing where we are at right now based on all discussions so far (here, in sig meeting and slack) - I got a chance to try out java agent instrumentation and can confirm that getOutputStream call doesn't lead to anything being posted to the connected server until any of the read apis are called (This behavior is same in android too). So, there is no need to start or end span for just getOutputStream api call scenario.

Java Agent - I tested out other scenarios too where someone just calls any of the other read apis (getContent, getHeader... etc) and not calls any of these 4 (getInput/ErrorStream, getResponseCode/Message). In java agent, we're ending spans for all of these read APIs also. And java agent doesn't need to do that explicitly, it just defines the span ending logic for getInputStream API and seems like all these other read APIs internally call getInputStream, thereby leading to span ending for all.

Android Agent - As android is instrumenting call sites and not the actual HttpURLConnection API, and we have added the logic to end spans for only getInput/ErrorStream calls, we're not ending spans for all the read apis right now.

The only open question now is : Should android also define the span ending logic for all the read apis. My only concern with that is - we will end up ending spans too early and not capture any IOExceptions that can occur on any other read api calls after this first read api call. Ending at getInput/ErrorStream is a little bit safe as it's generally called in the end. And if getInput/ErrorStream is not called, harvester will be able to end spans.

Nov 22 '23 19:11 surbhiia

we will end up ending spans too early and not capture any IOExceptions that can occur on any other read api calls

Do you know if this is a problem for the Java Agent too?

Ending at getInput/ErrorStream is a little bit safe as it's generally called in the end. And if getInput/ErrorStream is not called, harvester will be able to end spans.

I usually don't work with this tool so I'm not sure how often IOExceptions can happen when using it, if it's a common thing, then I agree we should cover that scenario, not only here but in the Java agent as well, so this really seems like a great topic for the Java SIG.

For now, without knowing whether it happens on the Java Agent too, or if it's a common issue or an edge case, this is what I think that we could do:

Making the harverster solution configurable, whether its default is enabled or disabled I think would depend on what's the current behavior of the Java Agent regarding IOExceptions.
If the Java Agent doesn't handle IOExceptions but they are quite common, then I think we can enable the harvester by default and create an issue for the Java Agent to make it exception-aware too.

The reasons why I think it should be configurable are:

I know some people could be quite reluctant about having too many threads running in their apps. This might no longer be an issue with newer phones though, but I remember it used to be a performance concern some years ago and probably still is for some apps.
The main reason we need a harvesting feature seems to be due to bad usage of the httpurlconnection "API" in the host app (because people wouldn't call getInput/ErrorStream in the end). However, if I were a pro user of this API and I knew I wouldn't make that mistake in my project, then I would prefer to disable the harvester option to reduce the amount of background work going on in my app.

An alternative I can see, to make sure users would always call getInput/ErrorStream in the end, could be to create our own lint rule that would raise a compilation warning whenever it spots that getInput/ErrorStream aren't called, and publish it into this agent's library artifact.

Nov 23 '23 09:11 LikeTheSalad

Do you know if this is a problem for the Java Agent too?

Yes, based on code it should be a problem with java agent too as it closes spans on first read api and no new spans are created on the same connection object afterwards, exception or no exception.

if it's a common thing

I do not know in practice how common these issues are, but some examples of these scenarios could be:

One example of exception occurring after a few read apis have already been successful - Connection being lost in between (which could happen for a lot reasons - timeouts, server temporarily unavailable / overloaded, internet connection lost etc) which would result into an Exception.
One example of getInputStream not being called in the end could be an "OPTIONS" request method. Where you just want to look at headers to figure out what methods the server supports. Another example could be a HEAD request.

If the Java Agent doesn't handle IOExceptions but they are quite common, then I think we can enable the harvester by default and create an issue for the Java Agent to make it exception-aware too.

I think we should by default cover any possible exception scenarios - enable harverster by default. And making it configurable allows those with issues with multiple threads running can disable it at the cost of losing some spans - when they don't call getInput/errorStream and we use the lint rule as you suggested to point out this.

An alternative I can see, to make sure users would always call getInput/ErrorStream in the end, could be to create our own lint rule that would raise a compilation warning whenever it spots that getInput/ErrorStream aren't called, and publish it into this agent's library artifact.

Nov 28 '23 01:11 surbhiia

The only open question now is : Should android also define the span ending logic for all the read apis. My only concern with that is - we will end up ending spans too early and not capture any IOExceptions that can occur on any other read api calls after this first read api call. Ending at getInput/ErrorStream is a little bit safe as it's generally called in the end. And if getInput/ErrorStream is not called, harvester will be able to end spans.

I think whatever we do in Android should be the same as what the Java agent does. If this is an issue there as well, I think we should ping those folks if we can find a repo case that.

I do not know in practice how common these issues are, but some examples of these scenarios could be:

One example of exception occurring after a few read apis have already been successful - Connection being lost in between (which could happen for a lot reasons - timeouts, server temporarily unavailable / overloaded, internet connection lost etc) which would result into an Exception.

I believe that most server/network errors like timeouts occur during when the bytes are streamed across the the wire to populate the input stream, and none of the read APIs will return until this is done. The errors you can from reading the input stream is happening client side, so like corrupt data or data that doesn't fit the stream's expectation (e.g. if you try to push uncompressed data through a GzipInputStream.)

I think we should come up with a repro test case to see if any of read APIs will actually return before the server finishes streaming all the bytes. Because if this is actually a client read error, I think it should be logged apart from the network request span as that has completed already at this point.

One example of getInputStream not being called in the end could be an "OPTIONS" request method. Where you just want to look at headers to figure out what methods the server supports. Another example could be a HEAD request.

Do you think you can reproduce these cases on a test as well? I think it'd be good to have something concrete first before implementing the harvester in case it's not necessary. The Android implementations of Http[s]UrlConnection may surprise you in how they behave, so it would be good if we can see one of these scenarios happen before implementing fixes for them.

Nov 28 '23 16:11 bidetofevil

I've not had time to look into it fully yet. I was able to repro the exception scenario while reading from input stream and server goes down. Next, I will repro the HEAD/OPTIONS scenario where same connection object is used but disconnected and connected again for another full GET request. I will get back on this as soon as possible and update you all about what I find.

Dec 04 '23 20:12 surbhiia

In the spirit of incremental changes and continuous improvement, what if we started with a non-reaper version and assume that the edge-cases are minimal enough for now. If/when the problem is demonstrated we can expand the try/catch/close parts and/or introduce a reaper?

That might get this unblocked and move it forward while keeping closer parity with upstream instrumentation.

Thoughts?

Dec 06 '23 23:12 breedx-splk

I wonder if it's worth the effort to maintain the Http/sURLConnection instrumentation, I mean, the PR is done so LGTM, but I don't see apps using Http/sURLConnection for a long time, it is always OkHttp.

Dec 14 '23 09:12 marandaneto

Sorry about the delays in getting back on this. I was caught up with other work. I finally got a chance to try out possible scenarios where we might get exceptions later in the flow and might want to wait for ending spans and not end them at first read API call.

I tried an found the following:

Calling any read API, gets everything at once from the server so you do not have to go to server again for consequent read APIs ( like getHeader would also get the inputStream object. Note - reading from input stream requires you to go back to the server - covered in point below). So, even if I set caching to false, or shut the server before the consequent read API instruction is called, it doesn't lead to IOException.
IOExceptions do occur when I'm reading from the inputStream. I tried shutting down the server after an initial read API was successful and some inputStream.read was successful but before the entire input stream is read. This does cause exception. I've added a new commit to this PR, to look into inputStream/errorStream read operation as well and capture any exceptions on that due to reasons like "server no longer available" or other connectivity issues. This is also not a case of bad implementation of HttpURL logic in your code but a genuine possible error at run-time.
I tried creating HEAD/OPTIONS request, and then disconnecting and trying to use the same connection object for a GET request. To my surprise, HttpURLConnections handles it well. It doesn't allow same connection object to be re-used. You always have to open new connection for each request (internally it takes care of connection pooling, and you might save resources by using the same connection but the instance of connection in customer code has to be different always).

So, the only scenario to justify not ending span at the first read API is -- IOExceptions during inputStream/errorStream read. And harvester will help close any un-ended spans if getInputStream/getErrorStream is not called at all OR if connection was stuck for too long (>10s) between the last inputStream.read and the next. (Please refer the latest commit where this logic is implemented)

I think if we can answer the following question, we can conclude our discussion - Do we need to capture the above scenario? If not, we do not need the harvester. If we're still divided on whether to capture this scenario or not, we can for now remove the harvester for the first phase and re-visit it later.

Dec 14 '23 18:12 surbhiia

I wonder if it's worth the effort to maintain the Http/sURLConnection instrumentation, I mean, the PR is done so LGTM, > but I don't see apps using Http/sURLConnection for a long time, it is always OkHttp.

I did look into it before starting development on this. A lot of apps still do use Http/sURLConnection (I guess due to it being a part of the Android SDK and light-weight). Amongst our customers, there are more usages of Http/sURLConnection (1st) than Okhttp (2nd highly used client).

Dec 14 '23 18:12 surbhiia