probe-cli icon indicating copy to clipboard operation
probe-cli copied to clipboard

feat: call getaddrinfo directly and extract its return value

Open bassosimone opened this issue 3 years ago • 3 comments

Checklist

  • [x] I have read the contribution guidelines
  • [x] reference issue for this pull request: https://github.com/ooni/probe/issues/2033 and https://github.com/ooni/probe/issues/2029
  • [ ] if you changed anything related how experiments work and you need to reflect these changes in the ooni/spec repository, please link to the related ooni/spec pull request: TODO(@bassosimone): we need to mention the getaddrinfo_retval field

Description

This pull request modifies our "system resolver" implementation as follows. If we're compiled with CGO_ENABLED=0, we keep using the netgo implementation of the system resolver. Otherwise, we'll attempt to link with libc and call getaddrinfo directly rather than using the cgo system resolver implementation as the middle person.

This new design solves two problems:

  1. that, to date, we're not collecting the raw result of getaddrinfo (which is one of the steps of https://github.com/ooni/probe/issues/2033);
  2. that, to date, the cgo implementation does not correctly handle "no answer" on Android (which leads to "unknown failure" and, in turn, to the measurement being marked as failed, as documented in https://github.com/ooni/probe/issues/2029).

Implementation wise, the code I am using here is deeply based on the cgo implementation. (I've been quite precise in giving credit where due and in flagging which functions are under the BSD-3Clause license.)

Here are the key feature highlights of this pull request:

  1. we write platform-specific getaddrinfo handling code so to address Android issues;
  2. the same platform-specific code also wraps getaddrinfo errors so we know the original return value;
  3. we use a simpler model to handle getaddrinfo parallelism compared to the Go standard library (we don't need to be as general as the standard library is, therefore we can use a simpler implementation that requires to maintain less code);
  4. we modify all the code that saves results to include getaddrifo_retval as an observation field.

Now, let's discuss some annoyances of this diff:

  1. we need to modify three places in the codebase to propagate getaddrinfo_retval. The reason why historically we have three places is an ongoing refactoring, which is documented by https://github.com/ooni/probe/issues/2035.
  2. this diff increases the complexity of testing the Android codebase. This occurs because there's now a split between the Android and the Linux implementations. This mainly stems from the fact that EAI_NONAME is only available on GNU/Linux if you -D__GNU_SOURCE. Perhaps, I should just -D__GNU_SOURCE? But, I'm not sure about musl. So, maybe we should consider experimenting a little more here and figuring out if there's a compact way of unifying Linux and Android here?
  3. this diff increases the complexity of the codebase, in that now it's our problem to deal with the differences of each libc when we're calling getaddrinfo (for example, see my doubts about regarding GNU libc versus musl libc)
  4. this diff probably requires us to change the way in which we cross compile miniooni. It now seems a bit annoying that the build we get when cross compiling does not have cgo by default, so we're basically getting a default resolver (the one you get when you're not using cgo, which is called netgo) that does not record the return value of getaddrinfo. But, then, maybe this is just telling us that we should completely bypass Go's resolver also in the CGO_ENABLED=0 case (i.e., mainly cross compiling), by borrowing from the Go stdlib some more code that helps us in reading /etc/resolv.conf?
  5. one may argue that a simpler fix was that of just intercepting the string emitted by Android in that specific case and have a somehow hackish diff to handle that (but I am not super happy about piling hacks over hacks, so I felt it was proper to try and solve the underlying problem rather than continuing to apply band aids)
  6. it may be that we'll be facing increasing issues with i18n domain names and perhaps this is more likely to occur in Windows where I suppose we're using the ASCII version of getaddrinfo - and perhaps this should also be tested?

bassosimone avatar Feb 22 '22 11:02 bassosimone

Thanks for putting this together, it looks great!

In general the approach seems reasonable. Similarly to what you wrote in your comment, I am a bit concerned about the increase in complexity in the codebase and the potential issues that can arise from different platforms and libraries.

It's probably worth spending a bit of time testing this on various platforms and perhaps add some guards to our build tooling so that we don't make builds linking to untested/unsupported libc versions or platforms.

Another issue we should consider is that, if I understand how this works correctly, depending on whether or not we have built with CGO_ENABLED=0 on or not, we are going to be measuring things in a different way (using our cgo inspired getaddrinfo implementation or using netgo). This might present issues when analyzing or interpreting the data.

Do we perhaps want to add some field to the output data format that gives us an indication of which DNS resolution code was used to generate the the metric?

Based on the comment here: https://github.com/ooni/probe/issues/2029#issuecomment-1048748847, I gather that we don't want to anyways ship this in the next release, so it's fine to spend a bit more time testing and discussing improvements to it. Is this correct?

hellais avatar Feb 23 '22 14:02 hellais

Thanks for putting this together, it looks great!

Thank you for your review!

I ended up taking just the netxlite part of the diff, applying changes to improve testing, and merging it as part of https://github.com/ooni/probe-cli/pull/764.

In general the approach seems reasonable. Similarly to what you wrote in your comment, I am a bit concerned about the increase in complexity in the codebase and the potential issues that can arise from different platforms and libraries.

It's probably worth spending a bit of time testing this on various platforms and perhaps add some guards to our build tooling so that we don't make builds linking to untested/unsupported libc versions or platforms.

Yes, this was a good advice. I spent extra time trying to figure out the possible configurations, which led me to improve my analysis of what actually happens and to improve the CI. See:

  • https://github.com/ooni/probe-cli/pull/764
  • https://github.com/ooni/probe/issues/2029#issuecomment-1140258729
  • https://github.com/ooni/probe/issues/2029#issuecomment-1140805266

Another issue we should consider is that, if I understand how this works correctly, depending on whether or not we have built with CGO_ENABLED=0 on or not, we are going to be measuring things in a different way (using our cgo inspired getaddrinfo implementation or using netgo). This might present issues when analyzing or interpreting the data.

Do we perhaps want to add some field to the output data format that gives us an indication of which DNS resolution code was used to generate the the metric?

Yes, good point. So, I actually introduced in

  • https://github.com/ooni/probe-cli/pull/765
  • https://github.com/ooni/probe-cli/pull/766

support for distinguishing which resolver we're using. The approach has been quite simple: if we know we're calling getaddrinfo, continue to name this "system" (as we have been doing since MK). Otherwise, call that "go", which may be "netgo" or actually getaddrinfo depending on the platform and how OONI has been built.

Our aim is to use getaddrinfo (aka "system") whenever possible. To this end, we need to avoid cross compiling, so we can explicitly link with libc. We're currently not cross compiling ooniprobe, but we cross compile miniooni, hence https://github.com/ooni/probe/issues/2119.

Based on the comment here: ooni/probe#2029 (comment), I gather that we don't want to anyways ship this in the next release, so it's fine to spend a bit more time testing and discussing improvements to it. Is this correct?

Yes. For 3.14 and 3.15 we have been using a simplistic patch. Based on my analysis at https://github.com/ooni/probe/issues/2029#issuecomment-1140258729, the patch is not so correct. Android's getaddrinfo is really bad in terms of data quality. Because of this, I proposed https://github.com/ooni/probe/issues/2120.

BTW, this pull request now only contains a small diff that collects getaddrinfo's return value, since the underlying mechanism for calling getaddrinfo has been merged in https://github.com/ooni/probe-cli/pull/764. I would like to keep this PR open and work towards merging the remainder of the diff $soon.

bassosimone avatar May 30 '22 08:05 bassosimone

Convering the PR to draft since the code that remains to be merged in this diff is still a bit of a draft.

bassosimone avatar May 30 '22 08:05 bassosimone