
Improve the blocking detection heuristics

Open hellais opened this issue 2 years ago • 6 comments

This is about making improvements to our data processing pipeline to:

  • Automatically detect blockserver IP addresses
  • Detect Cloudflare captcha pages
  • Detect server-side blocking

hellais avatar Aug 26 '21 16:08 hellais

Based on my observations of Censored Planet data, 99% of unexpected results for HTTPS are status mismatch, TCP reset, or timeout. That number is 94% for HTTP.

Those 3 signals will go a long way, though the distribution of errors is very skewed, and in some countries it's important to go further.

I believe a good heuristic that would cover a number of the remaining cases is to compare the page title in the body. OONI control could be smarter: request the page and pass an HTTP language header to the server that matches the user's language.

One big pitfall is trying to be too smart in the heuristics. It's far better to expose the signals to the user and let them decide.

For example, instead of outputting a smart "anomaly" flag, OONI could output "outcome" strings like: http/tcp_reset, http/status_mismatch:451, http/title_mismatch:Just a moment... (that's the Cloudflare DDoS protection) or http/title_mismatch:불법·유해정보사이트에 대한 차단 안내 (South Korea block page).
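The outcome-string idea above can be sketched as a small classifier. This is a hypothetical illustration, not OONI code: the input fields (failure, status, title) and the error-string values are assumptions; only the outcome labels come from the examples in this comment.

```python
# Map raw HTTP signals to a single hierarchical outcome string
# instead of a boolean "anomaly" flag. Illustrative sketch only.

def http_outcome(failure, status, expected_status, title, expected_title):
    """Return an outcome string like 'http/status_mismatch:451'."""
    if failure == "connection_reset":
        return "http/tcp_reset"
    if failure == "generic_timeout_error":
        return "http/timeout"
    if status != expected_status:
        return f"http/status_mismatch:{status}"
    if title != expected_title:
        return f"http/title_mismatch:{title}"
    return "http/ok"
```

A user (or the Explorer UI) could then aggregate and filter on these strings directly, rather than trusting an opaque verdict.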

It would be amazing to see that in the API or Explorer: you could quickly see whether the measurements are consistent. Exposing the signals will let the user figure out what's going on far better than an extra-smart heuristic. I never trust a "blocked" signal unless it comes with an explanation of why it thinks it's blocked; and then the explanation is what matters.

fortuna avatar Aug 31 '21 22:08 fortuna

The same logic applies to DNS. The DNS error and whether the IPs are global go a long way. After you compare against the control IPs, only a small number of cases remain. You can probably cover the rest with TLS tests or an IP whois lookup. You can then expose outcomes like dns/nxdomain, dns/local_ip, dns/ip_org_mismatch. You can explain the successes too: dns/ok_matches_control_ip, dns/ok_matches_control_org, dns/ok_valid_tls_cert.
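The DNS side can be sketched the same way. A minimal, hypothetical classifier (the error string and input shape are assumptions; the outcome labels are the ones listed above) might look like:

```python
import ipaddress

# Classify a DNS result into an outcome string. Illustrative sketch:
# non-global (private/loopback) answers suggest a local blockserver,
# and matching a control IP explains the success.

def dns_outcome(error, answers, control_ips):
    if error == "dns_nxdomain_error":
        return "dns/nxdomain"
    for ip in answers:
        if ipaddress.ip_address(ip).is_private:
            return "dns/local_ip"          # e.g. 127.0.0.1 blockserver
    if set(answers) & set(control_ips):
        return "dns/ok_matches_control_ip"
    return "dns/ip_mismatch"               # candidate for whois/TLS follow-up
```

The final `dns/ip_mismatch` bucket is where the remaining whois-organization or TLS-certificate checks would apply.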

fortuna avatar Aug 31 '21 22:08 fortuna

What we really need is to rethink data analysis. It's about much more than blocking detection. We need to create a process to characterize censorship, and tools to support that process.

The output of an algorithm won't do it. My worry about too much focus on an infallible algorithm is that it takes away from thinking about the process. Perhaps a good start is to document the steps you, @agrabeli and others take to analyze censorship, and also collect the signals that make you confident about your reports. That will better inform what you need to expose in the data and tools.

As a concrete illustration, see my Cuba DNS Dashboard. There's no smart blocking signal there. It's only collecting and exposing what has been observed. To characterize censorship is a process to slice and navigate the data in different ways until you understand and are confident.

Aggregation, visualization and navigation > smart classification.

I believe the Explorer could already meet many of those needs if we could:

  • Expose more details of what is observed
  • Give the option to group by ASN
  • Show aggregated stats on what was observed on the side

fortuna avatar Sep 01 '21 15:09 fortuna

One new finding from looking at the CP data: the Server header is extremely useful in detecting CDNs and some other hosting providers:

$ curl -s -D - https://storify.com | grep -i -e "Server"
Server: AmazonS3

$ curl -s -D - https://www.zendesk.com/ | grep -i -e "Server"
server: cloudflare

We could have a list of patterns for "trusted hosts", meaning the response came from a host we know and was not injected. I can imagine us reporting successes with an outcome string like http/ok/trusted_host/cloudflare.

This assumes the censor will never inject a Server header impersonating a well-known hosting provider, which I think is reasonable. Even if that happens, we will be able to see it in the outcome string. This is not an issue with HTTPS, since content can't be injected there.

Note that this check should only apply to the first redirection, since an injected response may be a redirect to a blockpage hosted on S3 or some CDN.
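Putting the two rules together, a sketch of the trusted-host check might look like this. The pattern list and input shape are illustrative assumptions; the key point from above is that only the first response in the redirect chain is inspected.

```python
# Match the Server header of the FIRST response in the redirect chain
# against known hosting providers. Later hops are ignored, because an
# injected response may itself redirect to a blockpage hosted on a CDN.

TRUSTED_SERVER_PATTERNS = {
    "cloudflare": "cloudflare",
    "amazon_s3": "amazons3",
}

def trusted_host_outcome(responses):
    """responses: list of dicts with a 'headers' mapping, in redirect order."""
    if not responses:
        return None
    server = responses[0].get("headers", {}).get("Server", "").lower()
    for name, pattern in TRUSTED_SERVER_PATTERNS.items():
        if pattern in server:
            return f"http/ok/trusted_host/{name}"
    return None
```

Note that a chain whose first hop is untrusted yields no trusted-host outcome even if a later hop is served by a known CDN.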

This should address the Cloudflare issue, since the DDoS protection page has the Server: cloudflare header (example).

Having said all of that, there's one caveat. I remember seeing a case, I think in Iran, of apparent injection over HTTPS. The injection was happening between the CDN edge node and the origin, because the origin connection was plain HTTP and the edge node was in the country, so the block page was served by the CDN itself. Blockpage tests should therefore run before the trusted-server test.

fortuna avatar Sep 10 '21 17:09 fortuna

Thanks for providing such detailed feedback on this issue! I am going to address some of your comments and questions below:

OONI control could be smarter to request a page and pass a HTTP language header to the server that matches the user's language.

The test helper used by web_connectivity and the test helper that is going to be used by the new web_steps test take this into account. Specifically, the client sends the test helper the full list of headers used in the request to the tested website (including the Accept-Language header) so that both requests can be made consistently.

Unfortunately this is sometimes not enough, as the behaviour of certain websites is also influenced by the IP address that is reaching it.

For example, instead of outputting a smart "anomaly" flag, OONI could output "outcome" strings like: http/tcp_reset, http/status_mismatch:451, http/title_mismatch:Just a moment... (that's the Cloudflare DDoS protection) or http/title_mismatch:불법·유해정보사이트에 대한 차단 안내 (South Korea block page).

This is a very good suggestion. In part we already do this by exposing the blocking key (one of dns, http-failure, tcp_ip, http-diff) inside the scores key of the API response. I do think we should further enrich the semantics of these keys to include more detailed breakdowns of the errors (for example, distinguishing between connection_reset, connection_timeout, and connection_closed errors) and also support aggregation on these fields. Part of this is being taken into account in the design of the new web_steps test, which is going to include many more possible outcomes. I also think it's a great idea to do some feature extraction to include more context on what exactly differed in the outcome.

I have often found myself doing this semi-manually: downloading the raw measurements and extracting the title tag to check for a server-side blockpage, or taking the IP returned in the DNS response and matching it against known DNS blockservers (or performing a whois lookup on it to see whether it belongs to a known CDN).
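The title-extraction step described above can be automated with a few lines of stdlib Python. This is a minimal sketch for illustration, not the pipeline's actual code:

```python
from html.parser import HTMLParser

# Pull the <title> out of a raw response body so it can be compared
# against known blockpage titles. Minimal sketch; real bodies may
# need charset handling and more defensive parsing.

class TitleExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

def extract_title(body):
    parser = TitleExtractor()
    parser.feed(body)
    return parser.title.strip()
```

The extracted title could then feed directly into an outcome string such as http/title_mismatch:Just a moment...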

To characterize censorship is a process to slice and navigate the data in different ways until you understand and are confident.

I think this is a really great way to frame it, and it's quite aligned with the vision and goals we have in mind for the next generation of tooling to be created in OONI Explorer, such as the MAT.


One aspect that hasn't been mentioned, but I think is also worth considering, is that the current data structures used for representing results from web_connectivity tests present challenges for this sort of feature extraction. One reason is that we group multiple requests inside a single measurement, since the test is configured to follow redirects. In light of this, if we were to do, say, feature extraction on the Server header, we could do it for the last URL in the redirect chain but would potentially miss the first one. This issue is solved in web_steps, as each measurement always covers just one URL irrespective of redirects. The problem is still present, though, for DNS, where you can have multiple answers for a given domain name. This highlights the need to think of ways of representing measurements that go beyond the one-to-one mapping between a measurement and a single database row.
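To make the representation point concrete, here is a hypothetical per-URL record (not the actual OONI schema): with one record per hop, the first response's Server header is directly addressable instead of being buried inside a nested list of requests.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Illustrative per-URL measurement record, in the spirit of web_steps:
# one record per URL in the redirect chain rather than one blob
# containing the whole chain.

@dataclass
class URLMeasurement:
    url: str
    server_header: Optional[str] = None
    dns_answers: List[str] = field(default_factory=list)

def first_hop_server(chain):
    # Feature extraction on the first hop becomes a trivial lookup.
    return chain[0].server_header if chain else None
```

The DNS case would still need a one-to-many representation, since a single record can carry several answers for the same domain.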

hellais avatar Nov 24 '21 16:11 hellais

We should unpack this issue into more detailed sub-tasks.

hellais avatar Nov 26 '21 16:11 hellais