
test-lists: Create script to automatically update URLs that support HTTPS to HTTPS

Open agrabeli opened this issue 2 years ago • 3 comments

As part of https://github.com/ooni/ooni.org/issues/363, it would be great if we could write a script that automatically updates URLs (in the Citizen Lab test lists) that support HTTPS to HTTPS.

This will significantly simplify the test list review process for researchers, and it will boost OONI measurement quality.
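A minimal sketch of what such a script could look like, assuming the URLs have already been read from the test list files. The `supports_https` callable is a hypothetical probe supplied by the caller (for example, a function that fetches the https:// URL and checks the response); it is not part of any existing OONI or Citizen Lab tooling.

```python
from typing import Callable, Iterable, List
from urllib.parse import urlsplit, urlunsplit


def upgrade_to_https(url: str) -> str:
    """Return the URL with its scheme switched from http to https.

    URLs that are not http://, or that carry an explicit port, are
    returned unchanged (an explicit port rarely maps to the same
    port number over TLS).
    """
    parts = urlsplit(url)
    if parts.scheme != "http" or parts.port is not None:
        return url
    return urlunsplit(parts._replace(scheme="https"))


def upgrade_urls(
    urls: Iterable[str], supports_https: Callable[[str], bool]
) -> List[str]:
    """Upgrade each http:// URL whose https:// variant is reported as working.

    `supports_https` is a hypothetical, caller-supplied check.
    """
    result = []
    for url in urls:
        candidate = upgrade_to_https(url)
        if candidate != url and supports_https(candidate):
            result.append(candidate)
        else:
            result.append(url)
    return result
```

Keeping the network check behind a callable makes the rewriting logic testable offline, and leaves open how "supports HTTPS" is actually decided.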

agrabeli avatar Sep 09 '22 10:09 agrabeli

We should not do this or, at least, we should not do this $soon. The policy that the issue is proposing is grounded in years of experience with Web Connectivity v0.4 and previous versions implemented in Measurement Kit. Under these implementations, we were not testing https://example.com/robots.txt if the URL was http://example.com/robots.txt.

(Incidentally, the ooni/probe-legacy implementation did not have this limitation. We introduced this limitation because we didn't want the UI of the mobile app to show two distinct measurements for each included input.)

However, with version v0.5 of Web Connectivity we're going to test the HTTPS version of a URL if the URL is HTTP, therefore one of the main reasons for upgrading vanishes. Many HTTP URLs in the test lists were added because there was suspicion that the HTTP version of the website could have been blocked. So, with v0.5 we respect the original intent by checking for HTTP while at the same time also checking for HTTPS.

My suggestion would be to wait until the roll out of v0.5, then reconsider whether to apply this change in light of whether the new implementation improves the current situation in terms of detecting censorship.

bassosimone avatar Sep 16 '22 10:09 bassosimone

@bassosimone I think there is another reason why this update is important – possible duplicates across the lists. For example, there is a website on http in the global list, and someone adds the same domain but with https to the country-specific list. As far as I know, no conflict will be reported for such an update unless we search for the domain name manually in the global list.

sloncocs avatar Sep 16 '22 10:09 sloncocs

@sloncocs:

@bassosimone I think there is another reason why this update is important – possible duplicates across the lists. For example, there is a website on http in the global list, and someone adds the same domain but with https to the country-specific list. As far as I know, no conflict will be reported for such an update unless we search for the domain name manually in the global list.

Yeah, not having this capability inside the scripts that run automatically when one submits a new URL significantly slows down our ability to accept quick contributions from the community. I agree that this is not desirable.

I think it should be possible to write automatic checks for ensuring the domain is not duplicated regardless of the URL scheme. At the same time, it's probably not trivial, because the script needs to build a database of the existing "state" of the test lists. Then, it needs to ask the question whether adding the URL changes the state in terms of tested domains.
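A rough sketch of such a check, assuming the lists are already loaded as a mapping from list name to URLs. The function names are made up for illustration, and the "state" here is simply the set of hostnames already covered. The hostname is compared exactly, so `www.example.com` and `example.com` count as different domains.

```python
from typing import Dict, Iterable, Optional
from urllib.parse import urlsplit


def domain_key(url: str) -> str:
    """Return the lowercased hostname, ignoring scheme, port, and path."""
    host = urlsplit(url).hostname
    if host is None:
        raise ValueError(f"URL has no hostname: {url!r}")
    return host


def build_state(lists: Dict[str, Iterable[str]]) -> Dict[str, str]:
    """Map each tested domain to 'list_name: url' for its first occurrence."""
    state: Dict[str, str] = {}
    for list_name, urls in lists.items():
        for url in urls:
            state.setdefault(domain_key(url), f"{list_name}: {url}")
    return state


def find_duplicate(state: Dict[str, str], new_url: str) -> Optional[str]:
    """Return the existing entry for the new URL's domain, or None."""
    return state.get(domain_key(new_url))
```

For example, with http://example.com/ in the global list, `find_duplicate` would flag a submission of https://example.com/ to a country list even though the scheme differs.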

bassosimone avatar Sep 16 '22 11:09 bassosimone

One idea that was floated around during the team meeting was that we could always treat a https:// input as a http:// one, which would result in http and https being measured. If that is the case, the next step is to open a backend issue to do that.

Alternatively we could update the test lists in the other direction (i.e. update all https:// URLs to http://).

Some things to consider in doing this are:

  1. What happens if the site doesn't support http, but only supports https? In this case we may want some automatic way of detecting that, and of encoding this information somewhere so that the backend knows what to do.

  2. How does this impact the coverage of measurements? Since the coverage stats are computed on a per-input basis, if we update a bunch of URLs we might have a hard time linking related measurements back together.
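To make the two considerations concrete, here is an illustrative sketch, not the backend's actual behavior. `inputs_to_measure` expands one input into the http and https variants unless a hypothetical `https_only` flag is set, and `coverage_key` drops the scheme (and port) so that related inputs can be grouped when computing coverage.

```python
from collections import defaultdict
from typing import Dict, Iterable, List
from urllib.parse import urlsplit, urlunsplit


def inputs_to_measure(url: str, https_only: bool = False) -> List[str]:
    """Expand an input into the variants to measure.

    `https_only` is a hypothetical marker for sites known not to serve http.
    """
    parts = urlsplit(url)
    https_url = urlunsplit(parts._replace(scheme="https"))
    if https_only:
        return [https_url]
    return [urlunsplit(parts._replace(scheme="http")), https_url]


def coverage_key(url: str) -> str:
    """Build a scheme-independent key from hostname, path, and query."""
    parts = urlsplit(url)
    key = (parts.hostname or "") + (parts.path or "/")
    if parts.query:
        key += "?" + parts.query
    return key


def group_inputs(inputs: Iterable[str]) -> Dict[str, List[str]]:
    """Group inputs that differ only by scheme under one coverage key."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for url in inputs:
        groups[coverage_key(url)].append(url)
    return dict(groups)
```

Grouping by a scheme-independent key would let per-input stats for http://example.com/ and https://example.com/ be linked after the lists are updated in either direction.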

hellais avatar Nov 07 '22 11:11 hellais

This has been done as part of: https://github.com/citizenlab/test-lists/pull/1247.

I suggest we document follow up work as new issues.

hellais avatar Mar 19 '23 16:03 hellais