test-lists
Extra consistency checks on category codes and URLs
When using the lists in OONI, we have run into the following problems with some country lists:
- The same URL appears in different country-specific lists with different category codes
  ex.
  id.csv:http://denypagetests.netsweeper.com,NEWS,News Media,2014-04-15,citizenlab,
  kw.csv:http://denypagetests.netsweeper.com,CTRL,Control content,2014-04-15,citizenlab,
- The same URL is present in both the global list and a country-specific list
  ex.
  global.csv:http://www.crazyshit.com,PORN,Pornography,2014-04-15,citizenlab,Updated by OONI on 2017-02-14
  sg.csv:http://www.crazyshit.com,NEWS,News Media,2014-04-15,citizenlab,
We should add checks to the lint-lists.py script that verify that:
- Category codes for the same URL are consistent across lists
- A URL present in the global list is not also present in a country-specific list
On this second point I would like to hear from @sneft and others whether this is reasonable or whether it's maybe just an OONI-specific usage of the lists.
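To make the two proposed checks concrete, here is a rough sketch of how they might look. It is not the actual lint-lists.py code: the function names are made up for illustration, and it assumes each list is a CSV whose first two columns are the URL and the category code, with a header row starting with "url".

```python
import csv
import io
from collections import defaultdict


def read_rows(list_name, csv_text):
    """Yield (list_name, url, category_code) for each data row of one list."""
    for row in csv.reader(io.StringIO(csv_text)):
        # Skip blank lines and the header row (assumed to start with "url").
        if len(row) < 2 or row[0] == "url":
            continue
        yield list_name, row[0], row[1]


def find_category_conflicts(rows):
    """Check 1: URLs that carry more than one category code across lists."""
    codes = defaultdict(dict)  # url -> {list_name: category_code}
    for list_name, url, code in rows:
        codes[url][list_name] = code
    return {
        url: by_list
        for url, by_list in codes.items()
        if len(set(by_list.values())) > 1
    }


def find_global_duplicates(rows, global_name="global.csv"):
    """Check 2: URLs in the global list that also appear in a country list."""
    rows = list(rows)
    global_urls = {url for name, url, _ in rows if name == global_name}
    return sorted(
        (url, name)
        for name, url, _ in rows
        if name != global_name and url in global_urls
    )


if __name__ == "__main__":
    # The first two rows are the id.csv/kw.csv example from this issue;
    # the example.org rows are invented purely to exercise check 2.
    rows = []
    rows += read_rows("id.csv", "http://denypagetests.netsweeper.com,NEWS,News Media,2014-04-15,citizenlab,\n")
    rows += read_rows("kw.csv", "http://denypagetests.netsweeper.com,CTRL,Control content,2014-04-15,citizenlab,\n")
    rows += read_rows("global.csv", "http://example.org,NEWS,News Media,2014-04-15,citizenlab,\n")
    rows += read_rows("sg.csv", "http://example.org,NEWS,News Media,2014-04-15,citizenlab,\n")
    print(find_category_conflicts(rows))
    print(find_global_duplicates(rows))
```

In a real linter the rows would come from reading every file in the lists directory, and a non-empty result from either function would fail the lint run. Comparing URLs as exact strings is a simplification; whether trailing slashes or http/https variants should count as the same URL is a separate decision.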
Check №2 is a bit tricky for cis.csv. Should cis.csv be treated in the same way as global.csv for the corresponding countries? What definition of CIS should it use? E.g. should it include Georgia?
At OONI we don't actually use cis.csv at all, and that list has not been updated in a long while. I would go as far as suggesting we remove it or move it to another directory.
For point 1, I have no doubt there are a number of these inconsistencies. We tried to fix them as we encountered them but have never made a systematic effort to clean them up.
For point 2, I agree that if a URL is present on the global list it should not be on a local list. Our old testing system flagged any attempt to upload a local list containing a URL already in the global list. Our logic was that the global and local lists are meant to be run as a single unit, so we wanted to avoid duplication. I know we had some cases where this was inconvenient (e.g. wanting to test a very narrow sample in a bandwidth-limited place), but to my knowledge OONI is flexible enough to accommodate custom lists for special circumstances.
(This requirement does add a small burden of labour on list compilers, as in my experience the average person compiling a local list will reasonably (and often appropriately) add certain URLs that are duplicated in the global list. Perhaps this is just a matter of good documentation and instructions to list compilers.)