test-lists

Add a tags column

Open hellais opened this issue 6 years ago • 5 comments

I am thinking that it would be valuable to have an extra column to annotate properties of the URLs in a machine-readable way.

The use case is, for example, that instead of removing dead URLs that should no longer be tested, we simply add the inactive tag to them. Similarly, URLs generated by automated tooling could be marked as autogenerated, etc.

I think having a generic tags column is sufficiently flexible to support many use-cases.

Thoughts?

Does this create parsing issues for people consuming these test lists?

I believe this would also make it easier to integrate the @berkmancenter effort (see: https://github.com/citizenlab/test-lists/issues/236) cc @sneft @agrabeli @jakubd @rpanah @jdcc

hellais avatar Dec 04 '17 10:12 hellais

This could possibly also solve the problem of: https://github.com/citizenlab/test-lists/issues/22.

We could have specific tags for safety as well.

hellais avatar Dec 04 '17 10:12 hellais

tl;dr I definitely support this.

I know at least for us, unrecognized columns in the CSVs won't break things (or I'll fix them). I think as long as we pick a tag separator that won't trip up common CSV parsers, things will generally work as expected.
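To illustrate the separator point, here is a minimal sketch of how a consumer might read such a column. It assumes a semicolon separator and a trailing tags column; neither has been decided in this thread, and the column names and sample rows are made up for the example.

```python
import csv
import io

# Hypothetical layout: a few existing columns plus a trailing "tags" column.
SAMPLE = """url,category_code,tags
https://example.com/,NEWS,inactive;autogenerated
https://example.org/,POLR,
"""

TAG_SEPARATOR = ";"  # assumption: no separator has been agreed on


def parse_tags(cell):
    """Split a tags cell into a list, dropping empty entries and stray spaces."""
    return [t.strip() for t in cell.split(TAG_SEPARATOR) if t.strip()]


def read_rows(text):
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        # .get() keeps files without a tags column readable.
        row["tags"] = parse_tags(row.get("tags") or "")
        rows.append(row)
    return rows


rows = read_rows(SAMPLE)
```

Because a semicolon is neither the CSV delimiter nor the quote character, the cell needs no quoting, and consumers that ignore the column are unaffected.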

I can envision a number of ways a tagging system would be useful to us specifically, which might in fact be a problem: a lot of folks can develop their own tags with utility only to them. That in itself is not a problem, but GitHub PRs could prove burdensome for maintaining a folksonomy. I've helped build a collaborative tagging platform in the past, and it can get complicated. Some of the complications I see:

  • When will you approve PRs that remove tags?
  • Will you approve all PRs that add tags?
  • How about if the PR adds the safe tag to a URL also tagged as unsafe?

Maybe we don't allow for a fully open taxonomy to prevent some of these issues? If that were the case, we'd need taxonomy docs plus some light process for changing the taxonomy. If we close the taxonomy, we do lose some of the utility of a tagging system - namely tags with single-org relevance. I know that if I wanted that feature back in a closed-taxonomy world, I'd do the same thing we're trying to avoid, which is fork the repo.

In either case, I think we'd need some PR approval guidelines. Maybe we have a mixture model where we reserve some set of tags as a common taxonomy, and then have some simple rule for approval like "You can only mess with tags you added." Enforcing a rule like that manually is hopefully fine, as doing so in code sounds like a huge pain given the CSV format and git line-level diffing. The sound of "building a tagging platform inside CSV cells inside GitHub" is eerily similar to the sound of a thousand expletives screamed down an abandoned well.

We've actually toyed with the idea of URL lists existing simply as a collective tagging exercise within some tagging platform. That's an interesting idea to me, and I'm not totally convinced it's horrible, but that's ten steps too far in this particular case.

jdcc avatar Dec 04 '17 21:12 jdcc

I will answer some of your questions, though I would also invite @sneft and @jakubd to think about this.

When will you approve PRs that remove tags?

I would say that we have not been very good at coming up with policies for accepting or rejecting PRs upfront; we have instead developed them as part of the work and improved on them as we go along.

I would say that it's probably wise to come up with a set of predefined tags and document them in their own CSV, similarly to the category codes.
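As a rough sketch of what that could look like, the snippet below loads a tag registry from its own CSV and reports tags that are not in it. The registry columns (tag, description) and the example tags are assumptions for illustration, not an agreed format.

```python
import csv
import io

# Hypothetical registry file, loosely modeled on how category codes are documented.
TAG_REGISTRY_CSV = """tag,description
inactive,URL is dead and should no longer be tested
autogenerated,URL was added by automated tooling
"""


def load_known_tags(text):
    """Return the set of tag names defined in the registry CSV."""
    return {row["tag"] for row in csv.DictReader(io.StringIO(text))}


def unknown_tags(tags, known):
    """Return the tags that are not defined in the registry, sorted for stable output."""
    return sorted(set(tags) - known)


known = load_known_tags(TAG_REGISTRY_CSV)
unknown = unknown_tags(["inactive", "parked"], known)
```

Here "parked" would be reported as unknown until someone adds it to the registry, which gives a natural place to discuss new tags.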

If we agree this is a good idea, I think we should start adding a tags column and see, as we go forward, what works best and what doesn't.

Will you approve all PRs that add tags?

Again, I think this is something we will have to see as we work on it. I would imagine that some tags are pretty easy to accept (like tags that state a fact), while others may require more discussion because they are more subjective.

How about if the PR adds the safe tag to a URL also tagged as unsafe?

I think we can add some consistency or "logic" tests to the continuous integration already present, so that we don't end up merging inconsistent tags by mistake.
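A check like that could be quite small. The sketch below assumes rows have already been parsed into dicts with a url and a list of tags, and that the conflicting pairs are listed by hand; the names and the pair list are hypothetical.

```python
# Hypothetical pairs of tags that must not appear together on one URL.
CONFLICTING_TAGS = [("safe", "unsafe")]


def find_tag_conflicts(rows):
    """Return (url, tag_a, tag_b) for every row carrying both tags of a conflicting pair."""
    conflicts = []
    for row in rows:
        tags = set(row["tags"])
        for a, b in CONFLICTING_TAGS:
            if a in tags and b in tags:
                conflicts.append((row["url"], a, b))
    return conflicts


rows = [
    {"url": "https://example.com/", "tags": ["safe", "unsafe"]},
    {"url": "https://example.org/", "tags": ["safe"]},
]
conflicts = find_tag_conflicts(rows)
```

A CI job could fail the build whenever the returned list is non-empty and print the offending URLs.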

@jdcc I like the rationale you outline below and I would suggest that, as a first next step, we come up with a list of tags that should be part of the common taxonomy and iterate from there.

Some tickets that may also be relevant to this discussion are:

  • https://github.com/citizenlab/test-lists/issues/21
  • https://github.com/citizenlab/test-lists/issues/22
  • https://github.com/citizenlab/test-lists/issues/23

hellais avatar Feb 03 '18 07:02 hellais

We have a real-world use case here that needs sub-categories or tags for a few categories, for example News Media, so that we can auto-generate reports or dashboards that are more useful for comparisons.

When it comes to News Media, to measure press freedom for example, in some countries like Malaysia we need to distinguish between outlets that are independent, outlets that are politically owned by or connected to the incumbent authoritarian government, and foreign outlets (international/regional).

http://blockornot.today/

To determine these sub-categories or tags, we could, for example, refer to existing global indexes such as the World Press Freedom Index (https://rsf.org/en/ranking) to see what categories they use.

I believe tags should be relatively free, because they are perfect for one-off or specific categorization uses. Sub-categories of categories such as News Media or Political Criticism, however, would probably need to be specific and controlled, so that we can have globally consistent comparisons that are more specific than the current categories.

kaerumy avatar May 04 '18 00:05 kaerumy

In situations where columns are getting added, the effect on people already processing the list should be minimal. Having additional details about URLs would be helpful, though I would worry, as others have mentioned, about the common pitfalls of open tagging systems, namely inconsistencies in the taxonomy that may limit utility. That said, one tag I would love to have is one for parked pages, which would really help analysis and is still hard to check for automatically.

By the way, how are people thinking of implementing this on the CSV side? Just a single field that contains a square-bracketed list of all tags? Quoted, nested comma-separated lists? JSON as a field? Slash-delimited in a field? Multiple fields?

I think when you begin nesting multiple fields in one with CSVs, some of the nastier inconsistencies of CSV parsing libraries may come out.
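To make the trade-off concrete, here is a small sketch comparing two of those options using Python's standard csv module. The URL and tags are made up, and this only shows how one library behaves, not how every consumer's parser will.

```python
import csv
import io
import json

tags = ["inactive", "autogenerated"]


def write_row(cells):
    """Serialize one row with the csv module's default (minimal) quoting."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerow(cells)
    return buf.getvalue()


# Option A: tags joined with a character that CSV does not treat specially.
semicolon_line = write_row(["https://example.com/", ";".join(tags)])

# Option B: tags stored as a JSON array inside one cell.
json_line = write_row(["https://example.com/", json.dumps(tags)])

# A naive consumer that splits on commas instead of using a CSV parser.
naive_semicolon = semicolon_line.strip().split(",")
naive_json = json_line.strip().split(",")

# A real CSV parser recovers the JSON cell intact.
parsed_json_cell = next(csv.reader(io.StringIO(json_line)))[1]
```

The semicolon variant is written without any quoting, so even a naive comma split sees two fields. The JSON variant needs the whole cell quoted with doubled inner quotes; a proper CSV parser round-trips it, but the naive split sees three fields, which is the kind of inconsistency that tends to surface with hand-rolled consumers.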

jakubd avatar Feb 26 '20 16:02 jakubd