Test lists v1.5

Open hellais opened this issue 1 year ago • 2 comments

WIP branch to come up with a nicer data format for the future test lists v2.

hellais, May 07 '24 07:05

For metadata in notes, please see #1723 for what the Thailand list is putting there for the time being.

bact, May 09 '24 04:05

Proposed Metadata - for Discussion/Comments

Some of these come from the discussion at the iMAP/OONI Partner Gathering 2024.

Webpage status

These are characteristics intrinsic to the webpage/URL itself.

  • Page cannot be found (removed by the site owner)
  • Parking
  • Domain no longer registered
  • Date last known updated, as checked by a human

Observation status

Data about the observation activity.

  • Date last checked by a human
    • Can be blank if the URL has never been checked.
    • Can be the same as or different from "Date last known updated, as checked by a human"
    • Given a fresh URL: if on 2024-05-13 a human looks at a blog and finds a latest post from the same day, both "Date last known updated, as checked by a human" and "Date last checked by a human" will be 2024-05-13.
    • Later, on 2024-09-01, a human looks at the same blog and finds no new post. Then "Date last checked by a human" will be 2024-09-01 and "Date last known updated, as checked by a human" will still be 2024-05-13.
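
To make the difference between the two dates concrete, here is a minimal Python sketch of the blog example above. The class name, field names, and `record_check` method are hypothetical and only for illustration; nothing here is part of an agreed spec.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class UrlObservation:
    # Hypothetical field names, not part of any agreed test list spec.
    last_known_updated: Optional[date] = None  # newest content a human has seen
    last_checked: Optional[date] = None        # last human check; None if never checked

    def record_check(self, checked_on: date, newest_content: Optional[date]) -> None:
        """Record a human check. newest_content is the date of the latest post found, if any."""
        self.last_checked = checked_on
        # Only move the "last known updated" date forward, never backward.
        if newest_content is not None and (
            self.last_known_updated is None
            or newest_content > self.last_known_updated
        ):
            self.last_known_updated = newest_content


obs = UrlObservation()
# 2024-05-13: first check, latest post is from the same day.
obs.record_check(date(2024, 5, 13), newest_content=date(2024, 5, 13))
# 2024-09-01: second check, no new post since 2024-05-13.
obs.record_check(date(2024, 9, 1), newest_content=date(2024, 5, 13))
print(obs.last_checked, obs.last_known_updated)  # 2024-09-01 2024-05-13
```

Keeping both values optional matches the point that the checked date can be blank for a URL nobody has looked at yet.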

Webpage category/Additional information

Categories assigned by human judgement, or information that requires knowledge extrinsic to the URL.

  • Remove redundant category_description from the CSV
    • https://github.com/citizenlab/test-lists/issues/582
    • Note: this will break compatibility
      • An alternative is to leave the field blank (,, in CSV)
      • Or repurpose it, for example as a URL description as in #380
  • A way to say URL 1 and URL 2 are the same page or related
    • For example, two domains that are run by the same organization.
    • In addition to canonical URLs
  • Probing frequency tier ("importance")
    • The "removal" of a URL from probing can also be encoded as frequency = 0
    • https://github.com/citizenlab/test-lists/issues/590
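
As a rough sketch of two of the ideas above (a blank category_description and removal encoded as frequency = 0), the following Python reads a CSV and keeps only the URLs that should still be probed. The `probe_frequency` column name, the sample rows, and the `urls_to_probe` helper are assumptions made for illustration; no such column exists in the current lists.

```python
import csv
import io

# Hypothetical sample: the existing columns plus an assumed "probe_frequency"
# column. category_description is left blank (",," in the CSV).
SAMPLE = """url,category_code,category_description,date_added,source,notes,probe_frequency
https://example.org/,NEWS,,2024-05-13,example,,2
https://example.com/,HOST,,2024-05-13,example,,0
"""


def urls_to_probe(csv_text):
    """Return URLs whose assumed probe_frequency is greater than zero."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["url"] for row in reader if int(row["probe_frequency"]) > 0]


active = urls_to_probe(SAMPLE)
print(active)  # ['https://example.org/']
```

Leaving the column in place but blank keeps column positions stable for existing parsers, which is the compatibility concern raised above. A row with frequency 0 stays in the file, so its history and metadata are preserved.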

Note: Categories currently work along at least three independent axes (dimensions), which is why they overlap a lot:

  • Content (topics): Environment, Human Rights Issues, LGBT, Public Health, Sex Education
  • Container (type of media/technology that holds the content): Hosting and Blogging Platforms, Media sharing, File-sharing, Social Networking
  • Creator (type of organization): Government, Intergovernmental Organizations

But rearranging the categories will break the ability to compare with measurements from projects that use v1.0 of the test list spec.
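
One possible way to picture the three axes, purely as an illustration: the Python sketch below groups category names by axis, using only the examples listed in the note. The axis keys, the `AXIS_OF` table, and the `split_by_axis` helper are hypothetical and not a proposed schema.

```python
# Axis assignments copied from the examples in the note above; the axis keys
# and this structure are assumptions for illustration only.
AXIS_OF = {
    "Environment": "content",
    "Human Rights Issues": "content",
    "LGBT": "content",
    "Public Health": "content",
    "Sex Education": "content",
    "Hosting and Blogging Platforms": "container",
    "Media sharing": "container",
    "File-sharing": "container",
    "Social Networking": "container",
    "Government": "creator",
    "Intergovernmental Organizations": "creator",
}


def split_by_axis(categories):
    """Group category names by the axis they describe."""
    result = {"content": [], "container": [], "creator": []}
    for name in categories:
        result[AXIS_OF[name]].append(name)
    return result


# A government page about public health hosted on a social network would
# carry one label on each axis instead of competing for a single category.
labels = split_by_axis(["Public Health", "Social Networking", "Government"])
print(labels)
```

A mapping like this could also be read in the other direction, to translate multi-axis labels back to a single v1.0 category where comparison with older measurements is needed, though how to pick that single category is left open here.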

Use Cases

As discussed, use cases will be very useful here, as they will tell us what kind of metadata to collect or annotate, when, and in which way.

bact, May 13 '24 09:05