Define how to track multi-page documents
Context and Problem Statement
Some documents are divided into several sub-documents across many web pages. For example, the Community Guidelines for Twitter or Facebook are divided, whereas those for TikTok are written in one document. Currently, multi-page documents are not tracked.
Solutions considered
Option 1: Create a document type for each sub-document
For example, in the Twitter.json declaration file:
{
"name": "Twitter",
"documents": {
…
"Community Guidelines - Hateful conduct policy": {
"fetch": "https://help.twitter.com/en/rules-and-policies/hateful-conduct-policy",
"select": [ "#twtr-main" ],
"filters": "removeReturnToTopButton"
},
"Community Guidelines - Violent and Graphic Content": {
"fetch": "https://help.twitter.com/en/rules-and-policies/violent-groups",
"select": [ "#twtr-main" ],
"filters": "removeReturnToTopButton"
}
}
}
Resulting file structure:
TikTok/
├─ Privacy Policy.md
├─ Terms of Service.md
├─ Community Guidelines.md
Twitter/
├─ Privacy Policy.md
├─ Terms of Service.md
├─ Community Guidelines - Hateful conduct policy.md
├─ Community Guidelines - Violent and Graphic Content.md
Implications:
- New document types have to be defined
- A convention on how to handle undivided documents has to be defined
Pros:
- No new major concepts
- Already available, no archivist update needed
Cons:
- Multiplies document types
- Looks like a workaround
- May lead to inconsistency if some contributors do not follow the convention on how to handle an undivided document. See remaining questions.
Option 2: Concatenate all sub-documents in one document
For example, in the Twitter.json declaration file:
{
"name": "Twitter",
"documents": {
…
"Community Guidelines": [
{
"fetch": "https://help.twitter.com/en/rules-and-policies/hateful-conduct-policy",
"select": [ "#twtr-main" ],
"filters": "removeReturnToTopButton"
},
{
"fetch": "https://help.twitter.com/en/rules-and-policies/violent-groups",
"select": [ "#twtr-main" ],
"filters": "removeReturnToTopButton"
}
]
}
}
Resulting file structure:
TikTok/
├─ Privacy Policy.md
├─ Terms of Service.md
├─ Community Guidelines.md
Twitter/
├─ Privacy Policy.md
├─ Terms of Service.md
├─ Community Guidelines.md
Pros:
- No new major concepts
- Simplifies document comparison across different services, as there is only one document
Cons:
- Breaks the invariant of one snapshot for one version
- Generates a version of a document that does not really exist
- May lead to inconsistency, as contributors will have to arbitrarily choose the order of sub-documents
Option 3: Allow sub-documents to be defined in one document as sub-document type
{
"name": "Twitter",
"documents": {
…
"Community Guidelines": {
"Hateful conduct policy": {
"fetch": "https://help.twitter.com/en/rules-and-policies/hateful-conduct-policy",
"select": [ "#twtr-main" ],
"filters": "removeReturnToTopButton"
},
"Violent and Graphic Content": {
"fetch": "https://help.twitter.com/en/rules-and-policies/glorification-of-violence",
"select": [ "#twtr-main" ],
"filters": "removeReturnToTopButton"
}
}
}
}
Resulting file structure:
TikTok/
├─ Privacy Policy.md
├─ Terms of Service.md
├─ Community Guidelines.md
Twitter/
├─ Privacy Policy.md
├─ Terms of Service.md
├─ Community Guidelines/
│ ├─ Hateful conduct policy.md
│ ├─ Violent and Graphic Content.md
Implications:
- Need to define the sub-document type concept and see what it can imply globally
- Need to define allowed sub-document types for each document type
Pros:
- Relatively straightforward concept
Cons:
- New concept increases complexity for new contributors
- Contributors may be tempted to split unified documents into several sub-documents
Option 4: Introduce the concept of sections
{
"name": "Twitter",
"documents": {
…
"Community Guidelines": {
"sections": {
"Hateful conduct policy": {
"fetch": "https://help.twitter.com/en/rules-and-policies/hateful-conduct-policy",
"select": [ "#twtr-main" ],
"filters": "removeReturnToTopButton"
},
"Violent and Graphic Content": {
"fetch": "https://help.twitter.com/en/rules-and-policies/glorification-of-violence",
"select": [ "#twtr-main" ],
"filters": "removeReturnToTopButton"
}
}
}
}
}
Or factorized version:
{
"name": "Twitter",
"documents": {
…
"Community Guidelines": {
"select": [ "#twtr-main" ],
"filters": "removeReturnToTopButton",
"sections": {
"Hateful conduct policy": {
"fetch": "https://help.twitter.com/en/rules-and-policies/hateful-conduct-policy"
},
"Violent and Graphic Content": {
"fetch": "https://help.twitter.com/en/rules-and-policies/glorification-of-violence"
}
}
}
}
}
Resulting file structure:
TikTok/
├─ Privacy Policy.md
├─ Terms of Service.md
├─ Community Guidelines.md
Twitter/
├─ Privacy Policy.md
├─ Terms of Service.md
├─ Community Guidelines/
│ ├─ Hateful conduct policy.md
│ ├─ Violent and Graphic Content.md
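As a rough illustration of the factorized form, here is a minimal Python sketch of how keys declared next to "sections" could be inherited by each section and mapped to the file structure shown above. The helper name `expand_sections` and the override rule (section keys win over document-level keys) are assumptions made for this example, not the engine's actual behaviour.

```python
from copy import deepcopy


def expand_sections(document_type, declaration):
    """Return {relative_path: resolved_declaration} for one document entry.

    Hypothetical helper: keys declared next to "sections" (here "select" and
    "filters") are treated as defaults that each section may override.
    """
    sections = declaration.get("sections")
    if not sections:
        # Single-page document: one file, declaration used as-is.
        return {f"{document_type}.md": deepcopy(declaration)}

    defaults = {key: value for key, value in declaration.items() if key != "sections"}
    resolved = {}
    for section_name, section in sections.items():
        # Section-level keys take precedence over document-level defaults.
        resolved[f"{document_type}/{section_name}.md"] = {**deepcopy(defaults), **section}
    return resolved


twitter_guidelines = {
    "select": ["#twtr-main"],
    "filters": "removeReturnToTopButton",
    "sections": {
        "Hateful conduct policy": {
            "fetch": "https://help.twitter.com/en/rules-and-policies/hateful-conduct-policy"
        },
        "Violent and Graphic Content": {
            "fetch": "https://help.twitter.com/en/rules-and-policies/glorification-of-violence"
        },
    },
}

for path, resolved in expand_sections("Community Guidelines", twitter_guidelines).items():
    print(path, resolved["fetch"], resolved["select"])
```

With this rule, a section that needs a different selector would simply declare its own "select" key.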
Implications:
- Need to define the section concept and see what it can imply globally
Pros:
- Section concept can be used in many other documents, not just those divided into several sub-documents
- Can increase metadata tracked by OTA
Cons:
- New concept increases complexity for new contributors
Remaining questions:
- For options 1, 3 and 4, about consistency:
Do already unified documents have to be split into several documents for consistency? (Option B)
Which resulting file structure is expected:
Option A:
TikTok/
├─ Privacy Policy.md
├─ Terms of Service.md
├─ Community Guidelines.md
Twitter/
├─ Privacy Policy.md
├─ Terms of Service.md
├─ Community Guidelines/
│ ├─ Hateful conduct policy.md
│ ├─ Violent and Graphic Content.md
Option B:
TikTok/
├─ Privacy Policy.md
├─ Terms of Service.md
├─ Community Guidelines/
│ ├─ Hateful conduct policy.md
│ ├─ Violent and Graphic Content.md
Twitter/
├─ Privacy Policy.md
├─ Terms of Service.md
├─ Community Guidelines/
│ ├─ Hateful conduct policy.md
│ ├─ Violent and Graphic Content.md
- For option 2, about snapshots:
How should snapshots be stored? With a suffix in their filename (like $documentType-part-1.html, $documentType-part-2.html, …)?
Which snapshot ID is used as the reference for the related version?
How do we store the snapshot IDs used as reference in the git version commit?
Possible solutions:
Start tracking Community Guidelines/Hateful conduct policy
This version was recorded after filtering snapshots with Mongo IDs:
- $id1
- $id2
- $id3
Start tracking Community Guidelines/Hateful conduct policy
This version was recorded after filtering snapshots:
- https://github.com/OpenTermsArchive/snapshots-dating/commit/$id1
- https://github.com/OpenTermsArchive/snapshots-dating/commit/$id2
- https://github.com/OpenTermsArchive/snapshots-dating/commit/$id3
- For option 3, about nesting:
Which nesting level is allowed?
- For options 3 and 4, about storage:
How do we store a sub-document in a git commit?
Start tracking Community Guidelines/Hateful conduct policy
This version was recorded after filtering snapshot with Mongo $id
How do we store a section in a git commit?
Start tracking Community Guidelines#Hateful conduct policy
This version was recorded after filtering snapshot with Mongo $id
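The commit message formats proposed in the questions above can be sketched as a small Python function. This is only an illustration: the function name, its parameters, and the blank line between subject and body (the usual git convention) are assumptions, and `$id`-style values stand in for real snapshot IDs.

```python
def version_commit_message(document_type, snapshot_ids, sub_document=None, separator="/"):
    """Build a version commit message referencing one or more snapshots.

    Assumed convention for this sketch: "/" separates a sub-document
    (option 3) and "#" separates a section (option 4) from its parent type.
    """
    if not snapshot_ids:
        raise ValueError("at least one snapshot ID is required")

    target = document_type
    if sub_document is not None:
        target = f"{document_type}{separator}{sub_document}"
    subject = f"Start tracking {target}"

    if len(snapshot_ids) == 1:
        body = f"This version was recorded after filtering snapshot with Mongo {snapshot_ids[0]}"
    else:
        # Several snapshots contributed to one version (option 2).
        id_lines = "\n".join(f"- {snapshot_id}" for snapshot_id in snapshot_ids)
        body = f"This version was recorded after filtering snapshots with Mongo IDs:\n{id_lines}"

    return f"{subject}\n\n{body}"


print(version_commit_message("Community Guidelines", ["$id"], "Hateful conduct policy", "#"))
print(version_commit_message("Community Guidelines", ["$id1", "$id2", "$id3"]))
```

Keeping the separator as a parameter makes the difference between the sub-document and section proposals explicit without duplicating the rest of the message.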
Questions to bear in mind when choosing an appropriate solution:
- What does each solution involve when adding a document?
- What does each solution involve for document maintenance?
- What does each solution imply for the history system, dataset generation and rewriting process?
Some thoughts
- After discussion, it seems that option 2 can be abandoned mainly because it generates a document that does not really exist.
- Options 3 and 4 seem very similar, and it may appear that section and sub-document are different terms for the same underlying concept. In fact, they imply quite different things. The concept of a sub-document type is similar to the existing document type; it only adds nesting. Sub-document types could therefore be defined and centralized for the document types where it makes sense, and used sparingly. This solution implies no arbitrary choice from contributors. The concept of section, by contrast, is more flexible: sections could be chosen arbitrarily by contributors and used in all document types without a centralized definition. Even if the allowed sections for a document type were defined and centralized to avoid inconsistency between documents, the concept itself suggests a more open usage.
- In the long term, it seems that option 3 and option 4 will coexist as they bring different elements. But in the short term, it seems that option 3 is the most appropriate to the problem from a conceptual point of view.
Thanks for this very detailed explanation.
I believe the sections AND sub-document types must be centralized, as otherwise it may result in unusable datasets.
So sub-documents would have to be defined relative to their parent document type.
In that sense, I do not see that much difference between options 3 and 4 anymore, but would rather go for the option 4 syntax, which permits factorizing select and filters.
Apart from that, I think that kind of analysis should reside in the discussion section https://github.com/ambanum/OpenTermsArchive/discussions
And two issues should be created afterwards:
- create a decision record
- implement the chosen solution
For option 4, I suggest an update to the factorised version as I find it more understandable:
{
"name": "Twitter",
"documents": {
…
"Community Guidelines": {
"sections": {
"select": [ "#twtr-main" ],
"filters": "removeReturnToTopButton",
"Hateful conduct policy": {
"fetch": "https://help.twitter.com/en/rules-and-policies/hateful-conduct-policy"
},
"Violent and Graphic Content": {
"fetch": "https://help.twitter.com/en/rules-and-policies/glorification-of-violence"
}
}
}
}
}
We have strong time pressure to support multi-page documents for Community Guidelines in the context of the French presidential election.
A first analysis of the ability to align Community Guidelines subdocuments is not very conclusive: shared types would cover between 60% (Twitter) and 100% (TikTok, LinkedIn) of Community Guidelines subdocuments, with 80% for YouTube, Instagram and Facebook. Option 1 would mean losing the non-covered ones; option 2 would mean creating a non-existing, virtual document; options 3 and 4 would mean opening up divergence for documents and making them incomparable. It seems impossible to decide what is most appropriate for Open Terms Archive at this stage.
Thus, we'll use real options to try out both options 1 and 4 in parallel, as they seem to be the most sustainable and the most divergent. If we have enough time, we'll also try option 3.
This means 2 (or 3) instances from experimental feature branches will run in parallel on a dedicated server. We will track documents this way and feed the results to analysts. We will conclude on the effectiveness and relevance of each option at the end of April.
I will share here data on Community Guidelines alignment this week.
As discussed with @clementbiron, we should also track the index page of Community Guidelines as it may contain important content.
It will therefore have an impact on each option as follows:
Option 1:
{
"name": "Twitter",
"documents": {
…
"Community Guidelines": {
"fetch": "https://help.twitter.com/en/rules-and-policies",
"select": [ "#twtr-main" ]
},
"Community Guidelines - Hateful conduct policy": {
"fetch": "https://help.twitter.com/en/rules-and-policies/hateful-conduct-policy",
"select": [ "#twtr-main" ],
"filters": "removeReturnToTopButton"
},
"Community Guidelines - Violent and Graphic Content": {
"fetch": "https://help.twitter.com/en/rules-and-policies/violent-groups",
"select": [ "#twtr-main" ],
"filters": "removeReturnToTopButton"
}
}
}
Resulting file structure:
TikTok/
├─ Privacy Policy.md
├─ Terms of Service.md
├─ Community Guidelines.md
Twitter/
├─ Privacy Policy.md
├─ Terms of Service.md
├─ Community Guidelines.md
├─ Community Guidelines - Hateful conduct policy.md
├─ Community Guidelines - Violent and Graphic Content.md
Option 2: Not relevant
Option 3:
{
"name": "Twitter",
"documents": {
…
"Community Guidelines": {
"fetch": "https://help.twitter.com/en/rules-and-policies",
"select": "#main",
"Hateful conduct policy": {
"fetch": "https://help.twitter.com/en/rules-and-policies/hateful-conduct-policy",
"select": [ "#twtr-main" ],
"filters": "removeReturnToTopButton"
},
"Violent and Graphic Content": {
"fetch": "https://help.twitter.com/en/rules-and-policies/glorification-of-violence",
"select": [ "#twtr-main" ],
"filters": "removeReturnToTopButton"
}
}
}
}
Resulting file structure:
TikTok/
├─ Privacy Policy.md
├─ Terms of Service.md
├─ Community Guidelines.md
Twitter/
├─ Privacy Policy.md
├─ Terms of Service.md
├─ Community Guidelines.md
├─ Community Guidelines/
│ ├─ Hateful conduct policy.md
│ ├─ Violent and Graphic Content.md
Option 4:
{
"name": "Twitter",
"documents": {
"Privacy Policy": {
"fetch": "https://twitter.com/en/privacy",
"select": ["main"]
},
"Community Guidelines": {
"fetch": "https://help.twitter.com/en/rules-and-policies",
"select": "#main",
"sections": {
"select": [ "#twtr-main" ],
"filters": "removeReturnToTopButton",
"Hateful conduct policy": {
"fetch": "https://help.twitter.com/en/rules-and-policies/hateful-conduct-policy"
},
"Violent and Graphic Content": {
"fetch": "https://help.twitter.com/en/rules-and-policies/glorification-of-violence"
}
}
}
}
}
Resulting file structure:
TikTok/
├─ Privacy Policy.md
├─ Terms of Service.md
├─ Community Guidelines.md
Twitter/
├─ Privacy Policy.md
├─ Terms of Service.md
├─ Community Guidelines.md
├─ Community Guidelines/
│ ├─ Hateful conduct policy.md
│ ├─ Violent and Graphic Content.md
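To make the effect of tracking the index page concrete, here is a hedged Python sketch that derives the version files for the Option 4 declaration above. It assumes that "select" and "filters" are reserved names when they appear inside "sections", which is one possible rule for telling shared keys apart from section names; the helper name and the TikTok URL are hypothetical.

```python
# Assumption: keys shared by all sections are reserved names, so they can sit
# next to section names inside "sections" without being mistaken for one.
SHARED_KEYS = {"select", "filters"}


def list_version_files(document_type, declaration):
    """List the version files a document declaration would produce."""
    files = []
    if "fetch" in declaration:
        # The document itself (for example the index page) is tracked.
        files.append(f"{document_type}.md")
    for name in declaration.get("sections", {}):
        if name not in SHARED_KEYS:
            files.append(f"{document_type}/{name}.md")
    return files


twitter_guidelines = {
    "fetch": "https://help.twitter.com/en/rules-and-policies",
    "select": "#main",
    "sections": {
        "select": ["#twtr-main"],
        "filters": "removeReturnToTopButton",
        "Hateful conduct policy": {
            "fetch": "https://help.twitter.com/en/rules-and-policies/hateful-conduct-policy"
        },
        "Violent and Graphic Content": {
            "fetch": "https://help.twitter.com/en/rules-and-policies/glorification-of-violence"
        },
    },
}

# Hypothetical single-page declaration; the URL is a placeholder.
tiktok_guidelines = {"fetch": "https://example.org/community-guidelines"}

print(list_version_files("Community Guidelines", twitter_guidelines))
print(list_version_files("Community Guidelines", tiktok_guidelines))
```

Mixing shared keys and section names in the same object means no section could ever be called "select" or "filters", which is a trade-off of this syntax compared with declaring shared keys next to "sections".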
Community guidelines ontology
@Ndpnt collected the titles of the Community Guidelines subtypes of Facebook, Instagram, YouTube, Twitter, LinkedIn and TikTok. I then tried to align each of these subtypes and to give a more generic name that would be fit for Open Terms Archive.
Interesting points
- Facebook and Instagram share the exact same subtypes.
- LinkedIn and TikTok have only one document with subtitles; it makes sense to use them for sections, but not for separate documents.
- Twitter has significantly more entries than all others.
Aligned
When lines have an empty header, this means there is ambiguity as to which platform document should get this type.
I believe that each “subtype” should be prefixed with Community Guidelines —.
| Open Terms Archive Candidate Subtype | Facebook & Instagram | YouTube | Twitter | LinkedIn | TikTok |
|---|---|---|---|---|---|
| Self-harm | Safety - Suicide and Self-Injury | Sensitive content - Suicide and self-harm | Safety and cybercrime - Suicide and Self-harm Policy | Do not share harmful or shocking material | Suicide, self-harm, and disordered eating |
| Hate Speech | Objectionable Content - Hate Speech | Violent or dangerous content - Hate speech | Safety and cybercrime - Hateful conduct policy | Do not be hateful | Hateful behavior |
| Child Sexual Exploitation | Safety - Child Sexual Exploitation, Abuse and Nudity | Sensitive content - Child safety | Safety and cybercrime - Child sexual exploitation policy | | Minor Safety |
| Violence Incitement | Violence And Criminal Behavior - Violence and Incitement | Violent or dangerous content - Harmful or dangerous content | Safety and cybercrime - Glorification of violence policy | Do not threaten, incite, or promote violence | Dangerous acts and challenges |
| | Objectionable Content - Violent and Graphic Content | Violent or dangerous content - Violent or graphic content | Safety and cybercrime - Sensitive media policy | | Violent and graphic content |
| Violent Organizations | Violence And Criminal Behavior - Dangerous Individuals and Organizations | Violent or dangerous content - Violent criminal organizations | Safety and cybercrime - Violent organizations policy | Do not post terrorist content or promote terrorism | Violent extremism |
| | Violence And Criminal Behavior - Coordinating Harm and Promoting Crime | | | | |
| Spam | Integrity And Authenticity - Spam | Spam & deceptive practices - Spam, deceptive practices & scams | Platform integrity and authenticity - Platform manipulation and spam policy | Do not engage in spam or scam | Integrity and authenticity |
| | Violence And Criminal Behavior - Fraud and Deception | | Platform integrity and authenticity - Financial scam policy | | |
| Regulated Goods | Violence And Criminal Behavior - Restricted Goods and Services | Regulated goods - Sale of illegal or regulated goods or services | Safety and cybercrime - Illegal or certain regulated goods or services | | Illegal activities and regulated goods |
| Harassment | Safety - Bullying and Harassment | Violent or dangerous content - Harassment and cyberbullying | Safety and cybercrime - Abusive behavior | Do not harass or bully | Bullying and harassment |
| | | | Platform integrity and authenticity - Coordinated harmful activity | | |
| | | Regulated goods - Firearms | | | |
| Misinformation | Integrity And Authenticity - Misinformation | Misinformation - Misinformation | | Do not share false or misleading content | |
| | | Misinformation - Elections misinformation | | | |
| | | Misinformation - COVID-19 medical misinformation | Platform integrity and authenticity - COVID-19 misleading information policy | | |
| | | Misinformation - Vaccine misinformation | | | |
| Intellectual Property | Respecting Intellectual Property - Intellectual Property | | Intellectual property - Copyright policy | Respect the intellectual property of others and do not violate the intellectual property rights of others | Copyright and trademark infringement |
| | | | Intellectual property - Counterfeit policy | | |
| | | | Intellectual property - Trademark policy | | |
| | | | Intellectual property - Automated copyright claims for live video | | |
| Adult Nudity | Objectionable Content - Adult Nudity and Sexual Activity | Sensitive content - Nudity and sexual content | | | Adult nudity and sexual activities |
| Sexual Solicitation | Objectionable Content - Sexual Solicitation | | | Do not engage in unwanted advances | |
| Inauthentic Behaviour / Platform Manipulation | Integrity And Authenticity - Inauthentic Behavior | Spam & deceptive practices - Fake engagement | Platform integrity and authenticity - Platform manipulation and spam policy | Interference with LinkedIn | Platform security |
| Privacy Violations | Safety - Privacy Violations | | Safety and cybercrime - Private information policy | Respect others' privacy | |
| | Integrity And Authenticity - Account Integrity and Authentic Identity | | Platform integrity and authenticity - Impersonation policy | Do not create a fake profile or falsify information about yourself | |
| | Safety - Adult Sexual Exploitation | | Safety and cybercrime - Non-consensual nudity policy | | |
| Reach Amplification | | | Platform Use Guidelines - About specific instances when a Tweet’s reach may be limited | | Ineligible for the For You Feed |
| Overview | | | General - The Twitter Rules | | |
Scraping | Unauthorized access and use | ||||
| Terms Updates | | | Platform Use Guidelines - Updates to our Terms of Service and Privacy Policy | | |
| Deceased Users | | | General - Deceased individuals | | |
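The alignment above amounts to a lookup from a platform-specific title to a shared candidate subtype. A minimal Python sketch, using only two rows and two platforms from the table, could look like this. The names `ALIGNMENT` and `document_type_for` are hypothetical, and the hyphenated prefix follows the names used in the Option 1 declarations.

```python
PREFIX = "Community Guidelines - "

# Two rows copied from the alignment table; the real mapping would be larger.
ALIGNMENT = {
    "Self-harm": {
        "Facebook & Instagram": "Safety - Suicide and Self-Injury",
        "YouTube": "Sensitive content - Suicide and self-harm",
    },
    "Hate Speech": {
        "Facebook & Instagram": "Objectionable Content - Hate Speech",
        "YouTube": "Violent or dangerous content - Hate speech",
    },
}


def document_type_for(platform, title):
    """Return the shared document type for a platform title, or None if unaligned."""
    for subtype, titles in ALIGNMENT.items():
        if titles.get(platform) == title:
            return PREFIX + subtype
    return None


print(document_type_for("YouTube", "Violent or dangerous content - Hate speech"))
print(document_type_for("YouTube", "Sensitive content - Thumbnails"))
```

Titles that return `None` correspond to the unclassified and platform-specific documents, which is where the options differ in what happens to them.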
Unclassified
I did not manage to align these documents. They should either be read in full to understand where they could fit, or be left out.
| Facebook & Instagram | YouTube | Twitter |
|---|---|---|
| Safety - Human Exploitation | Sensitive content - Vulgar language | Platform integrity and authenticity - Distribution of hacked materials policy |
| Integrity And Authenticity - Cybersecurity | Spam & deceptive practices - Impersonation | Platform integrity and authenticity - Ban evasion policy |
| Content-Related Requests And Decisions - User Requests | Spam & deceptive practices - External links | Platform integrity and authenticity - Parody, newsfeed, commentary, and fan account policy |
| Content-Related Requests And Decisions - Additional Protection of Minors | Spam & deceptive practices - Additional policies | Platform integrity and authenticity - Civic integrity policy |
| Integrity And Authenticity - Memorialization | | Platform integrity and authenticity - Synthetic and manipulated media policy |
| | | General - Username squatting policy |
| | | Safety and cybercrime - Violent threats policy |
Platform specific
These documents depend on platform features and have no reason to be tracked with a shared name.
With option 1, they would be dropped.
| YouTube | Twitter |
|---|---|
| Sensitive content - Thumbnails | Platform Use Guidelines - Twitter Moments guidelines and principles |
| Spam & deceptive practices - Playlists | Platform Use Guidelines - Notices on Twitter and what they mean |
| | Platform Use Guidelines - Curation style guide |
| | Platform Use Guidelines - Super Follows policy |
| | Platform Use Guidelines - Ticketed Spaces policy |
Country specific
Twitter has a document named “Platform Use Guidelines - Reporting false information in France”.
On top of all of the above documents, Twitter goes really deep in specification and also adds some usage guidance that could be considered as parts of a manual.
- Platform Use Guidelines - Report violations
- Platform Use Guidelines - Our range of enforcement options
- Platform Use Guidelines - Fair use policy
- Platform Use Guidelines - Content Monetization Standards
- Platform Use Guidelines - Guidelines for Promotions on Twitter
- Platform Use Guidelines - About search rules and restrictions
- Platform Use Guidelines - Twitter, our services, and corporate affiliates
- Platform Use Guidelines - How to report security vulnerabilities
- Platform Use Guidelines - About Twitter limits
- Platform Use Guidelines - Defending and respecting the rights of people using our service
- Platform Use Guidelines - About rules and best practices with account behaviors
- Platform Use Guidelines - About Twitter’s APIs
- Platform Use Guidelines - About government and state-affiliated media account labels on Twitter
- Platform Use Guidelines - Automation rules
- Platform Use Guidelines - Inactive account policy
- Platform Use Guidelines - About country withheld content
- Platform Use Guidelines - About public-interest exceptions on Twitter
- Platform Use Guidelines - Additional information about data processing
- Platform Use Guidelines - Our approach to policy development and enforcement philosophy
The above table has been implemented in #778.
Great to see these detailed discussions! With a lot of these things we have struggled at www.pga.hiig.de as well – and only responded with manual curation. I am curious to look at the results of the test runs – where do you stand currently with regard to the decision? I must confess that conceptually I am much more inclined to go for option 4 (or 3) than option 1. As part of our work at the PGA we have seen how all major platforms have evolved their community guidelines from single-page documents into these nested websites of explanation. So I'd very much argue that this is "one thing" but that it has gotten much more complex over the years. And the wording and categorization are also changing over time. So this will remain a challenge – but much better to keep this under the umbrella of "community guidelines" than to have 10-20 separate document types that change names every other year and also re-integrate, bifurcate etc. This space is very much in flux.
Thanks @ckatzenbach! This idea that this space will keep on evolving is very relevant indeed. Even if we succeeded in creating an ontology for the current document set, we would have to assess the chance that it would be stable over time.
> where do you stand currently with regard to the decision?
As mentioned in https://github.com/ambanum/OpenTermsArchive/issues/773#issuecomment-1066641451, we are collecting data and feedback and intend to conclude on the effectiveness and relevance of each option at the end of April 🙂
Hi everyone, I'm Adrian and I worked with @ckatzenbach on the (historical) collection of these multi-page documents for the Platform Governance Archive. As he said, we ran into some of the exact same issues and questions during our collection process so it is very interesting to read your discussion here! Maybe it is helpful for you to hear about our experience and the solution that we ended up with.
The first realization that we had when investigating the historical evolution of platform policies and collecting the documents is that what you called "the ontology of the Community Guidelines" is sometimes not as straightforward as one might expect.
In the case of Facebook, it is still relatively clear from my perspective. Here, the Community Guidelines (initially called 'Content Code of Conduct' then 'Facebook Community Standards') evolved from a document that was completely displayed on one URL into first an interactive document with drop-down sections and then a multi-page document. But even today as it is spread across many different URLs, I think it is very clear that Facebook considers all of these subsites as part of one document: The Facebook Community Standards.
Grouping these subsites into one document, from my perspective, does therefore not mean creating an artificial document. Much rather, the historical evolution of Facebook's Community Standards shows that the multi-page format should not be seen as a splitting up of the Community Guidelines into many sub-parts but much rather the contemporary form of displaying the document and making it easier to navigate for users. From my perspective, it therefore makes sense to puzzle the different Community Standards back together into one document because that's what they are from Facebook's/the user perspective and because that creates a document that can be compared to other platforms' Community Guidelines.
Now in the case of Twitter, it is a bit harder to define what actually constitutes their Community Guidelines: Do they, as you suggest, encompass all of the 75 subsites currently linked on their "Rules and policies" overview page (https://help.twitter.com/en/rules-and-policies#general)? Or should they rather be understood as "The Twitter Rules" page (https://help.twitter.com/en/rules-and-policies/twitter-rules) and the 18 selected sub policies that are linked there? (This is the option that we went for).
These two options by themselves actually raised an ontological question for the collection: Are the Community Guidelines what platforms define as their Community Guidelines, or do they actually encompass all of the platforms' rules that regulate their community in some way? That would mean that rules or policies that are spelled out in sections of a site which are not part of the officially defined "Community Guidelines", for instance on help pages - as very often happens - would also form part of a platform's Community Guidelines. For reasons of feasibility and practicability, we opted for taking the platforms' own definition of their "Community Guidelines" as the reference point for our collection.
In the case of Twitter, this meant considering "The Twitter Rules" page as their Community Guidelines, because this is what the company has generally and historically considered as their Community Guidelines. Our team member João can explain this decision in more detail because he went deep into the history of the Twitter Rules. Another argument for this approach would be that, as @MattiSG also noted, Twitter's "Rules and Policies" page includes many usage guidance/information pages such as "Updates to our Terms of Service and Privacy Policy" which are probably better classified as help pages than as rules/policies for the community.
For Twitter, we hence decided to collect "The Twitter Rules", meaning that we collected the main page and the first sublevel of the policies that are linked on this page (if I understood it correctly this is what you referred to as nesting level). In practice this meant that we first had to create a timeline which denotes when subpolicies became part of or were removed from the Twitter Rules page. It is important to note that some of these subpages existed before they became part of the Twitter Rules or continue to exist after they are removed from the index pages. I guess as a general takeaway this means that taking an index page as the starting point for the collection entails monitoring when sublinks appear on or are removed from this page. This is due to the fact that sections are sometimes merged or added to/removed from the master document.
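Monitoring when sublinks appear on or disappear from an index page can be sketched with the Python standard library alone. The example below compares the links found in two snapshots of an index page; the HTML snippets are toy data, the function names are hypothetical, and a real page would likely need a selector to ignore navigation and footer links.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs of all <a href="…"> elements in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.add(urljoin(self.base_url, href))


def extract_links(html, base_url):
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


def diff_sublinks(previous_html, current_html, base_url):
    """Report links added to or removed from the index page between snapshots."""
    previous = extract_links(previous_html, base_url)
    current = extract_links(current_html, base_url)
    return {"added": sorted(current - previous), "removed": sorted(previous - current)}


INDEX_URL = "https://help.twitter.com/en/rules-and-policies/twitter-rules"

# Toy snapshots of the index page, for illustration only.
previous_snapshot = """
<a href="/en/rules-and-policies/hateful-conduct-policy">Hateful conduct</a>
<a href="/en/rules-and-policies/violent-groups">Violent groups</a>
"""
current_snapshot = """
<a href="/en/rules-and-policies/hateful-conduct-policy">Hateful conduct</a>
<a href="/en/rules-and-policies/glorification-of-violence">Glorification of violence</a>
"""

print(diff_sublinks(previous_snapshot, current_snapshot, INDEX_URL))
```

A non-empty "added" or "removed" list would signal that the declaration needs a maintenance pass, whichever option is chosen.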
I have to admit that I did not understand all of the technicalities of your discussion above regarding the difference between option 3 and 4, so I cannot say how all of this influences your decision or speaks in favor of one or the other option. Generally however I would say that:
- Compiling subsections from different URLs into one document does not necessarily create an artificial document
- In terms of document maintenance, defining an index page as a starting point would entail automatically or manually monitoring when new sublinks/URLs are added to or removed from the overarching page
- I find your grouping of the subsections very impressive and interesting for the comparison of specific Community Guideline sections but it does in my eyes not erase the meaningfulness of also having one compiled version of all rules
- I agree that treating all subsections as their own document as in option 1 probably leads to a level of complexity in which it is hard to keep an overview
I'm very sorry for the length of this post and hope this is in any way helpful for your decision! It's actually quite helpful for us to spell our procedure out again in this discussion :)
Comparison of implemented options 1 and 4
As announced, we compared the results of running side-by-side implementations of options 1 and 4 for 7 weeks. Here are our results and observations 🙂
Common observations
- Most Community Guidelines documents could be tracked within a fixed type ontology, and detecting those changes did yield value to analysts.
- Open Terms Archive scaled well with no other modification than additional document types.
- Mass changes triggered notifications across many documents, leading to spam, as when Twitter mangled URLs (see https://github.com/OpenTermsArchive/france-elections-versions/commit/3e472cd, https://github.com/OpenTermsArchive/france-elections-versions/commit/d31174e and 5 other documents). This can happen with any other set of documents from the same service, but is made worse with the given implementation since the number of documents is significantly larger.
- Listing sections of documents risks pushing contributors towards wanting to list sections for arbitrary document types, which is not supported.
Declarations
For Facebook, the resulting declaration was 101 lines for option 1 vs 83 lines for option 4. You can find them in full below.
Option 1
{
"name": "Facebook",
"documents": {
"Privacy Policy": {
"fetch": "https://fr-fr.facebook.com/privacy/explanation/",
"filter": [
"removeEmptyAnchorsLinks",
"removeTrackingIDs",
"removeLocaleFromUrls"
],
"select": ["div[role=\"main\"]"],
"remove": ["._5tko"],
"executeClientScripts": true
},
"Terms of Service": {
"fetch": "https://fr-fr.facebook.com/legal/terms/plain_text_terms",
"select": ["div[role=\"main\"]"],
"remove": ["footer[role=\"contentinfo\"]"],
"filter": [
"removeEmptyAnchorsLinks",
"removeTrackingIDs",
"removeLocaleFromUrls"
],
"executeClientScripts": true
},
"Community Guidelines": {
"fetch": "https://transparency.fb.com/fr-fr/policies/community-standards",
"select": ["._9ntw"],
"remove": ["._9nxl", "._9ntv", ".img"]
},
"Community Guidelines - Self-harm": {
"fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/suicide-self-injury/",
"select": ["._9nrm", "._9p7c"],
"remove": ["._9p72", "svg", "._9ooi", "._9q3_"]
},
"Community Guidelines - Hate Speech": {
"fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/hate-speech/",
"select": ["._9nrm", "._9p7c"],
"remove": ["._9p72", "svg", "._9ooi", "._9q3_"]
},
"Community Guidelines - Child Exploitation": {
"fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/child-sexual-exploitation-abuse-nudity/",
"select": ["._9nrm", "._9p7c"],
"remove": ["._9p72", "svg", "._9ooi", "._9q3_"]
},
"Community Guidelines - Violence Incitement": {
"fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/violence-incitement/",
"select": ["._9nrm", "._9p7c"],
"remove": ["._9p72", "svg", "._9ooi", "._9q3_"]
},
"Community Guidelines - Violent Organizations": {
"fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/dangerous-individuals-organizations/",
"select": ["._9nrm", "._9p7c"],
"remove": ["._9p72", "svg", "._9ooi", "._9q3_"]
},
"Community Guidelines - Spam": {
"fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/spam/",
"select": ["._9nrm", "._9p7c"],
"remove": ["._9p72", "svg", "._9ooi", "._9q3_"]
},
"Community Guidelines - Regulated Goods": {
"fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/regulated-goods/",
"select": ["._9nrm", "._9p7c"],
"remove": ["._9p72", "svg", "._9ooi", "._9q3_"]
},
"Community Guidelines - Harassment": {
"fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/bullying-harassment/",
"select": ["._9nrm", "._9p7c"],
"remove": ["._9p72", "svg", "._9ooi", "._9q3_"]
},
"Community Guidelines - Misinformation": {
"fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/misinformation/",
"select": ["._9nrm", "._9p7c"],
"remove": ["._9p72", "svg", "._9ooi", "._9q3_"]
},
"Community Guidelines - Intellectual Property": {
"fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/intellectual-property/",
"select": ["._9nrm", "._9p7c"],
"remove": ["._9p72", "svg", "._9ooi", "._9q3_"]
},
"Community Guidelines - Adult Nudity": {
"fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/adult-nudity-sexual-activity/",
"select": ["._9nrm", "._9p7c"],
"remove": ["._9p72", "svg", "._9ooi", "._9q3_"]
},
"Community Guidelines - Sexual Solicitation": {
"fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/sexual-solicitation/",
"select": ["._9nrm", "._9p7c"],
"remove": ["._9p72", "svg", "._9ooi", "._9q3_"]
},
"Community Guidelines - Platform Manipulation": {
"fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/inauthentic-behavior/",
"select": ["._9nrm", "._9p7c"],
"remove": ["._9p72", "svg", "._9ooi", "._9q3_"]
},
"Community Guidelines - Privacy Violations": {
"fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/privacy-violations-image-privacy-rights/",
"select": ["._9nrm", "._9p7c"],
"remove": ["._9p72", "svg", "._9ooi", "._9q3_"]
},
"Deceased Users": {
"fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/memorialization/",
"select": ["._9nrm", "._9p7c"],
"remove": ["._9p72", "svg", "._9ooi", "._9q3_"]
}
}
}
Option 4
{
"name": "Facebook",
"documents": {
"Privacy Policy": {
"fetch": "https://fr-fr.facebook.com/privacy/explanation/",
"filter": [
"removeEmptyAnchorsLinks",
"removeTrackingIDs",
"removeLocaleFromUrls"
],
"select": ["div[role=\"main\"]"],
"remove": ["._5tko"],
"executeClientScripts": true
},
"Terms of Service": {
"fetch": "https://fr-fr.facebook.com/legal/terms/plain_text_terms",
"select": ["div[role=\"main\"]"],
"remove": ["footer[role=\"contentinfo\"]"],
"filter": [
"removeEmptyAnchorsLinks",
"removeTrackingIDs",
"removeLocaleFromUrls"
],
"executeClientScripts": true
},
"Community Guidelines": {
"fetch": "https://transparency.fb.com/fr-fr/policies/community-standards",
"select": ["._9ntw"],
"remove": ["._9nxl", "._9ntv", ".img"],
"sections": {
"select": ["._9nrm", "._9q49", "._9p7c"],
"remove": ["._9p72", "svg", "._9ooi", "._9q3_"],
"Self-harm": {
"fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/suicide-self-injury/"
},
"Hate Speech": {
"fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/hate-speech/"
},
"Child Exploitation": {
"fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/child-sexual-exploitation-abuse-nudity/"
},
"Violence Incitement": {
"fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/violence-incitement/"
},
"Violent Organizations": {
"fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/dangerous-individuals-organizations/"
},
"Spam": {
"fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/spam/"
},
"Regulated Goods": {
"fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/regulated-goods/"
},
"Harassment": {
"fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/bullying-harassment/"
},
"Misinformation": {
"fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/misinformation/"
},
"Intellectual Property": {
"fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/intellectual-property/"
},
"Adult Nudity": {
"fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/adult-nudity-sexual-activity/"
},
"Sexual Solicitation": {
"fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/sexual-solicitation/"
},
"Platform Manipulation": {
"fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/inauthentic-behavior/"
},
"Privacy Violations": {
"fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/privacy-violations-image-privacy-rights/"
}
}
},
"Deceased Users": {
"fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/memorialization/",
"select": ["._9nrm", "._9q49", "._9p7c"],
"remove": ["._9p72", "svg", "._9ooi", "._9q3_"]
}
}
}
- Factoring selectors in option 4 improves readability compared to option 1.
- Avoiding repetition of type prefix in option 4 improves readability compared to option 1.
Snapshots and versions
- The folder in option 4 is surprising: why is this subtype more present than others?
- Having both a folder and a file with the same name in option 4 is surprising: why is this both a document and a folder?
- The overrepresentation of Community Guidelines files in option 1 is surprising: why is this subtype more present than others? It also harms readability of the whole folder.
- Listing sections of documents risks making the user want to list sections for other document types.
Reliability and maintenance
No difference was measured between the two options.
Conclusion
After implementation, none of the tested solutions emerges as a clear winner. Along with comments from the PGA team (thanks @pg-adrian for your detailed message 🙇), this reinforces the validity of option 2, where all documents are consolidated into a single one.
The blocking point that was identified was the risk of voiding the promise that documents tracked by Open Terms Archive can hold in court, since the resulting document could not easily be cited as it cannot be referenced by a single URL or date.
However, this could be handled by ensuring snapshots are still integral copies of the documents found online. While admittedly to a lower extent, versions are already “recreated” from snapshots. Thus, as long as the resulting version references its source snapshots properly, it seems acceptable to consolidate them as part of a minimal readability improvement process.
Detailed proposals for implementation of option 2 will be published in this RFC by next week.
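One way to picture the safeguard described in this conclusion is a consolidation step that leaves every snapshot untouched and records, in the resulting version, which snapshots it was built from. The sketch below is illustrative only: the `Snapshot` and `Version` types, their field names and the `consolidate` function are assumptions made for this example, not the engine's actual data model.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Snapshot:
    """Integral copy of one fetched page (hypothetical structure)."""
    id: str
    url: str
    fetched_at: datetime
    content: str  # readable text already extracted from this page


@dataclass(frozen=True)
class Version:
    """Consolidated document that points back to all of its sources."""
    content: str
    snapshot_ids: tuple
    fetched_at: datetime


def consolidate(snapshots):
    """Join page contents in declaration order and keep source references."""
    if not snapshots:
        raise ValueError("at least one snapshot is required")
    content = "\n\n".join(s.content for s in snapshots)
    # Assumption: date the version with the latest fetch among its sources.
    latest = max(s.fetched_at for s in snapshots)
    return Version(content, tuple(s.id for s in snapshots), latest)


pages = [
    Snapshot("a1", "https://example.org/guidelines",
             datetime(2022, 5, 1, 10, 0, tzinfo=timezone.utc), "Intro"),
    Snapshot("b2", "https://example.org/guidelines/spam",
             datetime(2022, 5, 1, 10, 1, tzinfo=timezone.utc), "Spam rules"),
]
version = consolidate(pages)
```

With such a structure, a citation of the consolidated version can still be traced to one snapshot per source page, each with its own URL and fetch date.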
Reframing of problem statement
Declaring, maintaining and analysing the results of tracking community guidelines of several services, along with comments in this RFC, led us to the following reframing. This reframing does not impact the currently explored solution space, but it will hopefully help in avoiding confusion and managing expectations.
Defining pages vs sections
The case of community guidelines is one where service providers have implemented their content sectioning by spreading it across separate web pages.
However, this is not the only solution that they use. Solutions such as accordions have the same aim as splitting across pages: improving legibility. In the case of accordions, we have no problem representing the folded information in a single continuous flow. Similarly, we already split some pages into documents: when Terms of Service and Privacy Policy are on the same webpage, declaration files enable selecting each of those independently through `select`, even though they share the same `fetch` source page. This demonstrates that we do prioritise documents over source pages.
In the case of this RFC, most of the proposed solutions confused sections with pages. The notion of page was not present in Open Terms Archive until now, because every document we handled was contained in at most one page. Community Guidelines demonstrate the possibility of having 2 or more pages constituting one document.
Distinguishing pages and sections support in Open Terms Archive
These topics are different, and it is important not to mix them. One is about how to track a document that is split across multiple pages through its sections. The other is how to annotate sections across a document, no matter if they are on a single page or on many.
Postponing section support
While section support has value, it should not impact the data collection phase: Open Terms Archive is unique in part thanks to its separation of snapshots and versions. Section splitting (or annotation) should be an additional step in the pipeline, one that never puts snapshot authenticity or version consolidation at risk.
There are known users for such a feature: Apolia created their own script to split content last year; ToS;DR does split terms into sections to enable annotating and ranking them.
However, this RFC aims at extending OTA's current solid implementation of document tracking towards multi-page documents. The opportunity of adding section support at the same time is misleading. Such a feature will be handled independently, in its own timeframe.
Great to hear that our experiences were helpful for you in reframing the problem statement and in specifying how you conceptualize the relationship between documents, sections and pages. Looking forward to reading the proposals for the implementation of option 2.
Proposals for implementation of option 2
Option 2A:
Declare an array instead of an object for a document type, where each entry of this array is a document declaration.
{
"name": "Facebook",
"documents": {
"Privacy Policy": {
"fetch": "https://fr-fr.facebook.com/privacy/explanation/",
"filter": [
"removeEmptyAnchorsLinks",
"removeTrackingIDs",
"removeLocaleFromUrls"
],
"select": ["div[role=\"main\"]"],
"remove": ["._5tko"],
"executeClientScripts": true
},
"Terms of Service": {
"fetch": "https://fr-fr.facebook.com/legal/terms/plain_text_terms",
"select": ["div[role=\"main\"]"],
"remove": ["footer[role=\"contentinfo\"]"],
"filter": [
"removeEmptyAnchorsLinks",
"removeTrackingIDs",
"removeLocaleFromUrls"
],
"executeClientScripts": true
},
"Community Guidelines": [
{
"fetch": "https://transparency.fb.com/fr-fr/policies/community-standards",
"select": ["._9ntw"],
"remove": ["._9nxl", "._9ntv", ".img"]
},
{
"fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/suicide-self-injury/",
"select": ["._9nrm", "._9p7c"],
"remove": ["._9p72", "svg", "._9ooi", "._9q3_"]
},
{
"fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/hate-speech/",
"select": ["._9nrm", "._9p7c"],
"remove": ["._9p72", "svg", "._9ooi", "._9q3_"]
},
{
"fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/child-sexual-exploitation-abuse-nudity/",
"select": ["._9nrm", "._9p7c"],
"remove": ["._9p72", "svg", "._9ooi", "._9q3_"]
},
{
"fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/violence-incitement/",
"select": ["._9nrm", "._9p7c"],
"remove": ["._9p72", "svg", "._9ooi", "._9q3_"]
},
{
"fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/dangerous-individuals-organizations/",
"select": ["._9nrm", "._9p7c"],
"remove": ["._9p72", "svg", "._9ooi", "._9q3_"]
},
{
"fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/spam/",
"select": ["._9nrm", "._9p7c"],
"remove": ["._9p72", "svg", "._9ooi", "._9q3_"]
},
{
"fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/regulated-goods/",
"select": ["._9nrm", "._9p7c"],
"remove": ["._9p72", "svg", "._9ooi", "._9q3_"]
},
{
"fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/bullying-harassment/",
"select": ["._9nrm", "._9p7c"],
"remove": ["._9p72", "svg", "._9ooi", "._9q3_"]
},
{
"fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/misinformation/",
"select": ["._9nrm", "._9p7c"],
"remove": ["._9p72", "svg", "._9ooi", "._9q3_"]
},
{
"fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/intellectual-property/",
"select": ["._9nrm", "._9p7c"],
"remove": ["._9p72", "svg", "._9ooi", "._9q3_"]
},
{
"fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/adult-nudity-sexual-activity/",
"select": ["._9nrm", "._9p7c"],
"remove": ["._9p72", "svg", "._9ooi", "._9q3_"]
},
{
"fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/sexual-solicitation/",
"select": ["._9nrm", "._9p7c"],
"remove": ["._9p72", "svg", "._9ooi", "._9q3_"]
},
{
"fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/inauthentic-behavior/",
"select": ["._9nrm", "._9p7c"],
"remove": ["._9p72", "svg", "._9ooi", "._9q3_"]
},
{
"fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/privacy-violations-image-privacy-rights/",
"select": ["._9nrm", "._9p7c"],
"remove": ["._9p72", "svg", "._9ooi", "._9q3_"]
}
]
}
}
Option 2B:
Declare an array for each key inside a document declaration.
{
"name": "Facebook",
"documents": {
"Privacy Policy": {
"fetch": "https://fr-fr.facebook.com/privacy/explanation/",
"filter": ["removeEmptyAnchorsLinks", "removeTrackingIDs", "removeLocaleFromUrls"],
"select": ["div[role=\"main\"]"],
"remove": ["._5tko"],
"executeClientScripts": true
},
"Terms of Service": {
"fetch": "https://fr-fr.facebook.com/legal/terms/plain_text_terms",
"select": ["div[role=\"main\"]"],
"remove": ["footer[role=\"contentinfo\"]"],
"filter": ["removeEmptyAnchorsLinks", "removeTrackingIDs", "removeLocaleFromUrls"],
"executeClientScripts": true
},
"Community Guidelines": {
"fetch": [
"https://transparency.fb.com/fr-fr/policies/community-standards",
"https://transparency.fb.com/fr-fr/policies/community-standards/suicide-self-injury/",
"https://transparency.fb.com/fr-fr/policies/community-standards/hate-speech/",
"https://transparency.fb.com/fr-fr/policies/community-standards/child-sexual-exploitation-abuse-nudity/",
"https://transparency.fb.com/fr-fr/policies/community-standards/violence-incitement/",
"https://transparency.fb.com/fr-fr/policies/community-standards/dangerous-individuals-organizations/",
"https://transparency.fb.com/fr-fr/policies/community-standards/spam/",
"https://transparency.fb.com/fr-fr/policies/community-standards/regulated-goods/",
"https://transparency.fb.com/fr-fr/policies/community-standards/bullying-harassment/",
"https://transparency.fb.com/fr-fr/policies/community-standards/misinformation/",
"https://transparency.fb.com/fr-fr/policies/community-standards/intellectual-property/",
"https://transparency.fb.com/fr-fr/policies/community-standards/adult-nudity-sexual-activity/",
"https://transparency.fb.com/fr-fr/policies/community-standards/sexual-solicitation/",
"https://transparency.fb.com/fr-fr/policies/community-standards/inauthentic-behavior/",
"https://transparency.fb.com/fr-fr/policies/community-standards/privacy-violations-image-privacy-rights/",
"https://transparency.fb.com/fr-fr/policies/community-standards/memorialization/"
],
"select": [
"._9ntw",
"._9nrm, ._9p7c",
"._9nrm, ._9p7c",
"._9nrm, ._9p7c",
"._9nrm, ._9p7c",
"._9nrm, ._9p7c",
"._9nrm, ._9p7c",
"._9nrm, ._9p7c",
"._9nrm, ._9p7c",
"._9nrm, ._9p7c",
"._9nrm, ._9p7c",
"._9nrm, ._9p7c",
"._9nrm, ._9p7c",
"._9nrm, ._9p7c",
"._9nrm, ._9p7c",
"._9nrm, ._9p7c"
],
"remove": [
"._9nxl, ._9ntv, .img",
"._9p72, svg, ._9ooi, ._9q3_",
"._9p72, svg, ._9ooi, ._9q3_",
"._9p72, svg, ._9ooi, ._9q3_",
"._9p72, svg, ._9ooi, ._9q3_",
"._9p72, svg, ._9ooi, ._9q3_",
"._9p72, svg, ._9ooi, ._9q3_",
"._9p72, svg, ._9ooi, ._9q3_",
"._9p72, svg, ._9ooi, ._9q3_",
"._9p72, svg, ._9ooi, ._9q3_",
"._9p72, svg, ._9ooi, ._9q3_",
"._9p72, svg, ._9ooi, ._9q3_",
"._9p72, svg, ._9ooi, ._9q3_",
"._9p72, svg, ._9ooi, ._9q3_",
"._9p72, svg, ._9ooi, ._9q3_",
"._9p72, svg, ._9ooi, ._9q3_"
]
}
}
}
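Option 2B relies on the position of each value in `fetch`, `select` and `remove` to pair them up. The sketch below shows, under that reading, how such parallel arrays could be turned back into per-page declarations. `pages_from_parallel_arrays` is a hypothetical helper written for this illustration; it rejects arrays of different lengths, which is the kind of mistake this syntax makes easy to introduce.

```python
def pages_from_parallel_arrays(declaration, keys=("fetch", "select", "remove")):
    """Pair the n-th value of every array key into the n-th page declaration."""
    present = [key for key in keys if key in declaration]
    lengths = {key: len(declaration[key]) for key in present}
    if len(set(lengths.values())) > 1:
        # A missing or extra entry would silently shift selectors to the wrong page.
        raise ValueError(f"arrays have different lengths: {lengths}")
    columns = (declaration[key] for key in present)
    return [dict(zip(present, values)) for values in zip(*columns)]


declaration = {
    "fetch": ["https://example.org/a", "https://example.org/b"],
    "select": ["._9ntw", "._9nrm, ._9p7c"],
    "remove": ["._9nxl, ._9ntv, .img", "._9p72, svg, ._9ooi, ._9q3_"],
}
pages = pages_from_parallel_arrays(declaration)
```

Each resulting entry has the shape of an Option 2A array element, which suggests the two syntaxes carry the same information and differ mainly in how easy they are to read and maintain.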
Option 2C:
A mix of Option 2A and Option 2B where only the `fetch` key can accept an array. It allows factoring out `select`, `remove` and `filters` for an array of pages to fetch.
{
"name": "Facebook",
"documents": {
"Privacy Policy": {
"fetch": "https://fr-fr.facebook.com/privacy/explanation/",
"filter": [
"removeEmptyAnchorsLinks",
"removeTrackingIDs",
"removeLocaleFromUrls"
],
"select": ["div[role=\"main\"]"],
"remove": ["._5tko"],
"executeClientScripts": true
},
"Terms of Service": {
"fetch": "https://fr-fr.facebook.com/legal/terms/plain_text_terms",
"select": ["div[role=\"main\"]"],
"remove": ["footer[role=\"contentinfo\"]"],
"filter": [
"removeEmptyAnchorsLinks",
"removeTrackingIDs",
"removeLocaleFromUrls"
],
"executeClientScripts": true
},
"Community Guidelines": [
{
"fetch": "https://transparency.fb.com/fr-fr/policies/community-standards",
"select": ["._9ntw"],
"remove": ["._9nxl"]
},
{
"fetch": [
"https://transparency.fb.com/fr-fr/policies/community-standards/suicide-self-injury/",
"https://transparency.fb.com/fr-fr/policies/community-standards/hate-speech/",
"https://transparency.fb.com/fr-fr/policies/community-standards/child-sexual-exploitation-abuse-nudity/",
"https://transparency.fb.com/fr-fr/policies/community-standards/violence-incitement/"
],
"select": ["._9nrm"],
"remove": ["._9p72"]
},
{
"fetch": [
"https://transparency.fb.com/fr-fr/policies/community-standards/dangerous-individuals-organizations/",
"https://transparency.fb.com/fr-fr/policies/community-standards/spam/",
"https://transparency.fb.com/fr-fr/policies/community-standards/bullying-harassment/",
"https://transparency.fb.com/fr-fr/policies/community-standards/misinformation/",
"https://transparency.fb.com/fr-fr/policies/community-standards/intellectual-property/",
"https://transparency.fb.com/fr-fr/policies/community-standards/adult-nudity-sexual-activity/",
"https://transparency.fb.com/fr-fr/policies/community-standards/sexual-solicitation/",
"https://transparency.fb.com/fr-fr/policies/community-standards/inauthentic-behavior/",
"https://transparency.fb.com/fr-fr/policies/community-standards/privacy-violations-image-privacy-rights/"
],
"select": ["._9ntw"],
"remove": ["._9nxl", "._9ntv", ".img"]
}
]
}
}
Option 2D:
Add a `pages` key to the document declaration, which is an array that can accept document declarations. When a required key is not defined in an entry, the value defined for that key at the root of the document declaration is used. This also allows factoring out `select`, `remove` and `filters`.
{
"name": "Facebook",
"documents": {
"Privacy Policy": {
"fetch": "https://fr-fr.facebook.com/privacy/explanation/",
"filter": [
"removeEmptyAnchorsLinks",
"removeTrackingIDs",
"removeLocaleFromUrls"
],
"select": ["div[role=\"main\"]"],
"remove": ["._5tko"],
"executeClientScripts": true
},
"Terms of Service": {
"fetch": "https://fr-fr.facebook.com/legal/terms/plain_text_terms",
"select": ["div[role=\"main\"]"],
"remove": ["footer[role=\"contentinfo\"]"],
"filter": [
"removeEmptyAnchorsLinks",
"removeTrackingIDs",
"removeLocaleFromUrls"
],
"executeClientScripts": true
},
"Community Guidelines": {
"select": ["._9ntw"],
"remove": ["._9nxl", "._9ntv", ".img"],
"pages": [
{ "fetch": "https://transparency.fb.com/fr-fr/policies/community-standards" },
{ "fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/suicide-self-injury/" },
{ "fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/hate-speech/" },
{ "fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/child-sexual-exploitation-abuse-nudity/" },
{ "fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/violence-incitement/" },
{
"fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/dangerous-individuals-organizations/",
"select": ["._9nrm", "._9p7c"],
"remove": ["._9p72"]
},
{
"fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/spam/",
"select": ["._9nrm", "._9p7c"],
"remove": ["._9p72"]
},
{ "fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/regulated-goods/" },
{ "fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/bullying-harassment/" },
{ "fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/misinformation/" },
{ "fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/intellectual-property/" },
{ "fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/adult-nudity-sexual-activity/" },
{ "fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/sexual-solicitation/" },
{ "fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/inauthentic-behavior/" },
{ "fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/privacy-violations-image-privacy-rights/" }
]
}
}
}
Thank you very much @Ndpnt for this work of consolidation!
Reading through these examples, I am quite clearly in favour of 2D. I like how having a dedicated key makes it explicit and enables validation with no ambiguity: either we have a `fetch` or we have a `pages` key, there is no mysterious syntax to know about arrays.
In particular, I dislike 2A and 2C because they are ambiguous on the intention: what does it mean that a document type maps to an array? Are there as many `Community Guidelines` as there are entries in the array?
2B, while not elegant, at least seems at the right level of nesting to me.
I have one suggestion for improvement though. For the moment, we voluntarily stuck to using verbs for every entry of a document declaration. I would suggest that we keep that behaviour, and use a verb such as `merge`, `assemble`, `consolidate`, `join`, `combine`, `fuse`, `meld`…
Naming
In order to sort through synonyms, I ran a Google Trends search to find the most common term. The most common term for this operation seems to be “to merge”, followed by “to combine”. I confirmed this by running the same search with “PDF” or “pages” instead of “documents”.
[Image: Screen Shot 2022-05-15 at 16 59 48]
However, I am a bit concerned that, in a context of operation where we rely a lot on Git, “merge” becomes ambiguous with the eponymous Git operation, when it is something very different that we want to describe here. Thus, I offer:
Option 2.D.i
Same as 2D, just renaming `pages` to `combine`. I also suggest writing the factored keys after the `combine` key, in order to further distinguish them from the (much more common) single-page declarations.
Support a `combine` key in document declarations that contains an array of objects with `fetch` and optionally `select`, `remove`, `filter` keys; in this case, the `select`, `remove`, `filter` specified at the same level as `combine` are considered as defaults for every entry in the array.
Formal definition
- Redefine document declaration as single-page declaration or multipage declaration.
- Define page declaration as almost the same as the current document declaration, with its `select` key made optional.
- Define single-page declaration as a page declaration with mandatory `select`.
- Define multipage declaration as an object with a mandatory `combine` key containing at least 2 single-page declarations, and optionally `select`, `remove` and `filter` keys.
  - These keys at the multipage declaration level are interpreted as to be applied to each page declaration when they are not defined at that level.
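The fallback rule in this definition can be sketched as a small resolution step that turns a multipage declaration into one complete page declaration per entry. This is only an illustration: `resolve_pages` and `INHERITABLE_KEYS` are hypothetical names, and the sketch assumes that a key defined on a page replaces the declaration-level value entirely rather than being merged with it.

```python
INHERITABLE_KEYS = ("select", "remove", "filter")


def resolve_pages(declaration):
    """Return one complete page declaration per entry of `combine`."""
    if "combine" not in declaration:
        # Single-page declaration: it already describes its only page.
        return [dict(declaration)]

    defaults = {k: declaration[k] for k in INHERITABLE_KEYS if k in declaration}
    pages = []
    for page in declaration["combine"]:
        resolved = {**defaults, **page}  # page-level keys override defaults
        if "select" not in resolved:
            raise ValueError(f"no `select` for page {page.get('fetch')}")
        pages.append(resolved)
    return pages


community_guidelines = {
    "combine": [
        {"fetch": "https://transparency.fb.com/fr-fr/policies/community-standards"},
        {
            "fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/spam/",
            "select": ["._9nrm", "._9p7c"],
            "remove": ["._9p72"],
        },
    ],
    "select": ["._9ntw"],
    "remove": ["._9nxl", "._9ntv", ".img"],
}

pages = resolve_pages(community_guidelines)
```

Here the first page inherits both declaration-level keys, while the second keeps its own `select` and `remove`.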
Example
{
"name": "Facebook",
"documents": {
"Privacy Policy": {
"fetch": "https://fr-fr.facebook.com/privacy/explanation/",
"filter": [
"removeEmptyAnchorsLinks",
"removeTrackingIDs",
"removeLocaleFromUrls"
],
"select": ["div[role=\"main\"]"],
"remove": ["._5tko"],
"executeClientScripts": true
},
"Terms of Service": {
"fetch": "https://fr-fr.facebook.com/legal/terms/plain_text_terms",
"select": ["div[role=\"main\"]"],
"remove": ["footer[role=\"contentinfo\"]"],
"filter": [
"removeEmptyAnchorsLinks",
"removeTrackingIDs",
"removeLocaleFromUrls"
],
"executeClientScripts": true
},
"Community Guidelines": {
"combine": [
{ "fetch": "https://transparency.fb.com/fr-fr/policies/community-standards" },
{ "fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/suicide-self-injury/" },
{ "fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/hate-speech/" },
{ "fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/child-sexual-exploitation-abuse-nudity/" },
{ "fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/violence-incitement/" },
{
"fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/dangerous-individuals-organizations/",
"select": ["._9nrm", "._9p7c"],
"remove": ["._9p72"]
},
{
"fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/spam/",
"select": ["._9nrm", "._9p7c"],
"remove": ["._9p72"]
},
{ "fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/regulated-goods/" },
{ "fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/bullying-harassment/" },
{ "fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/misinformation/" },
{ "fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/intellectual-property/" },
{ "fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/adult-nudity-sexual-activity/" },
{ "fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/sexual-solicitation/" },
{ "fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/inauthentic-behavior/" },
{ "fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/privacy-violations-image-privacy-rights/" }
],
"select": ["._9ntw"],
"remove": ["._9nxl", "._9ntv", ".img"]
}
}
}
Thanks for these very neat and understandable proposals.
I'm in favor of the 2.D.i option, which is the easiest to understand.
I'm fine with `combine` (and I agree `merge` would be too ambiguous in our context).
Thanks @Ndpnt @MattiSG, it is very clear and complete!
I'm also in favor of 2.D.i, and using `combine` seems to me to be a great idea, as it avoids introducing the notion of page 👍
Looping in @Amustache @LVerneyPEReN @afisher3578 @Manu1400 @streitlua for them to vote on these options or suggest improvements 🙂
Thanks @MattiSG for the relevant improvement of option 2D.
I am wondering whether factored keys are easily understandable. 🤔 Should we introduce a specific term to wrap these keys?
It could be `share`, or `with` (`combine` … `with` …), or another term.
For example:
{
"name": "Facebook",
"documents": {
"Privacy Policy": {
"fetch": "https://fr-fr.facebook.com/privacy/explanation/",
"filter": [
"removeEmptyAnchorsLinks",
"removeTrackingIDs",
"removeLocaleFromUrls"
],
"select": ["div[role=\"main\"]"],
"remove": ["._5tko"],
"executeClientScripts": true
},
"Terms of Service": {
"fetch": "https://fr-fr.facebook.com/legal/terms/plain_text_terms",
"select": ["div[role=\"main\"]"],
"remove": ["footer[role=\"contentinfo\"]"],
"filter": [
"removeEmptyAnchorsLinks",
"removeTrackingIDs",
"removeLocaleFromUrls"
],
"executeClientScripts": true
},
"Community Guidelines": {
"combine": [
{ "fetch": "https://transparency.fb.com/fr-fr/policies/community-standards" },
{ "fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/suicide-self-injury/" },
{ "fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/hate-speech/" },
{ "fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/child-sexual-exploitation-abuse-nudity/" },
{ "fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/violence-incitement/" },
{
"fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/dangerous-individuals-organizations/",
"select": ["._9nrm", "._9p7c"],
"remove": ["._9p72"]
},
{
"fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/spam/",
"select": ["._9nrm", "._9p7c"],
"remove": ["._9p72"]
},
{ "fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/regulated-goods/" },
{ "fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/bullying-harassment/" },
{ "fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/misinformation/" },
{ "fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/intellectual-property/" },
{ "fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/adult-nudity-sexual-activity/" },
{ "fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/sexual-solicitation/" },
{ "fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/inauthentic-behavior/" },
{ "fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/privacy-violations-image-privacy-rights/" }
],
"with": {
"select": ["._9ntw"],
"remove": ["._9nxl", "._9ntv", ".img"]
}
}
}
}
or
{
"name": "Facebook",
"documents": {
"Privacy Policy": {
"fetch": "https://fr-fr.facebook.com/privacy/explanation/",
"filter": [
"removeEmptyAnchorsLinks",
"removeTrackingIDs",
"removeLocaleFromUrls"
],
"select": ["div[role=\"main\"]"],
"remove": ["._5tko"],
"executeClientScripts": true
},
"Terms of Service": {
"fetch": "https://fr-fr.facebook.com/legal/terms/plain_text_terms",
"select": ["div[role=\"main\"]"],
"remove": ["footer[role=\"contentinfo\"]"],
"filter": [
"removeEmptyAnchorsLinks",
"removeTrackingIDs",
"removeLocaleFromUrls"
],
"executeClientScripts": true
},
"Community Guidelines": {
"combine": {
"pages": [
{ "fetch": "https://transparency.fb.com/fr-fr/policies/community-standards" },
{ "fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/suicide-self-injury/" },
{ "fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/hate-speech/" },
{ "fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/child-sexual-exploitation-abuse-nudity/" },
{ "fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/violence-incitement/" },
{
"fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/dangerous-individuals-organizations/",
"select": ["._9nrm", "._9p7c"],
"remove": ["._9p72"]
},
{
"fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/spam/",
"select": ["._9nrm", "._9p7c"],
"remove": ["._9p72"]
},
{ "fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/regulated-goods/" },
{ "fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/bullying-harassment/" },
{ "fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/misinformation/" },
{ "fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/intellectual-property/" },
{ "fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/adult-nudity-sexual-activity/" },
{ "fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/sexual-solicitation/" },
{ "fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/inauthentic-behavior/" },
{ "fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/privacy-violations-image-privacy-rights/" }
],
"with": {
"select": ["._9ntw"],
"remove": ["._9nxl", "._9ntv", ".img"]
}
}
}
}
}
Thanks @Ndpnt for this proposal! It is interesting, but is currently lacking the level of formality that would enable us to debate it properly in the context of an RFC, as it diverges without offering a clear option that we could decide on. Could we please evolve this into a formal option, with a clear naming proposal? 🙂 Thanks!
I understand, you are right. I am going to do this in the next few days.
I also like option 2.D.i - it's easy to understand how it works and the writing overheads are minimal.
I agree that the "with" key could improve natural language readability, but I don't think it's necessary. In any case, I prefer @Ndpnt's second proposition.
Option 2.D.i.a
Same as 2.D.i, but with the factorized values made explicit as default values with the suffix …Default, for example selectDefault.
In this context, I suggest writing the …Default keys before the combine key, as it is more common to have defaults set before their replacements.
Formal definition
- Redefine document declaration as single-page declaration or multipage declaration.
- Define page declaration as almost the same as the current document declaration, except that its select, remove and filter keys are made optional; only fetch is required.
- Define single-page declaration as a page declaration with mandatory fetch and select.
- Define multipage declaration as an object with a mandatory combine key containing at least 2 single-page declarations, and optionally selectDefault, removeDefault and filterDefault keys.
  - These keys at the multipage declaration level are applied to each page declaration that does not define the corresponding key itself.
  - These keys should be defined before the combine key.
Example
{
"name": "Facebook",
"documents": {
"Privacy Policy": {
"fetch": "https://fr-fr.facebook.com/privacy/explanation/",
"filter": [
"removeEmptyAnchorsLinks",
"removeTrackingIDs",
"removeLocaleFromUrls"
],
"select": ["div[role=\"main\"]"],
"remove": ["._5tko"],
"executeClientScripts": true
},
"Terms of Service": {
"fetch": "https://fr-fr.facebook.com/legal/terms/plain_text_terms",
"select": ["div[role=\"main\"]"],
"remove": ["footer[role=\"contentinfo\"]"],
"filter": [
"removeEmptyAnchorsLinks",
"removeTrackingIDs",
"removeLocaleFromUrls"
],
"executeClientScripts": true
},
"Community Guidelines": {
"selectDefault": ["._9ntw"],
"removeDefault": ["._9nxl", "._9ntv", ".img"],
"combine": [
{ "fetch": "https://transparency.fb.com/fr-fr/policies/community-standards" },
{ "fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/suicide-self-injury/" },
{ "fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/hate-speech/" },
{ "fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/child-sexual-exploitation-abuse-nudity/" },
{ "fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/violence-incitement/" },
{
"fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/dangerous-individuals-organizations/",
"select": ["._9nrm", "._9p7c"],
"remove": ["._9p72"]
},
{
"fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/spam/",
"select": ["._9nrm", "._9p7c"],
"remove": ["._9p72"]
},
{ "fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/regulated-goods/" },
{ "fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/bullying-harassment/" },
{ "fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/misinformation/" },
{ "fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/intellectual-property/" },
{ "fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/adult-nudity-sexual-activity/" },
{ "fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/sexual-solicitation/" },
{ "fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/inauthentic-behavior/" },
{ "fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/privacy-violations-image-privacy-rights/" }
]
}
}
}
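As a minimal sketch of the resolution rule in the formal definition above, the …Default keys could be applied to each page as follows. The type names and the resolvePages function are hypothetical and are not taken from the engine's code; the sketch also assumes that a key defined on a page replaces the default entirely rather than being merged with it.

```typescript
// Hypothetical types for illustration only, not the engine's actual schema.
interface PageDeclaration {
  fetch: string;
  select?: string[];
  remove?: string[];
  filter?: string[];
}

interface MultipageDeclaration {
  selectDefault?: string[];
  removeDefault?: string[];
  filterDefault?: string[];
  combine: PageDeclaration[];
}

// Apply each …Default key to the pages that do not define the matching key.
// Assumption: a page-level key replaces the default, it is not merged with it.
function resolvePages(declaration: MultipageDeclaration): PageDeclaration[] {
  const { selectDefault, removeDefault, filterDefault, combine } = declaration;

  return combine.map((page) => {
    const resolved: PageDeclaration = { fetch: page.fetch };
    const select = page.select ?? selectDefault;
    const remove = page.remove ?? removeDefault;
    const filter = page.filter ?? filterDefault;

    // Only set keys that end up with a value, so the result stays minimal.
    if (select) resolved.select = select;
    if (remove) resolved.remove = remove;
    if (filter) resolved.filter = filter;

    return resolved;
  });
}

const pages = resolvePages({
  selectDefault: ["._9ntw"],
  removeDefault: ["._9nxl", "._9ntv", ".img"],
  combine: [
    { fetch: "https://transparency.fb.com/fr-fr/policies/community-standards" },
    {
      fetch: "https://transparency.fb.com/fr-fr/policies/community-standards/spam/",
      select: ["._9nrm", "._9p7c"],
      remove: ["._9p72"],
    },
  ],
});
```

Here the first page inherits both defaults, while the spam page keeps its own select and remove values.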
Option 2.D.i.b
Same as 2.D.i, but with the factorized values made explicit as default values within a new key defaults.
In this context, I suggest writing the defaults key before the combine key, as it is more common to have defaults set before their replacements.
Formal definition
- Redefine document declaration as single-page declaration or multipage declaration.
- Define page declaration as almost the same as the current document declaration, except that its select, remove and filter keys are made optional; only fetch is required.
- Define single-page declaration as a page declaration with mandatory fetch and select.
- Define multipage declaration as an object with a mandatory combine key containing at least 2 single-page declarations, and optionally a defaults key.
  - The defaults key can contain the optional keys select, remove and filter, but at least one of them is required.
  - Keys defined in the defaults key at the multipage declaration level are applied to each page declaration that does not define them itself.
  - The defaults key should be defined before the combine key.
Example
{
"name": "Facebook",
"documents": {
"Privacy Policy": {
"fetch": "https://fr-fr.facebook.com/privacy/explanation/",
"filter": [
"removeEmptyAnchorsLinks",
"removeTrackingIDs",
"removeLocaleFromUrls"
],
"select": ["div[role=\"main\"]"],
"remove": ["._5tko"],
"executeClientScripts": true
},
"Terms of Service": {
"fetch": "https://fr-fr.facebook.com/legal/terms/plain_text_terms",
"select": ["div[role=\"main\"]"],
"remove": ["footer[role=\"contentinfo\"]"],
"filter": [
"removeEmptyAnchorsLinks",
"removeTrackingIDs",
"removeLocaleFromUrls"
],
"executeClientScripts": true
},
"Community Guidelines": {
"defaults": {
"select": ["._9ntw"],
"remove": ["._9nxl", "._9ntv", ".img"]
},
"combine": [
{ "fetch": "https://transparency.fb.com/fr-fr/policies/community-standards" },
{ "fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/suicide-self-injury/" },
{ "fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/hate-speech/" },
{ "fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/child-sexual-exploitation-abuse-nudity/" },
{ "fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/violence-incitement/" },
{
"fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/dangerous-individuals-organizations/",
"select": ["._9nrm", "._9p7c"],
"remove": ["._9p72"]
},
{
"fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/spam/",
"select": ["._9nrm", "._9p7c"],
"remove": ["._9p72"]
},
{ "fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/regulated-goods/" },
{ "fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/bullying-harassment/" },
{ "fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/misinformation/" },
{ "fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/intellectual-property/" },
{ "fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/adult-nudity-sexual-activity/" },
{ "fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/sexual-solicitation/" },
{ "fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/inauthentic-behavior/" },
{ "fetch": "https://transparency.fb.com/fr-fr/policies/community-standards/privacy-violations-image-privacy-rights/" }
]
}
}
}
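The constraints and the resolution rule from the formal definition above could be sketched as follows. This is only an illustration under stated assumptions: the type names and the validate and resolvePages functions are hypothetical, not part of the engine, and a key defined on a page is assumed to replace the default entirely.

```typescript
// Hypothetical types for illustration only, not the engine's actual schema.
interface PageOptions {
  select?: string[];
  remove?: string[];
  filter?: string[];
}

interface PageDeclaration extends PageOptions {
  fetch: string;
}

interface MultipageDeclaration {
  defaults?: PageOptions;
  combine: PageDeclaration[];
}

// Check the two constraints stated in the formal definition.
function validate(declaration: MultipageDeclaration): string[] {
  const errors: string[] = [];
  const { defaults, combine } = declaration;

  if (combine.length < 2) {
    errors.push("combine must contain at least 2 page declarations");
  }
  if (defaults && !defaults.select && !defaults.remove && !defaults.filter) {
    errors.push("defaults must contain at least one of select, remove, filter");
  }

  return errors;
}

// Keys from defaults are used unless the page declares the same key.
// Assumption: page-level keys replace the defaults, they are not merged.
function resolvePages(declaration: MultipageDeclaration): PageDeclaration[] {
  return declaration.combine.map((page) => ({ ...declaration.defaults, ...page }));
}

const declaration: MultipageDeclaration = {
  defaults: {
    select: ["._9ntw"],
    remove: ["._9nxl", "._9ntv", ".img"],
  },
  combine: [
    { fetch: "https://transparency.fb.com/fr-fr/policies/community-standards" },
    {
      fetch: "https://transparency.fb.com/fr-fr/policies/community-standards/spam/",
      select: ["._9nrm", "._9p7c"],
      remove: ["._9p72"],
    },
  ],
};

const errors = validate(declaration);
const pages = resolvePages(declaration);
```

Grouping the defaults in a single object keeps the resolution to one object spread, at the cost of one extra nesting level in the declaration.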
Thanks Nico for these proposals.
I am in favor of option 2.D.i because in options 2.D.i.a and 2.D.i.b the syntax is too different between a declaration for one page and a multipage declaration. This could introduce complexity and confusion.
But I would be curious to know the opinion of users who are less used to manipulating this syntax.
I am in favor of option 2.D.i because in options 2.D.i.a and 2.D.i.b the syntax is too different between a declaration for one page and a multipage declaration. This could introduce complexity and confusion.
In option 2.D.i, we have select and remove keys without a fetch key, and I'm not sure it's obvious to contributors that they are default values applied to each page declared in the combine key when that page does not define them. So, in fact, the syntax is already different and there is a kind of magic. And I think it's better to expose the magic and be explicit.
I voted through emojis as discussed in retrospective.
Also, I believe 2.D.i with the defaults at the top is enough and more readable than a defaults key or keys suffixed with Default.
I also found option 2.D.i to still be easy to understand, even as someone who is new to this syntax.
Thanks everyone for your inputs and contributions on this first semi-formal RFC! 💖 I'm glad of the direction we're taking and the good collaboration around it 😊
We'll leave this open until next Tuesday for any additional comments. Until then, let's all try to stay focused on either casting votes on existing propositions, adding new ones formally, or adding objective data points 🙂
I noticed that, when we had a brief, transient issue with fetching documents on Instagram, we received a huge number of notifications (and the same again when the issue resolved itself) because the number of declared documents was very large in the implementation of option 1. The fact that all of the community guidelines were inaccessible at the same moment, and not other documents, is another hint that they are treated as a single group by platforms. As a maintainer, receiving all these notifications and trying to fix them was made needlessly more complex by having 20 documents instead of a single one.
In my view, this very much goes in favour of concatenating (option 2), which was the path we were already on anyway 😉