distributed-wikipedia-mirror
Establish a data control plan
We should outline the control plan (hand over to Wikipedia itself, etc.).
Though we hope that Wikipedia will take this under their wing sometime, we should not assume that they will. Based on that, we're setting up a community-based model for managing the generation of snapshots from Kiwix dumps. This is one of the first tests of the model that evolved out of the Data Rescue hackathons in early 2017 -- where communities of hackers, content specialists and do-gooders work together to manage the work of pulling data off of centralized servers and redistributing it.
To apply this model we're partnering with @b5 from http://www.qri.io/, who did a lot of the technical work behind the Data Rescue hackathons. Many other people like @dcwalk, @titaniumbones, @mayaad, @trinberg and @abergman contributed to the evolution of this model.
The Process
Key elements of this process:
- Embrace community contributions with an open model of community governance. In short, use GitHub and PRs to manage everything. Actively embrace contributions by community members, give them a voice in governance of the code, and provide a clear definition of the requirements to become a committer.
- Use code to automate repeatable tasks: rather than having lots of people write one-off scripts and run them once, put that energy into building and maintaining reusable scripts (see the sketch after this list).
- Be careful about provenance and chain of custody: it's important to be clear exactly where the snapshots came from and exactly what was done to them. To enforce this, we have to be careful about who runs the scripts and how they run them.
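To make the "reusable scripts" idea concrete, here's a minimal sketch of what one such script could look like. This is not the project's actual tooling: it assumes a local `ipfs` binary on the PATH, and the dump URL, file names, and the shape of the provenance record are all illustrative assumptions.

```go
// snapshot.go -- a minimal sketch of a reusable snapshot script, NOT the
// project's actual tooling. Assumes a local `ipfs` binary on the PATH; the
// dump URL and file names are illustrative.
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"os"
	"os/exec"
	"strings"
	"time"
)

// provenance records exactly where a snapshot came from and what was done to it.
type provenance struct {
	SourceURL string    `json:"sourceUrl"`
	SHA256    string    `json:"sha256"`
	IPFSHash  string    `json:"ipfsHash"`
	FetchedAt time.Time `json:"fetchedAt"`
}

func main() {
	// Hypothetical dump URL; a real run would read this from reviewed config.
	src := "http://download.kiwix.org/zim/wikipedia/example.zim"

	// 1. Fetch the dump, hashing it as it's written to disk.
	resp, err := http.Get(src)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	out, err := os.Create("dump.zim")
	if err != nil {
		panic(err)
	}
	h := sha256.New()
	if _, err := io.Copy(io.MultiWriter(out, h), resp.Body); err != nil {
		panic(err)
	}
	out.Close()

	// 2. Add the dump to IPFS; -Q prints only the final hash.
	raw, err := exec.Command("ipfs", "add", "-Q", "dump.zim").Output()
	if err != nil {
		panic(err)
	}
	hash := strings.TrimSpace(string(raw))

	// 3. Write a provenance record alongside the snapshot.
	f, err := os.Create("provenance.json")
	if err != nil {
		panic(err)
	}
	defer f.Close()
	rec := provenance{
		SourceURL: src,
		SHA256:    hex.EncodeToString(h.Sum(nil)),
		IPFSHash:  hash,
		FetchedAt: time.Now().UTC(),
	}
	if err := json.NewEncoder(f).Encode(rec); err != nil {
		panic(err)
	}
	fmt.Println("snapshot:", hash)
}
```

Because a script like this is versioned in the repo and reviewed via PRs, the community can see exactly what transformations a snapshot went through, which is what makes the provenance story credible.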
Balancing Open Community with Careful Chain of Custody
It may seem like the open community model is at odds with maintaining a clear chain of custody when processing the snapshots. Here's how we will balance the two:
Open community contributions (via GitHub Pull Requests, etc.) wherever possible, including:
- maintaining the scripts that pull dumps from Kiwix
- maintaining any scripts that modify snapshots and write them to IPFS
- nominating new language variants to be added as snapshots
- deciding when to run new snapshots
- maintaining the Docker container that is used to run these scripts

All of this with an open governance model around who can become a committer on the repo, etc.
Meanwhile a smaller group of committers will handle:
- running the scripts, using the community-managed Docker image, to generate new snapshots
- publishing updates to the IPNS entries (a sketch of this step follows below)
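For illustration only, the publish step might look like the sketch below. The IPNS key name `wikipedia-tr` and the snapshot hash are made-up examples; it assumes the key already exists in the committer's local `ipfs` keystore.

```go
// publish.go -- a minimal sketch of the committer-only publish step. The IPNS
// key name "wikipedia-tr" and the snapshot hash are made-up examples; assumes
// the key already exists in the local `ipfs` keystore.
package main

import (
	"fmt"
	"os/exec"
)

func main() {
	snapshotHash := "QmExampleSnapshotHash" // output of the community-managed build scripts

	// `ipfs name publish --key=<name> /ipfs/<hash>` points the key's IPNS name
	// at the new snapshot; only holders of the private key can do this, which
	// is what keeps the publish step in the hands of the smaller committer group.
	out, err := exec.Command("ipfs", "name", "publish",
		"--key=wikipedia-tr", "/ipfs/"+snapshotHash).CombinedOutput()
	if err != nil {
		panic(fmt.Sprintf("publish failed: %v: %s", err, out))
	}
	fmt.Printf("%s", out)
}
```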
Eventually we might incorporate cryptographic techniques (e.g. SNARKs) to prove that the intended operations (and only the intended operations) were run on the snapshots, which would allow anyone to build the snapshots without corrupting the chain of custody. This will require some research. For now, it's overkill.
Note: one cool thing about using IPFS with this structure is that if you want to validate that someone actually ran the scripts they claim, you can just re-run the scripts from the same sources and compare the hashes of the results.
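A minimal sketch of that validation, assuming you've re-run the community-managed scripts into a local directory (the directory name and claimed hash here are placeholders):

```go
// verify.go -- a minimal sketch of chain-of-custody validation: re-add the
// rebuilt snapshot to IPFS and compare its hash to the published one. The
// claimed hash and directory name are placeholders.
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
)

// ipfsHash computes the IPFS hash for a path; --only-hash avoids writing
// anything to the local datastore, and -Q prints only the root hash.
func ipfsHash(path string) (string, error) {
	out, err := exec.Command("ipfs", "add", "-r", "-Q", "--only-hash", path).Output()
	return strings.TrimSpace(string(out)), err
}

func main() {
	claimed := "QmExampleClaimedHash" // the hash a committer published

	// "rebuilt-snapshot" is assumed to be the output of re-running the
	// community-managed scripts against the same Kiwix dump.
	got, err := ipfsHash("rebuilt-snapshot")
	if err != nil {
		panic(err)
	}
	if got != claimed {
		fmt.Println("MISMATCH: rebuilt", got, "but committer claimed", claimed)
		os.Exit(1)
	}
	fmt.Println("OK: hashes match; the claimed operations reproduce the snapshot")
}
```

This works because IPFS hashes are content-addressed: identical inputs plus identical operations yield identical hashes, so reproducibility is the audit.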
pinging @patcon and ~@meyerscr~ (edit, didn't need to ;)) to watch here
OK, we've started to make progress on this. Currently this just defaults to sending emails while we figure out how to connect the requests to a queue, but it's a start.
Live URL here: https://task-mgmt.archivers.space Repo here: https://github.com/archivers-space/task-mgmt
Note: you'll need write access to ipfs/distributed-wikipedia-mirror in order to access the page.
I've outlined some next steps in the repo readme. @flyingzumwalt, it might make sense to touch base sometime soon, specifically around the question of where the actual task execution is going to happen. If we need to build that, that's OK. In the meantime I still have lots to chew on.
The archivers app requesting full private repo access is a no-go for me, unfortunately. Many platforms allow public repo access first, with a separate upgrade to private repo access when the need arises.
Is archivers requesting access? I thought it was just using the GH OAuth response to know if the user has write access to this repo -- so you need write permission in the GH repo in order to manage stuff in archivers. That lets us set it up so that anyone who can modify this repo can also manage things in archivers, like kicking off building a new snapshot. The actual submission of new content from archivers, or from the workers it runs, will be done via PRs, which does not require write access to this repo.
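For reference, checking write access shouldn't require private-repo scope at all. Here's a minimal sketch against GitHub's collaborator-permission endpoint; the token source and username are assumptions for the example, not how archivers actually does it:

```go
// permcheck.go -- a minimal sketch of the OAuth gate described above: ask the
// GitHub API whether a user has write access to this repo. Only needs a token
// that can read the (public) repo; no private-repo scope required. The token
// source and username are assumptions for the example.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"os"
)

func hasWriteAccess(token, user string) (bool, error) {
	url := fmt.Sprintf(
		"https://api.github.com/repos/ipfs/distributed-wikipedia-mirror/collaborators/%s/permission",
		user)
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		return false, err
	}
	req.Header.Set("Authorization", "token "+token)

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()

	var body struct {
		Permission string `json:"permission"` // "admin", "write", "read", or "none"
	}
	if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
		return false, err
	}
	return body.Permission == "admin" || body.Permission == "write", nil
}

func main() {
	ok, err := hasWriteAccess(os.Getenv("GITHUB_TOKEN"), "someuser")
	if err != nil {
		panic(err)
	}
	fmt.Println("write access:", ok)
}
```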
The management page does: https://task-mgmt.archivers.space, if you try to log in with GH.

aha. yeah we have to change that.
Oh yes, completely agreed. I'll drop the permissions ask and will report back once the change is up.
Ok, change is now live. App shouldn't request access to private repos.
Update: @b5 is making amazing progress building a robust and reusable solution for our data-control needs: https://github.com/datatogether/task-mgmt/pull/4