distributed-wikipedia-mirror
Establish a data control plan
We should outline the control plan (hand over to Wikipedia itself, etc.).
Though we hope that Wikipedia will take this under their wing sometime, we should not assume that they will. Based on that, we're setting up a community-based model for managing the generation of snapshots from Kiwix dumps. This is one of the first tests of the model that evolved out of the Data Rescue hackathons in early 2017 -- where communities of hackers, content specialists and do-gooders work together to manage the work of pulling data off of centralized servers and redistributing it.
To apply this model we're partnering with @b5 from http://www.qri.io/, who did a lot of the technical work behind the Data Rescue hackathons. Many other people like @dcwalk, @titaniumbones, @mayaad, @trinberg and @abergman contributed to the evolution of this model.
The Process
Key elements of this process:
- Embrace community contributions with an open model of community governance. In short, use GitHub and PRs to manage everything. Actively embrace contributions by community members, give them a voice in governance of the code, and provide a clear definition of the requirements to become a committer.
- Use code to automate repeatable tasks: rather than having lots of people write one-off scripts and run them once, put that energy into building and maintaining reusable scripts (see the sketch after this list).
- Be careful about provenance and chain of custody: it's important to be clear exactly where the snapshots came from and exactly what was done to them. To enforce this, we have to be careful about who runs the scripts and how they run them.
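To make the "reusable scripts" idea concrete, here's a minimal sketch of what one such script could look like. This is not the project's actual tooling: it assumes a local `ipfs` binary on the PATH, and the dump URL, file names, and the shape of the provenance record are all illustrative assumptions.

```go
// snapshot.go -- a minimal sketch of a reusable snapshot script, NOT the
// project's actual tooling. Assumes a local `ipfs` binary on the PATH; the
// dump URL and file names are illustrative.
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"os"
	"os/exec"
	"strings"
	"time"
)

// provenance records exactly where a snapshot came from and what was done to it.
type provenance struct {
	SourceURL string    `json:"sourceUrl"`
	SHA256    string    `json:"sha256"`
	IPFSHash  string    `json:"ipfsHash"`
	FetchedAt time.Time `json:"fetchedAt"`
}

func main() {
	// Hypothetical dump URL; a real run would read this from reviewed config.
	src := "http://download.kiwix.org/zim/wikipedia/example.zim"

	// 1. Fetch the dump, hashing it as it's written to disk.
	resp, err := http.Get(src)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	out, err := os.Create("dump.zim")
	if err != nil {
		panic(err)
	}
	h := sha256.New()
	if _, err := io.Copy(io.MultiWriter(out, h), resp.Body); err != nil {
		panic(err)
	}
	out.Close()

	// 2. Add the dump to IPFS; -Q prints only the final hash.
	raw, err := exec.Command("ipfs", "add", "-Q", "dump.zim").Output()
	if err != nil {
		panic(err)
	}
	hash := strings.TrimSpace(string(raw))

	// 3. Write a provenance record alongside the snapshot.
	f, err := os.Create("provenance.json")
	if err != nil {
		panic(err)
	}
	defer f.Close()
	rec := provenance{
		SourceURL: src,
		SHA256:    hex.EncodeToString(h.Sum(nil)),
		IPFSHash:  hash,
		FetchedAt: time.Now().UTC(),
	}
	if err := json.NewEncoder(f).Encode(rec); err != nil {
		panic(err)
	}
	fmt.Println("snapshot:", hash)
}
```

Because a script like this is versioned in the repo and reviewed via PRs, the community can see exactly what transformations a snapshot went through, which is what makes the provenance story credible.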
Balancing Open Community with Careful Chain of Custody
It may seem like the open community model is at odds with maintaining a clear chain of custody when processing the snapshots. Here's how we will balance the two:
Open community contributions (via GitHub Pull Requests, etc.) wherever possible, including:
- maintaining the scripts that pull dumps from Kiwix
- maintaining any scripts that modify snapshots and write them to IPFS
- nominating new language variants to be added as snapshots
- deciding when to run new snapshots
- maintaining the Docker container that is used to run these scripts

All of this with an open governance model around who can become a committer on the repo, etc.
Meanwhile a smaller group of committers will handle:
- running the scripts, using the community-managed Docker image, to generate new snapshots
- publishing updates to the IPNS entries (a sketch of this step follows below)
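For illustration only, the publish step might look like the sketch below. The IPNS key name `wikipedia-tr` and the snapshot hash are made-up examples; it assumes the key already exists in the committer's local `ipfs` keystore.

```go
// publish.go -- a minimal sketch of the committer-only publish step. The IPNS
// key name "wikipedia-tr" and the snapshot hash are made-up examples; assumes
// the key already exists in the local `ipfs` keystore.
package main

import (
	"fmt"
	"os/exec"
)

func main() {
	snapshotHash := "QmExampleSnapshotHash" // output of the community-managed build scripts

	// `ipfs name publish --key=<name> /ipfs/<hash>` points the key's IPNS name
	// at the new snapshot; only holders of the private key can do this, which
	// is what keeps the publish step in the hands of the smaller committer group.
	out, err := exec.Command("ipfs", "name", "publish",
		"--key=wikipedia-tr", "/ipfs/"+snapshotHash).CombinedOutput()
	if err != nil {
		panic(fmt.Sprintf("publish failed: %v: %s", err, out))
	}
	fmt.Printf("%s", out)
}
```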
Eventually we might incorporate cryptographic techniques (e.g. SNARKs) to prove that the intended operations (and only the intended operations) were run on the snapshots, which would allow anyone to build the snapshots without corrupting the chain of custody. This will require some research. For now, it's overkill.
Note: one cool thing about using IPFS with this structure is that if you want to validate that someone actually ran the scripts they claim, you can just re-run the scripts from the same sources and compare the hashes of the results.
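A minimal sketch of that validation, assuming you've re-run the community-managed scripts into a local directory (the directory name and claimed hash here are placeholders):

```go
// verify.go -- a minimal sketch of chain-of-custody validation: re-add the
// rebuilt snapshot to IPFS and compare its hash to the published one. The
// claimed hash and directory name are placeholders.
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
)

// ipfsHash computes the IPFS hash for a path; --only-hash avoids writing
// anything to the local datastore, and -Q prints only the root hash.
func ipfsHash(path string) (string, error) {
	out, err := exec.Command("ipfs", "add", "-r", "-Q", "--only-hash", path).Output()
	return strings.TrimSpace(string(out)), err
}

func main() {
	claimed := "QmExampleClaimedHash" // the hash a committer published

	// "rebuilt-snapshot" is assumed to be the output of re-running the
	// community-managed scripts against the same Kiwix dump.
	got, err := ipfsHash("rebuilt-snapshot")
	if err != nil {
		panic(err)
	}
	if got != claimed {
		fmt.Println("MISMATCH: rebuilt", got, "but committer claimed", claimed)
		os.Exit(1)
	}
	fmt.Println("OK: hashes match; the claimed operations reproduce the snapshot")
}
```

This works because IPFS hashes are content-addressed: identical inputs plus identical operations yield identical hashes, so reproducibility is the audit.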
pinging @patcon and ~@meyerscr~ (edit, didn't need to ;)) to watch here
OK, we've started to make progress on this. Currently this just defaults to sending emails while we figure out how to connect the requests to a queue, but it's a start.
Live URL here: https://task-mgmt.archivers.space Repo here: https://github.com/archivers-space/task-mgmt
Note: you'll need write access to ipfs/distributed-wikipedia-mirror in order to access the page.
I've outlined some next steps in the repo readme. @flyingzumwalt, it might make sense to touch base sometime soon, specifically around the question of where the actual task execution is going to happen. If we need to build that, that's OK. In the meantime I still have lots to chew on.
The archivers app requesting full private repo access is a no-go for me, unfortunately. Many platforms allow public repo access first, with a separate upgrade to private repo access when the need arises.
Is archivers requesting access? I thought it was just using the GH OAuth response to know if the user has write access to this repo -- so you need write permission in the GH repo in order to manage stuff in archivers. That lets us set it up so that anyone who can modify this repo can also manage things in archivers, like kicking off building a new snapshot. The actual submission of new content from archivers, or from the workers it runs, will be done via PRs, which does not require write access to this repo.
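For reference, checking write access shouldn't require private-repo scope at all. Here's a minimal sketch against GitHub's collaborator-permission endpoint; the token source and username are assumptions for the example, not how archivers actually does it:

```go
// permcheck.go -- a minimal sketch of the OAuth gate described above: ask the
// GitHub API whether a user has write access to this repo. Only needs a token
// that can read the (public) repo; no private-repo scope required. The token
// source and username are assumptions for the example.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"os"
)

func hasWriteAccess(token, user string) (bool, error) {
	url := fmt.Sprintf(
		"https://api.github.com/repos/ipfs/distributed-wikipedia-mirror/collaborators/%s/permission",
		user)
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		return false, err
	}
	req.Header.Set("Authorization", "token "+token)

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()

	var body struct {
		Permission string `json:"permission"` // "admin", "write", "read", or "none"
	}
	if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
		return false, err
	}
	return body.Permission == "admin" || body.Permission == "write", nil
}

func main() {
	ok, err := hasWriteAccess(os.Getenv("GITHUB_TOKEN"), "someuser")
	if err != nil {
		panic(err)
	}
	fmt.Println("write access:", ok)
}
```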
The management page does: https://task-mgmt.archivers.space, if you try to log in with GH.

aha. yeah we have to change that.
Oh yes, completely agreed. I'll drop the permissions ask and will report back once the change is up.
Ok, change is now live. App shouldn't request access to private repos.
Update: @b5 is making amazing progress building a robust and reusable solution for our data-control needs: https://github.com/datatogether/task-mgmt/pull/4