data-stories-scripts
Data Story: Generate datasets from Wikipedia dumps
Content
- [x] Parse out all tables
- [ ] Trim garbage tables
- [ ] Dataset/tables: table title, row count, column count, url, page title, hit_count, id, revision.timestamp
- [ ] Multiple datasets: generate datasets for top tables by hits over XX rows
- [ ] Multiple datasets: generate datasets for hand-picked useful tables
- [ ] Transform script to generate a page from raw MediaWiki markup or single-page HTML
Story? Maybe the tables available and how they were generated, plus a histogram of table sizes? A sketch of the extraction step is below.
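A rough sketch of that extraction step, working from a single page's rendered HTML. The garbage-trimming heuristics, the id scheme, and the function name are illustrative assumptions, not the repo's actual script:

```python
# Sketch: extract every <table> from one rendered wiki page and emit the
# metadata columns listed above. Garbage rules here are placeholders.
from bs4 import BeautifulSoup

def extract_tables(page_html, page_title, url, hit_count, revision_timestamp):
    """Yield one metadata record per non-garbage table in a page's HTML."""
    soup = BeautifulSoup(page_html, "html.parser")
    for i, table in enumerate(soup.find_all("table")):
        rows = table.find_all("tr")
        row_count = len(rows)
        column_count = max((len(r.find_all(["td", "th"])) for r in rows), default=0)
        # "Trim garbage tables": drop navboxes and trivially small tables
        # (assumed cutoffs -- the real rules would need tuning).
        if row_count < 2 or column_count < 2 or "navbox" in (table.get("class") or []):
            continue
        caption = table.find("caption")
        yield {
            "table_title": caption.get_text(strip=True) if caption else "",
            "row_count": row_count,
            "column_count": column_count,
            "url": url,
            "page_title": page_title,
            "hit_count": hit_count,
            "id": f"{page_title}#{i}",  # assumed id scheme
            "revision.timestamp": revision_timestamp,
        }
```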
Probably out of scope
- [ ] Possible longitudinal tables, e.g. every "Directed By:" from the infobox table of every film page
Pageviews
These are available via an API, but I'm not going to hit the API for every page on Wikipedia; they need to be available as a dataset. A sketch of the aggregation step follows this list.
- [x] Aggregate and subsample
- [x] Trim garbage pages (Special:, Main_Page...)
- [x] Script to generate
- [ ] Dataset for relative hits for all pages, month of 202003.
No story, not at this time. This data is only used for the hit_count column in the third checkbox under Content.
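A minimal sketch of the aggregate-and-subsample step, assuming the public hourly pageview dump format (space-separated `domain_code page_title count_views total_response_size`, e.g. `en Main_Page 242332 0`). The filename pattern, garbage-prefix list, and top-N cutoff are placeholders:

```python
# Sketch: aggregate hourly pageview dump files for 2020-03 into one
# per-page hit_count table, trimming garbage pages and subsampling.
import glob
import gzip
from collections import Counter

# Assumed garbage list -- "Trim garbage pages (Special:, Main_Page...)"
GARBAGE_PREFIXES = ("Special:", "Main_Page", "Talk:", "File:", "User:")

def aggregate_hits(pattern="pageviews-202003*-*.gz", domain="en"):
    hits = Counter()
    for path in glob.glob(pattern):
        with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
            for line in f:
                parts = line.split(" ")
                if len(parts) != 4 or parts[0] != domain:
                    continue
                title, views = parts[1], parts[2]
                if title.startswith(GARBAGE_PREFIXES):
                    continue
                hits[title] += int(views)
    return hits

if __name__ == "__main__":
    hits = aggregate_hits()
    # Subsample: keep only the top pages (cutoff assumed) so the
    # hit_count dataset stays a manageable size.
    for title, count in hits.most_common(100_000):
        print(f"{title}\t{count}")
```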