Data Story: Generate datasets from Wikipedia dumps

Open feep opened this issue 4 years ago • 2 comments

Content

  • [x] Parse out all tables
  • [ ] Trim garbage tables
  • [ ] Dataset/tables: table title, row count, column count, url, page title, hit_count, id, revision.timestamp
  • [ ] Multiple datasets: generate datasets for top tables by hits over XX rows
  • [ ] Multiple datasets: generate datasets for hand-picked useful tables
  • [ ] Transform script to generate page from raw mediawiki markup or single page html
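
A minimal sketch of how the table parsing and the per-table metadata (table title, row count, column count, page title) could look. It is not the script used for this issue. It assumes a page's raw MediaWiki markup is already available as a string, handles only simple non-nested `{| ... |}` tables, and uses hypothetical names (`parse_table`, `table_metadata`). Fields such as url, hit_count, id and revision.timestamp would have to come from the dump and pageview data and are left out.

```python
import re

# Matches a simple wikitable block: a line starting with "{|" up to a line
# starting with "|}". Assumption: nested tables are not handled here.
TABLE_RE = re.compile(r"^\{\|.*?^\|\}", re.DOTALL | re.MULTILINE)


def parse_table(block):
    """Return (title, rows) for one wikitable block of raw MediaWiki markup."""
    title = ""
    rows = []
    current = []
    for line in block.splitlines()[1:]:  # skip the "{|" opening line
        line = line.strip()
        if line.startswith("|}"):        # end of table
            break
        if line.startswith("|+"):        # caption line
            title = line[2:].strip()
        elif line.startswith("|-"):      # row separator
            if current:
                rows.append(current)
            current = []
        elif line.startswith("!"):       # header cells, "!!" separated
            current.extend(c.strip() for c in line[1:].split("!!"))
        elif line.startswith("|"):       # data cells, "||" separated
            current.extend(c.strip() for c in line[1:].split("||"))
    if current:
        rows.append(current)
    return title, rows


def table_metadata(page_title, wikitext):
    """Build one metadata record per table found in a page's wikitext."""
    records = []
    for match in TABLE_RE.finditer(wikitext):
        title, rows = parse_table(match.group(0))
        records.append({
            "page_title": page_title,
            "table_title": title,
            # The header row, if present, is counted as a row in this sketch.
            "row_count": len(rows),
            "column_count": max((len(r) for r in rows), default=0),
        })
    return records
```

Records like these could then be sorted by hit_count and filtered by row_count to pick the top tables.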

Story? The tables available and how they were generated? Histogram with size of tables?

Probably out of scope

  • [ ] Possible longitudinal tables, e.g. every "Directed By:" value from the infobox table of every (film) page

Pageviews

These are available through an API, but I’m not going to hit the API for every page on Wikipedia. They need to be available as a dataset.

  • [x] Aggregate and subsample
  • [x] Trim garbage pages (Special:, Main_Page...)
  • [x] Script to generate
  • [ ] Dataset of relative hits for all pages for the month 2020-03 (202003).
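
A rough sketch of the aggregate-and-trim steps above, not the actual script. It assumes the pageview data has already been read into `(title, hits)` pairs, filters only the two garbage patterns named in the list (`Special:` and `Main_Page`), and interprets "relative hits" as each page's share of total hits. That interpretation and the names `is_garbage` and `relative_hits` are assumptions. Subsampling is left out.

```python
from collections import Counter

# Assumption: only the patterns named in the issue are filtered here;
# a real script would likely need a longer list.
GARBAGE_PREFIXES = ("Special:",)
GARBAGE_TITLES = {"Main_Page"}


def is_garbage(title):
    """True for pages that should not count towards hit totals."""
    return title in GARBAGE_TITLES or title.startswith(GARBAGE_PREFIXES)


def relative_hits(records):
    """Sum hits per page title, drop garbage pages, and normalise to shares."""
    totals = Counter()
    for title, hits in records:
        if not is_garbage(title):
            totals[title] += hits
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {title: hits / grand_total for title, hits in totals.items()}
```

The resulting mapping could supply the hit_count column for the table metadata, keyed by page title.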

No story at this time. The data is only used for the hit_count column in the third checkbox under Content.

feep, Apr 07 '20 13:04

@rgardaphe

Testing your @ settings. You get this?

feep, Apr 22 '20 17:04

YEP!

rgardaphe, Apr 22 '20 19:04