data-stories-scripts
Data Story: Generate datasets from Wikipedia dumps
Content
- [x] Parse out all tables
- [ ] Trim garbage tables
- [ ] Dataset/tables: table title, row count, column count, url, page title, hit_count, id, revision.timestamp
- [ ] Multiple datasets: generate datasets for top tables by hits over XX rows
- [ ] Multiple datasets: generate datasets for hand-picked useful tables
- [ ] Transform script to generate a page from raw MediaWiki markup or single-page HTML
Story? Maybe the tables available and how they were generated, plus a histogram of table sizes? A sketch of the extraction step is below.
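A rough sketch of that extraction step, working from a single page's rendered HTML. The garbage-trimming heuristics, the id scheme, and the function name are illustrative assumptions, not the repo's actual script:

```python
# Sketch: extract every <table> from one rendered wiki page and emit the
# metadata columns listed above. Garbage rules here are placeholders.
from bs4 import BeautifulSoup

def extract_tables(page_html, page_title, url, hit_count, revision_timestamp):
    """Yield one metadata record per non-garbage table in a page's HTML."""
    soup = BeautifulSoup(page_html, "html.parser")
    for i, table in enumerate(soup.find_all("table")):
        rows = table.find_all("tr")
        row_count = len(rows)
        column_count = max((len(r.find_all(["td", "th"])) for r in rows), default=0)
        # "Trim garbage tables": drop navboxes and trivially small tables
        # (assumed cutoffs -- the real rules would need tuning).
        if row_count < 2 or column_count < 2 or "navbox" in (table.get("class") or []):
            continue
        caption = table.find("caption")
        yield {
            "table_title": caption.get_text(strip=True) if caption else "",
            "row_count": row_count,
            "column_count": column_count,
            "url": url,
            "page_title": page_title,
            "hit_count": hit_count,
            "id": f"{page_title}#{i}",  # assumed id scheme
            "revision.timestamp": revision_timestamp,
        }
```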
Probably out of scope
- [ ] Possible longitudinal tables, e.g. every "Directed By:" from the infobox table of every film page
Pageviews
These are available via an API, but I'm not going to hit the API for every page on Wikipedia; they need to be available as a dataset. A sketch of the aggregation step follows this list.
- [x] Aggregate and subsample
- [x] Trim garbage pages (Special:, Main_Page...)
- [x] Script to generate
- [ ] Dataset for relative hits for all pages, month of 202003.
No story, not at this time. This data is only used for the hit_count column in the third checkbox under Content.
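A minimal sketch of the aggregate-and-subsample step, assuming the public hourly pageview dump format (space-separated `domain_code page_title count_views total_response_size`, e.g. `en Main_Page 242332 0`). The filename pattern, garbage-prefix list, and top-N cutoff are placeholders:

```python
# Sketch: aggregate hourly pageview dump files for 2020-03 into one
# per-page hit_count table, trimming garbage pages and subsampling.
import glob
import gzip
from collections import Counter

# Assumed garbage list -- "Trim garbage pages (Special:, Main_Page...)"
GARBAGE_PREFIXES = ("Special:", "Main_Page", "Talk:", "File:", "User:")

def aggregate_hits(pattern="pageviews-202003*-*.gz", domain="en"):
    hits = Counter()
    for path in glob.glob(pattern):
        with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
            for line in f:
                parts = line.split(" ")
                if len(parts) != 4 or parts[0] != domain:
                    continue
                title, views = parts[1], parts[2]
                if title.startswith(GARBAGE_PREFIXES):
                    continue
                hits[title] += int(views)
    return hits

if __name__ == "__main__":
    hits = aggregate_hits()
    # Subsample: keep only the top pages (cutoff assumed) so the
    # hit_count dataset stays a manageable size.
    for title, count in hits.most_common(100_000):
        print(f"{title}\t{count}")
```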