Open evidence and Wiki scraper

Open D0ugins opened this issue 3 years ago • 3 comments

Downloads round data and open source documents/cites from the debate wikis, and also downloads files from openev. Uses XWiki's REST API to pull the data. The main limiting factor is the server's response speed, although a full run should only take a day or two. In total there are around 320k rounds across the wikis, roughly half of which have open source documents, plus around 10k openev files.

Todo:

  • Implement adding new rounds as they are created. This takes roughly one request per new round, so it could just run once per day or so.
  • Add tags to downloaded files.
  • Maybe add some parsing of round reports and/or cites, e.g. extract links from cites and try to split round reports by speech.
  • Better error handling in the parser for weird formats.
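
As a rough illustration of the "extract links from cites" idea in the list above, here is a minimal sketch. It assumes a cite is available as a plain string; `extractLinks` is a hypothetical helper name, not something that exists in the repo, and the URL pattern is a simple heuristic rather than a full URL parser.

```typescript
// Hypothetical helper (not part of the repo): pull http(s) URLs out of
// free-form cite text. Assumes the cite has already been loaded as a string.
function extractLinks(cite: string): string[] {
  // Match http/https URLs up to whitespace or a common closing delimiter.
  const urlPattern = /https?:\/\/[^\s<>"')\]]+/g;
  const matches: string[] = cite.match(urlPattern) || [];

  // Drop sentence punctuation that often trails a URL in prose.
  const cleaned = matches.map((url) => url.replace(/[.,;:!?]+$/, ""));

  // Remove duplicates while keeping first-seen order.
  return cleaned.filter((url, index) => cleaned.indexOf(url) === index);
}

const sampleCite =
  "Smith 21 https://example.com/a. See also (https://example.org/b), https://example.com/a";
console.log(extractLinks(sampleCite));
```

Cites in odd formats would likely need more than a regex, which ties into the error-handling item.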

D0ugins avatar Feb 18 '22 21:02 D0ugins

Trying to run this, but the application seems to hang (I think while trying to load spaceData?). Any idea what's up?

arvind-balaji avatar Jun 25 '22 06:06 arvind-balaji

Sorry, should have clarified. Loading the list of rounds to download takes a long time (something like 30 minutes, iirc). If you want to load data more quickly for testing, you can add a .slice(0, 2) or .slice(0, 1) on these two lines so that you only load the full data for a few of the wikis: https://github.com/arvind-balaji/debate-cards/blob/e401edee268797b5afb22bcf6b9ff349e9e5eac4/src/lib/debate-tools/wiki.ts#L76-L78 https://github.com/arvind-balaji/debate-cards/blob/e401edee268797b5afb22bcf6b9ff349e9e5eac4/src/lib/debate-tools/wiki.ts#L85 In the future it would probably be a good idea to add some way of configuring which wikis to load.
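
One possible shape for that configuration, as a sketch only: `selectWikis`, `WikiSelection`, and the `name` field are all hypothetical and do not reflect the actual code in wiki.ts. The idea is to filter by wiki name and/or cap the count, with the cap doing the same job as the `.slice(0, 2)` workaround.

```typescript
// Hypothetical selection options (not part of the repo).
interface WikiSelection {
  names?: string[]; // only load wikis with these names
  limit?: number; // load at most this many wikis
}

// Assumes each wiki entry exposes a `name`; the real data shape may differ.
function selectWikis<T extends { name: string }>(
  wikis: T[],
  selection: WikiSelection = {}
): T[] {
  let selected = wikis;

  const names = selection.names;
  if (names !== undefined && names.length > 0) {
    selected = selected.filter((wiki) => names.indexOf(wiki.name) !== -1);
  }

  // Equivalent to the manual .slice(0, n) used for quick testing.
  if (selection.limit !== undefined) {
    selected = selected.slice(0, selection.limit);
  }

  return selected;
}

// Placeholder wiki names for illustration only.
const allWikis = [{ name: "wiki-a" }, { name: "wiki-b" }, { name: "wiki-c" }];
console.log(selectWikis(allWikis, { limit: 2 }));
console.log(selectWikis(allWikis, { names: ["wiki-c"] }));
```

With no options passed, every wiki is loaded, so the default behavior would stay the same.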

D0ugins avatar Jun 25 '22 17:06 D0ugins

The wiki was just updated and the API overhauled, and the terms now also ban bulk downloads of data. I have a dump of most of the relevant data, though.

D0ugins avatar Jul 23 '22 03:07 D0ugins