
Spin off scrapers into their own package

Open • stijn-uva opened this issue 4 years ago • 1 comment

4chan is now the odd one out: it is the only one of the many data sources that has its own scraper. This works fine, but it might make more sense to spin the scraper off into its own package. That would also make it easier to separate the data store from the analytical part of 4CAT, and decoupling them would protect the scraper from crashes originating in the rest of 4CAT.

stijn-uva • Nov 25 '19 09:11

Current plan: a separate tool/package that collects data, stores it in MongoDB exactly as scraped, indexes it with Elasticsearch, and makes it available to 4CAT through a lightweight API that returns the full documents matching a given Elasticsearch query.

stijn-uva • May 03 '22 11:05
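
For illustration only, here is a rough sketch of how the store/index/API split in the plan above might fit together. It is not taken from the 4CAT code base: `InMemoryStore` and `InMemoryIndex` are hypothetical stand-ins for MongoDB and Elasticsearch so the example runs without either service, and `CollectorAPI`, `ingest`, `query`, and the sample documents are likewise made-up names and data.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryIndex:
    """Hypothetical stand-in for Elasticsearch: maps document IDs to searchable text."""
    texts: dict = field(default_factory=dict)

    def add(self, doc_id, text):
        self.texts[doc_id] = text.lower()

    def search(self, term):
        # Naive substring match; a real index would run a proper query.
        term = term.lower()
        return [doc_id for doc_id, text in self.texts.items() if term in text]


@dataclass
class InMemoryStore:
    """Hypothetical stand-in for MongoDB: keeps documents exactly as scraped."""
    documents: dict = field(default_factory=dict)

    def save(self, doc_id, raw_document):
        self.documents[doc_id] = raw_document

    def get_many(self, doc_ids):
        return [self.documents[i] for i in doc_ids if i in self.documents]


class CollectorAPI:
    """Hypothetical lightweight API that 4CAT would talk to."""

    def __init__(self, store, index):
        self.store = store
        self.index = index

    def ingest(self, doc_id, raw_document, text_field="body"):
        # Store the document untouched, then index only its searchable text.
        self.store.save(doc_id, raw_document)
        self.index.add(doc_id, raw_document.get(text_field, ""))

    def query(self, term):
        # The index returns matching IDs; the store returns the full documents.
        return self.store.get_many(self.index.search(term))


# Made-up sample data, only to show the round trip.
api = CollectorAPI(InMemoryStore(), InMemoryIndex())
api.ingest(1, {"id": 1, "board": "a", "body": "Scrapers should run on their own"})
api.ingest(2, {"id": 2, "board": "v", "body": "Unrelated post"})
results = api.query("scrapers")
```

The point of the split is that the analytical side only ever calls `query`, so the collector can keep ingesting even when the consumer crashes or restarts.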