baleen
An automated ingestion service for blogs to construct a corpus for NLP research.
# Issue
When specifying an RSS feed path with special characters such as `&`, `baleen` fails to find posts. I have confirmed that this can be corrected by manually escaping...
# Issue
Exporting to a directory such as `corpora/` with `bin/baleen export corpora` results in an error like: `[Errno 2] No such file or directory: 'corpora/corpora/cooking/5b2d180b7af8b43e439b59b0.json'` This is a path...
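The cause is not visible in the truncated description above. One common way a doubled prefix like `corpora/corpora/...` arises is joining the export root onto a path that already contains it. The sketch below reproduces that pattern and shows a version that joins the root exactly once. The function name `export_path` and the file name `post.json` are hypothetical and not taken from the baleen code base.

```python
import os


def export_path(root, category, filename):
    """Build an export path, joining the root directory exactly once."""
    return os.path.join(root, category, filename)


root = "corpora"

# Hypothetical reproduction: the category directory already carries the root,
# and the root is then joined on a second time.
already_prefixed = os.path.join(root, "cooking")
buggy = os.path.join(root, already_prefixed, "post.json")

# Joining the root once gives the intended location.
fixed = export_path(root, "cooking", "post.json")
```

Comparing `buggy` and `fixed` shows the extra path component that would produce the `No such file or directory` error when the nested directory was never created.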
`console/commands/load` can handle OPML files. I don't have an OPML file and couldn't easily find an OPML editor; CSV, however, is easy to compose. Add support for loading feeds from CSV.
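A minimal sketch of what a CSV loader could look like, using only the standard library. The column names `title`, `link`, and `category` and the function name `load_feeds_from_csv` are assumptions for illustration; the actual fields expected by baleen's feed model may differ.

```python
import csv
import io


def load_feeds_from_csv(stream):
    """Yield one dict per feed from a CSV stream that has a header row.

    Assumed (hypothetical) columns: title, link, category.
    Rows without a link are skipped.
    """
    reader = csv.DictReader(stream)
    for row in reader:
        link = (row.get("link") or "").strip()
        if not link:
            continue
        yield {
            "title": (row.get("title") or "").strip(),
            "link": link,
            "category": (row.get("category") or "").strip(),
        }


# Example input; the second data row has no link and is ignored.
sample = io.StringIO(
    "title,link,category\n"
    "Example Blog,http://example.com/feed?a=1&b=2,cooking\n"
    ",,\n"
)
feeds = list(load_feeds_from_csv(sample))
```

Passing an open file object instead of `io.StringIO` works the same way, provided the file is opened in text mode with `newline=""` as the `csv` module recommends.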
According to these numbers: http://bbengfort.github.io/observations/2017/06/07/compression-benchmarks.html, we can achieve much better export results if we gzip each file individually as we export it. This should help our export and admin process...
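As a rough sketch of per-file compression, each exported document could be written through `gzip.open` instead of a plain `open`. The helper names `write_gzipped_json` and `read_gzipped_json` and the `.gz` suffix convention are assumptions, not part of the existing exporter.

```python
import gzip
import json
import os
import tempfile


def write_gzipped_json(path, document):
    """Write one document as gzip-compressed JSON and return the new path."""
    gz_path = path + ".gz"
    with gzip.open(gz_path, "wt", encoding="utf-8") as f:
        json.dump(document, f)
    return gz_path


def read_gzipped_json(gz_path):
    """Read back a document written by write_gzipped_json."""
    with gzip.open(gz_path, "rt", encoding="utf-8") as f:
        return json.load(f)


# Round trip in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    doc = {"title": "Example post", "content": "text " * 100}
    out = write_gzipped_json(os.path.join(tmp, "post.json"), doc)
    out_name = os.path.basename(out)
    restored = read_gzipped_json(out)
```

Compressing each file on its own keeps individual posts readable without unpacking the whole corpus, at the cost of a slightly worse ratio than one large archive.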
`export.py`'s `--scheme` argument accepts `json` and `html`, as well as the sanitize levels `raw`, `safe`, and `text`. Move the sanitize levels to their own argument and ensure they get passed in properly...
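One way to separate the two concerns is to give each its own `argparse` option with its own `choices`. In this sketch the option name `--sanitize`, the defaults, and the positional `location` argument are assumptions; only the accepted values come from the issue text.

```python
import argparse


def build_parser():
    """Build a parser with the output scheme and sanitize level split apart."""
    parser = argparse.ArgumentParser(prog="export")
    parser.add_argument(
        "--scheme", choices=("json", "html"), default="json",
        help="output format of the exported files (default is an assumption)",
    )
    parser.add_argument(
        "--sanitize", choices=("raw", "safe", "text"), default="safe",
        help="hypothetical option name; level of HTML sanitization to apply",
    )
    parser.add_argument("location", help="directory to export the corpus to")
    return parser


args = build_parser().parse_args(
    ["--scheme", "html", "--sanitize", "text", "corpora"]
)
```

With the values split this way, `argparse` rejects a sanitize level passed to `--scheme` instead of silently accepting it.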
So the goal was to avoid a duplicate fetch of a post object that is already in Mongo. Alas, even if the post object is in Mongo, we might have fetched...
Update the project to be compatible with Python 3.5 so we have the option to use asyncio.
Acceptance criteria:
- Ability to configure how often ingestion runs

The current run frequency is hard-coded to every hour: https://github.com/DistrictDataLabs/baleen/blob/master/baleen/console/commands/run.py#L51 We'll need to add a configuration option...
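A minimal sketch of a configurable interval, falling back to the current one-hour behaviour when nothing is set. Reading from an environment variable, the variable name `BALEEN_INGEST_INTERVAL`, and the function name `ingestion_interval` are all assumptions; the project may prefer its existing configuration mechanism instead.

```python
import os

# One hour, matching the hard-coded frequency described in the issue.
DEFAULT_INTERVAL_SECONDS = 3600


def ingestion_interval(environ=None):
    """Return the ingestion interval in seconds.

    Reads the hypothetical BALEEN_INGEST_INTERVAL variable and falls back
    to one hour when it is not set. Non-positive values are rejected.
    """
    environ = os.environ if environ is None else environ
    raw = environ.get("BALEEN_INGEST_INTERVAL")
    if raw is None:
        return DEFAULT_INTERVAL_SECONDS
    seconds = int(raw)
    if seconds <= 0:
        raise ValueError("ingestion interval must be a positive number of seconds")
    return seconds


default_interval = ingestion_interval({})
custom_interval = ingestion_interval({"BALEEN_INGEST_INTERVAL": "900"})
```

Accepting the mapping as a parameter keeps the function easy to test without touching the real process environment.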
Add some interesting examples of EDA of the Baleen corpus export to add to documentation.
Write tests that make clear which Feed attributes can change from request to request and which cannot. It would also be nice to have tests for mandatory feed attributes.