baleen

An automated ingestion service for blogs to construct a corpus for NLP research.

Results 22 baleen issues

# Issue When specifying an RSS feed path with special characters such as `&`, `baleen` fails to find posts. I have confirmed that this can be corrected by manually escaping...

# Issue Exporting to a directory such as `corpora/` with `bin/baleen export corpora` results in an error like: `[Errno 2] No such file or directory: 'corpora/corpora/cooking/5b2d180b7af8b43e439b59b0.json'` This is a path...

console/commands/load can handle OPML files. I don't have OPML, and couldn't easily find an OPML editor. CSV is easy to compose, however. Add support for loading feeds from CSV.
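A minimal sketch of what reading feeds from CSV could look like, using only the standard library. The column names `title`, `link`, and `category` and the helper `read_feeds_csv` are assumptions made for illustration; they are not taken from the baleen code base, and the real load command may expect different fields.

```python
import csv
import io


def read_feeds_csv(handle):
    """Yield one dict per feed from a CSV stream that has a header row.

    The column names (title, link, category) are hypothetical.
    """
    reader = csv.DictReader(handle)
    for row in reader:
        link = (row.get("link") or "").strip()
        if not link:
            # Skip rows without a feed URL rather than failing the whole load.
            continue
        yield {
            "title": (row.get("title") or "").strip(),
            "link": link,
            "category": (row.get("category") or "").strip(),
        }


# Usage with an in-memory file; a real command would open a path instead.
sample = io.StringIO(
    "title,link,category\n"
    "Example Blog,https://example.com/feed.xml,cooking\n"
    ",,\n"
)
feeds = list(read_feeds_csv(sample))
```

Each yielded dict could then be handed to whatever code currently stores feeds parsed from OPML.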

According to these numbers (http://bbengfort.github.io/observations/2017/06/07/compression-benchmarks.html), we can achieve much better export results if we gzip each file individually as we export it. This should help our export and admin process...

type: feature
priority: high
intermediate
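As a rough illustration of the per-file gzip idea, the sketch below writes one JSON document to its own `.gz` file and reads it back. The helper names `write_gzipped_json` and `read_gzipped_json` are hypothetical and not part of baleen's exporter; only the standard `gzip` and `json` modules are used.

```python
import gzip
import json
import os
import tempfile


def write_gzipped_json(doc, path):
    """Serialize doc as JSON into path + '.gz' and return the new path."""
    gz_path = path + ".gz"
    # Text mode ("wt") lets json.dump write str data straight into the gzip stream.
    with gzip.open(gz_path, "wt", encoding="utf-8") as f:
        json.dump(doc, f)
    return gz_path


def read_gzipped_json(gz_path):
    """Load a JSON document back from a gzip-compressed file."""
    with gzip.open(gz_path, "rt", encoding="utf-8") as f:
        return json.load(f)


# Round trip in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    post = {"title": "A post", "content": "text " * 200}
    out = write_gzipped_json(post, os.path.join(tmp, "post.json"))
    restored = read_gzipped_json(out)
```

Compressing each file on its own keeps every document independently readable, at the cost of a small per-file gzip header.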

export.py's `--scheme` argument accepts json and html, as well as sanitize levels raw, safe, and text. Move sanitize levels to their own argument and ensure they get passed in properly...
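One way the split could look with `argparse` is sketched below. The flag name `--sanitize`, the defaults, and the positional `location` argument are assumptions for illustration; the issue only says that the sanitize levels should move to their own argument.

```python
import argparse


def build_parser():
    """Build a parser where output format and sanitize level are separate options."""
    parser = argparse.ArgumentParser(prog="export")
    parser.add_argument(
        "--scheme",
        choices=("json", "html"),
        default="json",  # assumed default
        help="output format of the exported documents",
    )
    parser.add_argument(
        "--sanitize",  # hypothetical flag name
        choices=("raw", "safe", "text"),
        default="safe",  # assumed default
        help="how aggressively to clean the post content",
    )
    parser.add_argument("location", help="directory to export the corpus to")
    return parser


args = build_parser().parse_args(
    ["--scheme", "html", "--sanitize", "text", "corpora"]
)
```

With separate `choices` lists, `argparse` rejects a sanitize level passed as a scheme, so the two values cannot be confused before they reach the exporter.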

The goal was to avoid a duplicate fetch of a post object that is already in Mongo. Alas, even if the post object is in Mongo, we might have fetched...

Update the project to be compatible with Python 3.5 so we have the option to use asyncio.

type: technical debt
in progress
priority: high

Acceptance criteria: the ability to configure how often ingestion runs. The current run frequency is hard-coded to every hour: https://github.com/DistrictDataLabs/baleen/blob/master/baleen/console/commands/run.py#L51 We'll need to add a configuration option...

intermediate
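A small sketch of how the run interval could be made configurable, here by reading it from an environment variable and falling back to the hourly default. The variable name `BALEEN_INGEST_INTERVAL_MINUTES` and the function `ingest_interval_minutes` are hypothetical; baleen's actual configuration mechanism may differ.

```python
import os

# Fallback that matches the hourly schedule described in the issue.
DEFAULT_INTERVAL_MINUTES = 60


def ingest_interval_minutes(environ=os.environ):
    """Return the ingestion interval in minutes, using the default when unset."""
    raw = environ.get("BALEEN_INGEST_INTERVAL_MINUTES")  # hypothetical name
    if raw is None or not raw.strip():
        return DEFAULT_INTERVAL_MINUTES
    minutes = int(raw)
    if minutes <= 0:
        raise ValueError("ingestion interval must be a positive number of minutes")
    return minutes


# Passing a plain dict keeps the examples independent of the real environment.
default_interval = ingest_interval_minutes({})
custom_interval = ingest_interval_minutes({"BALEEN_INGEST_INTERVAL_MINUTES": "15"})
```

The run command could then pass the returned value to its scheduler in place of the hard-coded hour.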

Add some interesting examples of EDA of the Baleen corpus export to add to documentation.

novice

Write tests to make clear which Feed attributes can change from request to request and which cannot. It would also be nice to have tests for mandatory feed attributes.