baleen issues

xmlPaths in .opml feed definition files are unescaped

1

Issue: When specifying an RSS feed path with special characters such as `&`, `baleen` fails to find posts. I have confirmed that this can be corrected by manually escaping...

agodbehere
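
For the unescaped-ampersand issue above, here is a minimal sketch of escaping a feed URL before it is written into an OPML attribute. It uses only the standard library. The feed URL is made up, and `xmlUrl` is assumed here because it is the usual OPML attribute name for a feed address; it is not taken from baleen's code.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

# Hypothetical feed URL with an ampersand in its query string.
url = "http://example.com/feed?format=rss&category=cooking"

# quoteattr escapes &, < and > and wraps the value in quotes,
# so the result can be dropped straight into an attribute position.
outline = '<outline type="rss" xmlUrl={} />'.format(quoteattr(url))

# The escaped element parses cleanly and yields the original URL again.
parsed = ET.fromstring(outline)
recovered = parsed.get("xmlUrl")
```

Writing the raw URL into the attribute instead would leave a bare `&` in the file, which a strict XML parser rejects.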

Export to directory other than '.' fails

1

Issue: Exporting to a directory such as `corpora/` with `bin/baleen export corpora` results in an error like `[Errno 2] No such file or directory: 'corpora/corpora/cooking/5b2d180b7af8b43e439b59b0.json'`. This is a path...

agodbehere
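
The issue text is truncated, so the actual cause in baleen is not shown here. As a general defensive pattern only, the sketch below resolves the export root to an absolute path once, so that later joins cannot prepend the relative directory a second time. The function name `export_path` and its arguments are hypothetical.

```python
import os
import tempfile

def export_path(root, category, filename):
    """Build the output path for one post under an absolute export root."""
    root = os.path.abspath(root)           # resolve once, before any chdir
    directory = os.path.join(root, category)
    os.makedirs(directory, exist_ok=True)  # create missing directories up front
    return os.path.join(directory, filename)

# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    root = os.path.join(tmp, "corpora")
    path = export_path(root, "cooking", "post.json")
    with open(path, "w") as f:
        f.write("{}")
    exists = os.path.exists(path)
    doubled = os.path.join("corpora", "corpora") in path
```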

Add load from csv

3

console/commands/load can handle OPML files. I don't have OPML, and couldn't easily find an OPML editor. CSV is easy to compose, however. Add support for loading feeds from CSV.

janetriley
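
A rough sketch of what reading feeds from CSV could look like, using the standard `csv` module. The column names (`category`, `title`, `link`) and the `read_feeds` helper are assumptions for illustration; the issue does not specify a layout, and this does not touch baleen's existing load command.

```python
import csv
import io

# Hypothetical CSV layout: one feed per row.
SAMPLE = """category,title,link
cooking,Example Cooking,http://example.com/cooking/feed
news,Example News,http://example.com/news/rss?a=1&b=2
misc,No Link Here,
"""

def read_feeds(handle):
    """Yield one dict per feed row, skipping rows without a link."""
    for row in csv.DictReader(handle):
        link = (row.get("link") or "").strip()
        if not link:
            continue
        yield {
            "category": (row.get("category") or "").strip(),
            "title": (row.get("title") or "").strip(),
            "link": link,
        }

feeds = list(read_feeds(io.StringIO(SAMPLE)))
```

Because CSV carries the URL as plain text, characters such as `&` need no escaping at this stage.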

Export Compressed Posts

1

According to these numbers (http://bbengfort.github.io/observations/2017/06/07/compression-benchmarks.html), we can achieve much better export results if we gzip each file individually as we export it. This should help our export and admin process...

bbengfort

type: feature

priority: high

intermediate
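
As an illustration of gzipping each file individually, the sketch below writes one post as gzip-compressed JSON and reads it back. The helper names `write_post` and `read_post`, the `.json.gz` suffix, and the sample post are assumptions, not part of the existing exporter.

```python
import gzip
import json
import os
import tempfile

def write_post(path, post):
    """Serialize one post as JSON and gzip it into path."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        json.dump(post, f)

def read_post(path):
    """Load a post previously written by write_post."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)

# Round trip in a throwaway directory with a made-up, repetitive post.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "post.json.gz")
    post = {"title": "example", "content": "<p>hello</p>" * 100}
    write_post(target, post)
    restored = read_post(target)
    compressed_size = os.path.getsize(target)
```

Compressing per file keeps each post independently readable, at the cost of a small gzip header on every file.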

move sanitize to its own exporter option

2

export.py's `--scheme` argument accepts json and html, as well as sanitize levels raw, safe, and text. Move sanitize levels to their own argument and ensure they get passed in properly...

janetriley
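
One possible shape for the split, sketched with `argparse`: `--scheme` keeps only the output formats and a separate option carries the sanitize level. The flag name `--sanitize` and both default values are assumptions; the issue only says the levels should move to their own argument.

```python
import argparse

def build_parser():
    """Parser with output format and sanitize level as separate options."""
    parser = argparse.ArgumentParser(prog="export")
    parser.add_argument(
        "--scheme", choices=("json", "html"), default="json",
        help="output format of the exported posts",
    )
    # Hypothetical flag name; the levels come from the issue text.
    parser.add_argument(
        "--sanitize", choices=("raw", "safe", "text"), default="safe",
        help="how much to sanitize the post content",
    )
    return parser

args = build_parser().parse_args(["--scheme", "html", "--sanitize", "text"])
```

With `choices` on both options, a sanitize level passed to `--scheme` is rejected at parse time instead of being silently accepted.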

Change post object in order to avoid duplicate fetch

So the goal was to avoid a duplicate fetch of a post object that is already in Mongo. Alas, even if the post object is in Mongo, we might have fetched...

tmeshorer

Update to use Python 3.5

20

Update the project to be compatible with Python 3.5 so we have the option to use asyncio.

janetriley

type: technical debt

in progress

priority: high

Configurable Scheduling

Acceptance criteria:
- Ability to configure how often ingestion runs

The current run frequency is hard-coded to every hour: https://github.com/DistrictDataLabs/baleen/blob/master/baleen/console/commands/run.py#L51 We'll need to add a configuration option...

will2041

intermediate
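
A small sketch of a configurable interval for the scheduling request above, falling back to the hourly behaviour the issue describes. The environment variable `BALEEN_INGEST_MINUTES` and the function `ingest_interval` are hypothetical; baleen's own configuration mechanism may look quite different.

```python
import os

DEFAULT_INTERVAL_MINUTES = 60  # matches the hourly run described in the issue

def ingest_interval(environ=os.environ):
    """Return the ingestion interval in minutes.

    Reads the hypothetical BALEEN_INGEST_MINUTES variable and falls back
    to the hourly default when it is not set.
    """
    raw = environ.get("BALEEN_INGEST_MINUTES")
    if raw is None:
        return DEFAULT_INTERVAL_MINUTES
    minutes = int(raw)  # raises ValueError on non-numeric input
    if minutes <= 0:
        raise ValueError("ingestion interval must be a positive number of minutes")
    return minutes

every_fifteen = ingest_interval({"BALEEN_INGEST_MINUTES": "15"})
```

Validating the value at startup means a bad setting fails immediately rather than at the first scheduled run.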

Examples for documentation

1

Add some interesting examples of EDA of the Baleen corpus export to add to documentation.

rebeccabilbro

novice

Write tests to make clear which Feed attributes could be changed

Write tests to make clear which Feed attributes can change from request to request and which cannot. It would also be nice to have tests for mandatory feed attributes.

olgert
