baleen
An automated ingestion service for blogs to construct a corpus for NLP research.
Taking some lessons from Steven Lott's PyData presentation (http://pydata.org/dc2016/schedule/presentation/40/, https://twitter.com/s_lott, https://slott56.github.io/no-sql-doesnt-mean-no-schema/assets/player/KeynoteDHTMLPlayer.html#0), we can formalize the Mongo schemas using JSON and rely on JSON validation to ensure that we never even...
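A minimal sketch of what checking a document against a JSON-style schema before it reaches Mongo might look like. This is a hand-rolled check that covers only `required` and `type`, not a full JSON Schema validator, and `POST_SCHEMA` and its field names are hypothetical rather than taken from Baleen's models.

```python
# Hypothetical schema for a post document; the field names are illustrative only.
POST_SCHEMA = {
    "type": "object",
    "required": ["url", "title"],
    "properties": {
        "url": {"type": "string"},
        "title": {"type": "string"},
        "tags": {"type": "array"},
    },
}

# Map the JSON Schema type names used above onto Python types.
JSON_TYPES = {"object": dict, "array": list, "string": str}


def validate(document, schema):
    """Return a list of problems; an empty list means the document passed."""
    if not isinstance(document, JSON_TYPES[schema["type"]]):
        return ["expected top-level {}".format(schema["type"])]
    errors = []
    # Every required field must be present.
    for name in schema.get("required", []):
        if name not in document:
            errors.append("missing required field: {}".format(name))
    # Fields that are present must have the declared type.
    for name, rule in schema.get("properties", {}).items():
        if name in document and not isinstance(document[name], JSON_TYPES[rule["type"]]):
            errors.append("field {} should be of type {}".format(name, rule["type"]))
    return errors
```

Calling `validate(doc, POST_SCHEMA)` before an insert would let the caller reject or log a malformed document instead of storing it.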
Timeout decorator introduced with https://github.com/bbengfort/baleen/commit/2e5d83767cfa3ceebfdada0680f713e73e10fbae
Acceptance criteria:
- Use the decorator for methods with potentially very long-running operations
- Properly handle BaleenTimeout errors at call sites for these methods
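For illustration, here is a sketch of a timeout decorator and a call site that handles `BaleenTimeout`. It is not the implementation from the commit above: the thread-based approach, the `timeout` signature, and the helper names `slow_fetch`, `fast_fetch`, and `fetch_or_skip` are all assumptions made for the example.

```python
import functools
import threading
import time


class BaleenTimeout(Exception):
    """Raised when a wrapped call exceeds its time budget."""


def timeout(seconds):
    """Hypothetical decorator: raise BaleenTimeout if the call outlasts `seconds`."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            outcome = {}

            def target():
                # Capture either the return value or the exception for the caller.
                try:
                    outcome["value"] = func(*args, **kwargs)
                except Exception as exc:
                    outcome["error"] = exc

            # A daemon thread will not keep the interpreter alive after a timeout.
            worker = threading.Thread(target=target, daemon=True)
            worker.start()
            worker.join(seconds)
            if worker.is_alive():
                # The worker is abandoned, not killed; it may still finish later.
                raise BaleenTimeout("{} exceeded {}s".format(func.__name__, seconds))
            if "error" in outcome:
                raise outcome["error"]
            return outcome["value"]
        return wrapper
    return decorator


@timeout(0.05)
def slow_fetch():
    time.sleep(0.5)  # stands in for a long-running operation
    return "done"


@timeout(1.0)
def fast_fetch():
    return "done"


def fetch_or_skip(fetch):
    """Call site: treat a timeout as 'no result' rather than crashing."""
    try:
        return fetch()
    except BaleenTimeout:
        return None
```

With this sketch, `fetch_or_skip(slow_fetch)` gives up after the time budget and returns `None`, while `fetch_or_skip(fast_fetch)` returns the normal result. A thread-based timeout cannot interrupt the underlying work, so it only bounds how long the caller waits.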
The pymongo driver is very strict: if it can't decode a Mongo document, it raises an exception. This is turning up in export, where apparently (after 12 minutes or...
This method was originally written to wrap HTML snippets so that they look like a real web page. Now we have the ability to fetch complete web pages from RSS feeds. However...
Add a timeout so that if a post or feed is taking too long or failing to download, we skip it and carry on.
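A rough sketch of the skip-and-carry-on behavior, using only the standard library. The function names `fetch` and `fetch_all`, the 10-second default, and the injectable `fetcher` argument are assumptions for illustration, not Baleen's actual ingestion code.

```python
import urllib.request


def fetch(url, timeout=10.0):
    """Download a URL, giving up if the server stalls for more than `timeout` seconds."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_all(urls, fetcher=fetch):
    """Fetch every URL, recording failures instead of aborting the whole run."""
    results, skipped = {}, []
    for url in urls:
        try:
            results[url] = fetcher(url)
        except OSError as exc:
            # socket.timeout and urllib.error.URLError are both OSError subclasses.
            skipped.append((url, exc))
    return results, skipped
```

Returning the skipped URLs alongside the results keeps the failures visible, so they can be logged or retried on a later run.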
Update Quickstart documentation as we discover gaps at PyCon sprints.
Baleen crashes when Mongo refuses a connection; not sure why that's happening though.
The method `baleen.models.Feed.count_posts` is too slow on the deployment server. It seems that `Post.objects(feed=self).count()` is going through the entire collection and filtering, which is bad. Need to figure out a...
The currently running status screen got a bit wonky by accident (screenshot 2016-04-19 12 52 24). I think this was just caused by us writing updates at the same...