crawling-framework
Easily crawl news portals or blog sites using Storm Crawler.
When storing temporary data, Elasticsearch can become a bottleneck. Optionally, use Redis for that instead.
Increasing Topology.worker to 4 stops the crawl. If increasing the worker count fails like this, there is no benefit to using a Storm cluster.
- [ ] use name
- [ ] analyze source keyword
The Stats button shows the status of the crawl, but if nothing has been crawled it would be good to see that in the table, without opening the stats popup....
Currently, configuration can be managed only through the Administration UI.
- [ ] Upload CSV with sources, related (#2)
- [ ] Check which ones are already configured.
- [ ] other validations TODO.
- [ ] export CSV with...
The error should also log the erroneous JSON so that we can learn how to pre-process it to avoid such errors:
```
WARN l.t.c.p.u.JsonLdParser - Failed to parse ld+json com.fasterxml.jackson.core.JsonParseException: Document contains...
```
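One way the request above could be approached is sketched here, using only the Java standard library. The class `LoggingJsonParser`, the `ThrowingParser` interface, the `parseOrLog` and `truncate` helpers, and the 2000-character limit are all hypothetical names and values chosen for illustration; they are not taken from the project's `JsonLdParser`. The idea is simply to catch the parse failure and include the (length-capped) raw input in the warning.

```java
import java.util.Optional;
import java.util.logging.Level;
import java.util.logging.Logger;

public class LoggingJsonParser {
    private static final Logger LOG = Logger.getLogger(LoggingJsonParser.class.getName());

    // Assumed cap so that one huge document does not flood the log.
    static final int MAX_LOGGED_CHARS = 2000;

    // Hypothetical stand-in for any parser that may throw a checked exception.
    @FunctionalInterface
    interface ThrowingParser<T> {
        T parse(String raw) throws Exception;
    }

    // Returns the raw input unchanged if short enough, otherwise a capped prefix.
    static String truncate(String raw) {
        if (raw == null) {
            return "null";
        }
        if (raw.length() <= MAX_LOGGED_CHARS) {
            return raw;
        }
        return raw.substring(0, MAX_LOGGED_CHARS)
                + "... [truncated, " + raw.length() + " chars total]";
    }

    // Runs the parser; on failure, logs the exception together with the raw JSON.
    static <T> Optional<T> parseOrLog(String rawJson, ThrowingParser<T> parser) {
        try {
            return Optional.ofNullable(parser.parse(rawJson));
        } catch (Exception e) {
            LOG.log(Level.WARNING,
                    "Failed to parse ld+json; raw input: " + truncate(rawJson), e);
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // Demo with a toy parser; a real JSON parser call would be passed in its place.
        Optional<Integer> ok = parseOrLog("42", s -> Integer.parseInt(s));
        Optional<Integer> bad = parseOrLog("{not a number", s -> Integer.parseInt(s));
        System.out.println(ok.isPresent() + " " + bad.isPresent());
    }
}
```

In a real parser the lambda would wrap the actual JSON parsing call, and the logged payloads could then be collected to decide which pre-processing steps are needed.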