crawlers
possibility of a com.norconex.collector.http.data.store.impl.dynamodb ?
Along the lines of the MongoDB driver, a DynamoDB driver would be great. If there are no plans/bandwidth to make one, please post any guidance here for a novice Java developer to get started.
Are you talking about the URL crawl store? If so, there are no plans for a DynamoDB implementation, but we can make this a feature request and get to it if there is enough demand.
A crawl store is a cache of what has already been crawled (e.g., to help detect modifications and deletions); it is not meant to store all content and metadata. For that, you need a Committer.
If you want to create a new crawl store, look at how the current ones are implemented, such as MongoDB here and there.
Committers should be simpler to implement, and they are usually what you want. Have a look here to get started. Again, you may want to check how existing ones were done.
Does this help?
If you end up creating your own, let us know!
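For a rough sense of what implementing a Committer involves, here is a minimal sketch. The interface below is a hypothetical local stand-in, not the actual Norconex API (the real contract lives in the Committer Core library and its exact signatures may differ); it only illustrates the general add/remove/commit batching pattern existing committers follow.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the Committer contract: queue document
// additions/deletions, then flush them to a target on commit().
interface SimpleCommitter {
    void add(String reference, byte[] content, Map<String, String> metadata);
    void remove(String reference, Map<String, String> metadata);
    void commit();
}

// Toy implementation that batches operations in memory. A real
// committer would push the batch to its target (e.g., a search
// engine or database) inside commit().
class MemoryCommitter implements SimpleCommitter {
    private final List<String> pending = new ArrayList<>();
    private final List<String> committed = new ArrayList<>();

    @Override
    public void add(String reference, byte[] content, Map<String, String> metadata) {
        pending.add("ADD " + reference);
    }

    @Override
    public void remove(String reference, Map<String, String> metadata) {
        pending.add("DEL " + reference);
    }

    @Override
    public void commit() {
        committed.addAll(pending); // flush the batch to the target here
        pending.clear();
    }

    public List<String> committedOps() {
        return committed;
    }
}
```

The batching matters: committers typically queue documents and send them in bulk, so commit() is where the real I/O happens.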
Awesome response. Awesome software. Thank you!
Yes, I'm looking for the collector, not the committer. Point of note, however: my research indicated problems with SSL/TLS connections to Mongo. I would read that as the current collector user base possibly just using local-network databases (maybe).
Either way, any new ticket should probably mention testing with secure protocols.
Thanks a ton. Can't wait to use this!
Also, to close the loop, I have a completely separate question in to Valerie Draper at Norconex. I hope it doesn't cause confusion, as they are both technical in nature.
Regards,
- Pete Lombardo
+1 - DynamoDB is easy in AWS, and a crawler like Norconex has a calculable number of requests, which fits the DynamoDB provisioning model. A DynamoDB crawl store + S3StatusStore would mean the Norconex collector could run on an AWS spot instance once a week, and people would save loads.
With the current system, the best case is an MVStore on a persistent EBS volume that you reload for the next week's recrawl, but you still need some way to get the status off the box, e.g., a periodic s3 sync.
A DynamoDB crawl store + S3StatusStore would be a nearly complete solution, without resorting to DevOps-style scripting to get it done (like another guy's mkfifo).
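The "calculable number of requests" point can be made concrete with back-of-the-envelope provisioning math. The sketch below assumes (hypothetically) one write of about 1 KB per crawled URL, spread evenly over the crawl window; one standard DynamoDB write capacity unit covers one write per second for an item up to 1 KB.

```java
// Rough DynamoDB write-capacity estimate for a crawl store, assuming
// one ~itemSizeKb write per URL, spread evenly over the crawl window.
class CrawlStoreCapacity {
    // 1 WCU = 1 standard write/second for an item up to 1 KB, so an
    // item of N KB costs N WCUs per write.
    static long requiredWriteCapacityUnits(
            long urlsPerCrawl, long crawlSeconds, long itemSizeKb) {
        long writesPerSecond =
                (long) Math.ceil((double) urlsPerCrawl / crawlSeconds);
        return writesPerSecond * itemSizeKb;
    }
}
```

For example, 1,000,000 URLs crawled over 8 hours (28,800 s) at 1 KB per item works out to about 35 WCUs, i.e., a small, predictable provisioned throughput for a weekly crawl.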