
Attaching a database

Open rtrevinnoc opened this issue 1 year ago • 2 comments

Hello, let4be

First of all, I want to say it is really impressive what you have built; I am really amazed, so congratulations. Furthermore, I see in the README file that one could attach a graph database to save the crawled data, but I can't quite understand how to do that or how it would fit into the dataflow, since I understand crusty already saves the crawled data in some database.

I am interested in broad crawling, particularly in Rust, because I've been working on a peer-to-peer search engine and thus need a low-resource broad crawler. I have an (untidy) Python prototype which I would like to convert to Rust.

I would greatly appreciate it if you could help me with this, so I could solve this problem for the search engine project. Thank you very much in advance. Kind regards.

rtrevinnoc avatar Jul 05 '22 05:07 rtrevinnoc

Hi @rtrevinnoc ! Thanks for taking interest in the project.

Right now crusty does not persist any direct page data, but the data itself is easily available in code through various hooks.

Please check the examples in https://github.com/let4be/crusty-core on how to access various data (parsed HTML or raw byte data, HTTP headers, metrics).

In short, crusty-core is a library that performs the crawling, and crusty is built on top of it; you can check in the code how crusty wraps this functionality.

I think you'd want to start a thread pool, read the data (that you grabbed from the hooks), and then persist it into a database.

Crusty does some persistence right now, but only for pre-aggregated/metrics-like data. Crusty is so fast that persisting full page snapshots is prohibitively expensive, and I just wanted to see how fast the crawling itself could be and what bottlenecks I'd find.

Let me know if you have questions ;)

let4be avatar Jul 11 '22 11:07 let4be

Crusty is nothing more than a bunch of code serving as configuration for crusty-core, plus a handler of IO.

I think you'd mostly be interested in implementing the TaskExpander<JobState, TaskState, Document> interface: get the needed data there, send it via a buffered flume channel, read it from a thread pool, and then persist it to the graph database of your choice. From this "hook" you can get all outbound page links and other useful info.
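The hook → buffered channel → writer-thread pattern described above can be sketched roughly as follows. Since the exact TaskExpander method signature depends on the crusty-core version, this sketch shows only the plumbing: the standard library's `sync_channel` stands in for a bounded flume channel, and `PageLinks` and `persist_batch` are hypothetical stand-ins for the data you'd pull out of the crawled `Document` and for your graph-database writes.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

// Hypothetical per-page payload a TaskExpander-style hook would emit;
// in crusty-core the real data would come from the parsed Document.
struct PageLinks {
    url: String,
    outbound: Vec<String>,
}

// Hypothetical stand-in for a batched graph-database insert.
fn persist_batch(batch: &[PageLinks]) -> usize {
    batch.iter().map(|p| p.outbound.len()).sum()
}

fn main() {
    // Bounded channel (flume's bounded() behaves similarly): once 1024
    // items are in flight, the sending hook blocks, giving backpressure.
    let (tx, rx) = sync_channel::<PageLinks>(1024);

    // Writer thread: drain the channel and persist in small batches,
    // so the crawler's hot path never waits on the database.
    let writer = thread::spawn(move || {
        let mut total = 0;
        let mut batch = Vec::new();
        for page in rx {
            batch.push(page);
            if batch.len() >= 100 {
                total += persist_batch(&batch);
                batch.clear();
            }
        }
        total += persist_batch(&batch); // flush the remainder on shutdown
        total
    });

    // Inside the hook, you'd do something like this for every page:
    tx.send(PageLinks {
        url: "https://example.com".into(),
        outbound: vec!["https://example.com/a".into()],
    })
    .unwrap();

    drop(tx); // closing the last sender lets the writer loop end
    let links_persisted = writer.join().unwrap();
    println!("persisted {links_persisted} outbound links");
}
```

The bounded capacity is the important design choice here: if the database writer falls behind, the hook blocks on `send` instead of letting memory grow without limit.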

All graph databases I've seen are very resource-hungry and slow on inserts... I'd be curious to know what you're planning to use and what the stored data structure would look like.

let4be avatar Jul 11 '22 11:07 let4be