Viktor
This can be done by using a headless browser to fetch the root document and analyze the rendered DOM. The existing junk-detection solution works decently well, but operates on static HTML...
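One cheap way to decide which pages even need the headless-browser pass is to compare the visible text of the static HTML against the rendered DOM: a large gap suggests the page is JS-rendered and the static junk detector is working from the wrong input. A minimal stdlib sketch (the rendered DOM is assumed to come in as a string from a headless browser such as Playwright; the 2x ratio threshold is an arbitrary placeholder):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text, skipping script/style contents."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)

def visible_text(html: str) -> str:
    p = TextExtractor()
    p.feed(html)
    return " ".join(" ".join(p.parts).split())

def likely_js_rendered(static_html: str, rendered_html: str, ratio: float = 2.0) -> bool:
    """Flag pages whose rendered DOM carries much more visible text than
    the static HTML, i.e. pages the static-HTML junk detector misjudges."""
    static_len = len(visible_text(static_html))
    rendered_len = len(visible_text(rendered_html))
    return rendered_len > ratio * max(static_len, 1)
```

The same `visible_text` pass could then feed the existing junk-detection heuristics with the rendered DOM instead of the raw fetch.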
Implement a basic "safe search" filter for removing NSFW results. A naive Bayesian filter or something along those lines probably goes a long way; there are also "bad website" lists that...
The crawler currently avoids git forges, as crawling them is very resource-intensive for the remote server. A crawler specialization that knows to stay on the main branch and e.g....
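The core of such a specialization is a URL filter that recognizes the link spaces that explode combinatorially on forges (per-commit pages, blame views, non-default refs) and skips them. A sketch with illustrative patterns for GitHub/GitLab/Gitea-style path layouts (the pattern list is an assumption, not a survey of real forge URL schemes):

```python
import re
from urllib.parse import urlparse

# Path fragments that expand into per-commit or per-revision page sets
# on common forges. Illustrative, not exhaustive.
SKIP_PATTERNS = [
    re.compile(r"/(commit|commits|compare|blame|raw|diff)(/|$)"),
    # tree/blob/src views on any ref other than the default branch
    re.compile(r"/(tree|blob|src)/(?!(main|master)(/|$))"),
]

def crawlable(url: str) -> bool:
    """True if the forge URL stays on the default branch and avoids
    history-shaped link spaces."""
    path = urlparse(url).path
    return not any(p.search(path) for p in SKIP_PATTERNS)
```

A fuller version would discover the actual default branch (e.g. from the repo landing page) rather than assuming `main`/`master`, and would still honor robots.txt and per-host rate limits.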
Add capability to index PDF files (when they have text data, OCR is out of scope).
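Since OCR is out of scope, the indexer needs a cheap way to tell text-bearing PDFs from pure image scans. A crude stdlib-only pre-filter is to look for font resources and text-showing operators in the raw bytes; this misses PDFs whose content streams are compressed, so a real implementation would use a proper parser (e.g. pypdf's `extract_text`) as the authoritative check:

```python
def has_text_layer(pdf_bytes: bytes) -> bool:
    """Crude heuristic: a text-bearing PDF usually declares /Font
    resources and uses the Tj/TJ text-showing operators. Only a
    pre-filter; compressed content streams defeat the operator check,
    so fall back to real extraction before discarding a document."""
    has_font = b"/Font" in pdf_bytes
    has_show_op = b"Tj" in pdf_bytes or b"TJ" in pdf_bytes
    return has_font and has_show_op
```

Documents that pass the pre-filter would go through full text extraction; those that fail it are candidates for the (out-of-scope) OCR path and can be skipped for now.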