Why do we not use `commoncrawl` indices, and then possibly build upon them?

Open sga-13 opened this issue 5 months ago • 1 comment

I do not know much about search engines, so I have been reading about them, and I stumbled upon Common Crawl. I know that Stract uses its own crawler, but I find the index smaller than I would like. I also searched the GitHub issues for Common Crawl and found 2 issues in which people hosting Stract locally were advised to use Common Crawl's WARC files. So why does Stract not use them? Are they lacking something that I do not know of? Is there a restriction on using them, such as a ban on commercial use? (I hope that is not the case, since their pages say multiple times that the data can be used by everyone.) Or is it purely a choice based on quality or something else? Maybe the average result quality is not that good, or the data/metadata provided does not meet Stract's expectations.

Jul 11 '25 17:07 sga-13

I just wrote a blog post about how I got Stract to work with data from Common Crawl. I had to hack the parser a little bit ... the blog post links to the diffs in my branch.

https://github.com/StractOrg/stract/discussions/264

Jul 18 '25 17:07 jimpick