Greg Lindahl
Why don't we just create the CDXJ index for news for now -- that's enough for integrity. Then we can kick the can of a potential future columnar index down...
Disorganization is fine because this is a squash sort of idea, and I can review the squash. I will get to this next week, I'm at a conference.
Yes, and because it isn't throttled, use of this package harms the target, which is me.
Any progress? I was hoping for rate limiting, honoring 503 and 429 status codes, and exponential backoff, not just "unthrottled concurrency".
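To make the request concrete, here is a rough Python sketch of the behavior I mean: a fixed pause before every request, a retry on 429 and 503, and a delay that doubles each time unless the server sends a Retry-After value. All the names here (`fetch_politely`, `backoff_delay`, the `fetch` callable and its return shape) are made up for illustration and are not part of this package or any real API. Jitter is left out for brevity.

```python
import time
from typing import Callable, Optional, Tuple

# Status codes that mean "slow down and try again later".
RETRY_STATUSES = {429, 503}

# Hypothetical shape of one response: (status, retry_after_seconds_or_None, body)
Response = Tuple[int, Optional[float], str]


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  retry_after: Optional[float] = None) -> float:
    """Seconds to wait after a throttled response.

    A server-supplied Retry-After value wins; otherwise the delay doubles
    with each attempt (base * 2**attempt), capped at `cap`.
    """
    if retry_after is not None:
        return max(0.0, retry_after)
    return min(cap, base * (2 ** attempt))


def fetch_politely(fetch: Callable[[str], Response], url: str,
                   max_attempts: int = 5, min_interval: float = 1.0,
                   sleep: Callable[[float], None] = time.sleep) -> Tuple[int, str]:
    """Call fetch(url) sequentially, throttled, retrying on 429/503."""
    for attempt in range(max_attempts):
        # Rate limit: always pause before sending a request.
        sleep(min_interval)
        status, retry_after, body = fetch(url)
        if status not in RETRY_STATUSES:
            return status, body
        # Throttled: wait longer before the next attempt.
        sleep(backoff_delay(attempt, retry_after=retry_after))
    raise RuntimeError(f"gave up on {url} after {max_attempts} attempts")
```

The `fetch` and `sleep` parameters are injected so the retry logic can be tested without touching the network. A real client would wrap its HTTP library in `fetch` and parse the Retry-After header there.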
Thanks for adding this to your TODO list, I appreciate it! Here's an example of making a single query in Athena that's much more efficient than gau: https://positive.security/blog/ransack-data-exfiltration#common-crawl
Congratulations!
I have pretty vicious rate limits on the API now, so I expect that this software is broken.
@sebastian-nagel looking forward to your comments!
Last Wednesday in the Croissant WG meeting I pitched exactly this idea -- I want to start with a croissant being able to refer to other croissants. For example, I...
We are ready to try this feature out here at Common Crawl. The situation is that FineWeb 🥂 datasets on Hugging Face 🤗 use 96 Common Crawl crawls, and so...