Greg Lindahl

182 comments by Greg Lindahl

Why don't we just create the CDXJ index for news for now -- that's enough for integrity. Then we can kick the can of a potential future columnar index down...

Disorganization is fine because this is a squash sort of idea, and I can review the squash. I will get to this next week; I'm at a conference.

Yes, and because it isn't throttled, use of this package harms the target, which is me.

Any progress? I was hoping for rate limiting, honoring 503 and 429 status codes, and exponential backoff, not just "unthrottled concurrency".
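For illustration only, here is a minimal Python sketch of the behavior requested above: one request at a time, a pause between requests, and retries with exponential backoff only on 429 and 503, honoring a numeric Retry-After header when the server sends one. The names `polite_get`, `backoff_delay`, and `urllib_fetch`, and the default delays, are hypothetical choices for this sketch, not code from the package under discussion.

```python
import time
import urllib.error
import urllib.request

RETRY_STATUSES = {429, 503}  # "slow down" signals from the server


def backoff_delay(attempt, base=1.0, cap=60.0, retry_after=None):
    """Seconds to wait before retry number `attempt` (0-based)."""
    if retry_after is not None and retry_after.strip().isdigit():
        # Honor a numeric Retry-After header; the HTTP-date form is ignored here.
        return float(retry_after.strip())
    # Otherwise double the wait on each attempt, up to `cap` seconds.
    return min(cap, base * 2 ** attempt)


def urllib_fetch(url):
    """Fetch with the standard library; returns (status, headers, body)."""
    try:
        with urllib.request.urlopen(url, timeout=30) as resp:
            return resp.status, resp.headers, resp.read()
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses arrive as exceptions; pass the status back instead.
        return err.code, err.headers, b""


def polite_get(url, fetch=urllib_fetch, max_retries=5, min_interval=1.0,
               sleep=time.sleep):
    """Sequential, paced GET that retries only on 429/503."""
    for attempt in range(max_retries + 1):
        status, headers, body = fetch(url)
        if status not in RETRY_STATUSES:
            sleep(min_interval)  # pace the next call, even after success
            return status, body
        if attempt < max_retries:
            sleep(backoff_delay(attempt, retry_after=headers.get("Retry-After")))
    raise RuntimeError(f"gave up on {url} after {max_retries} retries")
```

A caller would loop over its URLs and call `polite_get(url)` for each, without running calls in parallel. Adding random jitter to the backoff would be a reasonable refinement if many clients might retry at the same moment.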

Thanks for adding this to your TODO list, I appreciate it! Here's an example of making a single query in Athena that's much more efficient than gau: https://positive.security/blog/ransack-data-exfiltration#common-crawl

I have pretty vicious rate limits on the API now, so I expect that this software is broken.

@sebastian-nagel looking forward to your comments!

Last Wednesday in the Croissant WG meeting I pitched exactly this idea -- I want to start with a croissant being able to refer to other croissants. For example, I...

We are ready to try this feature out here at Common Crawl. The situation is that FineWeb 🥂 datasets on Hugging Face 🤗 use 96 Common Crawl crawls, and so...