Rodney Kinney

25 comments by Rodney Kinney

Observations on using [bff](https://github.com/allenai/bff) for paragraph-level deduping: It runs fine on a server machine. Run time is about 2x that of the merger: 100 CPU hours per CC dump. I used a 150GB Bloom filter,...

From the analysis, there's about a 1% rate of duplication by URL in a dump. Paragraph-level deduping is probably not the right way to handle these even if the error rate...

Deduped two combined dumps by URL. Number of removed documents was still ~1%, suggesting little overlap between dumps.
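A rough sketch of what deduping by URL and measuring the removed fraction could look like. This is not the code used for the numbers above. The function name `dedupe_by_url`, the `"url"` field, and the exact-string URL comparison (no normalization) are assumptions.

```python
def dedupe_by_url(docs):
    """Keep the first document seen for each URL.

    Assumption: each document is a dict with a "url" key, and URLs are
    compared as exact strings.
    Returns (kept_docs, removed_fraction).
    """
    seen = set()
    kept = []
    for doc in docs:
        url = doc["url"]
        if url in seen:
            continue
        seen.add(url)
        kept.append(doc)
    removed_fraction = 1 - len(kept) / len(docs) if docs else 0.0
    return kept, removed_fraction


# Hypothetical records standing in for two combined dumps.
combined = [
    {"url": "https://example.com/a", "text": "one"},
    {"url": "https://example.com/b", "text": "two"},
    {"url": "https://example.com/a", "text": "one again"},
]
kept, removed = dedupe_by_url(combined)
```

If the removed fraction for the combined dumps stays close to the single-dump rate, most duplicate URLs sit within a dump rather than across dumps, which is the reasoning behind the observation above.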

Quick estimate for open-access papers in S2orc:

- Titles + Abstracts = 14.4B characters
- Body Text = 468B characters

Not sure what's happening with automated tests. Maybe timing out? `make test` passes locally, except for the `test_download_file` Rust test, which also fails on the main branch.