Rodney Kinney
Observations on using [bff](https://github.com/allenai/bff) for paragraph-level deduping: Runs fine on a server machine. Runtime is about 2x that of the merger: 100 CPU hours per CC dump. I used a 150GB Bloom Filter,...
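For reference, a rough way to sanity-check a Bloom filter size is the standard false-positive approximation p ≈ (1 - e^(-kn/m))^k, where m is the filter size in bits, n the number of inserted items, and k the number of hash functions. The sketch below is not how bff computes anything; the function names are made up for illustration, and the paragraph count in the usage note is a hypothetical placeholder, not a measured value.

```python
import math


def bloom_false_positive_rate(size_bytes: int, n_items: int, n_hashes: int) -> float:
    """Approximate false-positive rate of a Bloom filter.

    Uses the textbook approximation p = (1 - exp(-k * n / m)) ** k,
    with m = filter size in bits, n = inserted items, k = hash functions.
    """
    m_bits = size_bytes * 8
    return (1.0 - math.exp(-n_hashes * n_items / m_bits)) ** n_hashes


def optimal_num_hashes(size_bytes: int, n_items: int) -> int:
    """Number of hash functions that minimizes the rate above: k = (m / n) * ln 2."""
    m_bits = size_bytes * 8
    return max(1, round(m_bits / n_items * math.log(2)))


if __name__ == "__main__":
    size_bytes = 150 * 10**9      # 150GB filter, as in the note above
    n_paragraphs = 10**11         # HYPOTHETICAL item count, for illustration only
    k = optimal_num_hashes(size_bytes, n_paragraphs)
    rate = bloom_false_positive_rate(size_bytes, n_paragraphs, k)
    print(f"k={k}, estimated false-positive rate={rate:.2e}")
```

Swapping in the real number of paragraphs per dump would give an estimate of how often a unique paragraph is wrongly dropped at a given filter size.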
From the analysis, there's about a 1% rate of duplication by URL within a dump. Paragraph-level deduping is probably not the right way to handle these even if the error rate...
Deduped two combined dumps by URL. Number of removed documents was still ~1%, suggesting little overlap between dumps.
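A minimal sketch of what URL-level deduping can look like, assuming each document is a dict with a `url` field and that the first occurrence of a URL is the one kept. Both are assumptions for illustration, not a description of the code actually used, and `dedupe_by_url` is a made-up name.

```python
from typing import Dict, Iterable, List, Tuple


def dedupe_by_url(docs: Iterable[Dict], url_key: str = "url") -> Tuple[List[Dict], int]:
    """Keep the first document seen for each exact URL string.

    Returns (kept_documents, number_removed). Documents without a URL
    are always kept, since there is nothing to compare them on.
    """
    seen = set()
    kept = []
    removed = 0
    for doc in docs:
        url = doc.get(url_key)
        if url is None:
            kept.append(doc)
        elif url in seen:
            removed += 1
        else:
            seen.add(url)
            kept.append(doc)
    return kept, removed


if __name__ == "__main__":
    # Toy stand-in for two combined dumps.
    docs = [
        {"url": "https://example.com/a", "text": "first"},
        {"url": "https://example.com/b", "text": "second"},
        {"url": "https://example.com/a", "text": "duplicate of first"},
    ]
    kept, removed = dedupe_by_url(docs)
    print(f"removed {removed} of {len(docs)} documents ({removed / len(docs):.1%})")
```

Holding every URL in an in-memory set is fine for a toy example; at the scale of full dumps, a hash of the URL or a Bloom filter would probably be needed instead.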
Quick estimate for open-access papers in S2orc: Titles + Abstracts = 14.4B characters; Body Text = 468B characters.
Not sure what's happening with automated tests. Maybe timing out? `make test` passes locally, except for the `test_download_file` Rust test, which also fails on the main branch.