bff
Lots of changes here (may be considered a refactor more than a PR, but will still require some heavy code reviews and discussion about which changes to keep/fold in). Summary...
Added `bff_v0.py`, a simple Python script that: 1) downloads all `.jsonl.gz` files from a specified S3 directory, 2) runs BFF on them, 3) uploads the outputs back to S3...
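The three steps described for `bff_v0.py` can be sketched roughly as follows. This is not the script from the PR, only a minimal illustration under stated assumptions: `boto3` is used as the S3 client, the helper names (`select_inputs`, `output_key`, `run_pipeline`) are hypothetical, and the BFF command line is supplied by the caller in `bff_cmd` because the exact flags are not given in the summary.

```python
import subprocess
from pathlib import Path


def select_inputs(keys, suffix=".jsonl.gz"):
    """Keep only the S3 keys that look like gzipped JSONL files."""
    return [k for k in keys if k.endswith(suffix)]


def output_key(filename, output_prefix):
    """Build the S3 key under which an output file is uploaded."""
    return output_prefix.rstrip("/") + "/" + filename


def run_pipeline(bucket, input_prefix, output_prefix, workdir, bff_cmd):
    """Download inputs, run BFF, upload outputs.

    Assumption: bff_cmd is a full argument list (hypothetical, caller-supplied)
    that reads from <workdir>/in and writes to <workdir>/out.
    """
    import boto3  # third-party; imported lazily so the helpers work without it

    s3 = boto3.client("s3")
    in_dir = Path(workdir) / "in"
    out_dir = Path(workdir) / "out"
    in_dir.mkdir(parents=True, exist_ok=True)
    out_dir.mkdir(parents=True, exist_ok=True)

    # 1) Download every .jsonl.gz object under the input prefix.
    keys = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=input_prefix):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    for key in select_inputs(keys):
        # Note: files are flattened by basename, so names must be unique.
        s3.download_file(bucket, key, str(in_dir / Path(key).name))

    # 2) Run BFF on the downloaded files; fail loudly on a non-zero exit code.
    subprocess.run(bff_cmd, check=True)

    # 3) Upload whatever BFF wrote to the output directory.
    for path in sorted(out_dir.glob("*.jsonl.gz")):
        s3.upload_file(str(path), bucket, output_key(path.name, output_prefix))
```

Passing the command as a list keeps the sketch independent of BFF's actual arguments; the caller is responsible for pointing it at the `in` and `out` directories.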
Several changes to main.rs: 1. Added progress-bar output instead of a printout for each filename (tried to use formatting similar to `wimbd`) 2. Added directory support for inputs (can pass...
Thanks for sharing the great code! It has been very useful for me. I'm new to Rust and Bloom filters, and I have one question regarding the deduplication scope in...
@chris-ha458 has made some great improvements to BFF in the https://github.com/allenai/dolma repo. We should back-port those changes here, especially the ones that have to do with correctness (like the ones...
One thing that might be worth documenting when we get a chance is that the "bff_duplicate_spans" created by the `--annotate-only` option are byte spans rather than character spans as...
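To illustrate why the byte/character distinction matters for downstream consumers, here is a small Python sketch. It assumes the spans are half-open `[start, end)` offsets into the UTF-8 encoding of the text; the issue does not state either detail, and the function names are hypothetical.

```python
def slice_byte_span(text: str, start: int, end: int) -> str:
    """Return the substring covered by a UTF-8 byte span [start, end)."""
    return text.encode("utf-8")[start:end].decode("utf-8")


def byte_span_to_char_span(text: str, start: int, end: int) -> tuple[int, int]:
    """Convert a UTF-8 byte span into the equivalent character span."""
    data = text.encode("utf-8")
    char_start = len(data[:start].decode("utf-8"))
    char_end = char_start + len(data[start:end].decode("utf-8"))
    return char_start, char_end


# "é" takes two bytes in UTF-8, so byte and character offsets diverge after it.
text = "café au lait\nsecond paragraph\n"
span = (14, 31)  # byte offsets of the second paragraph

print(repr(slice_byte_span(text, *span)))   # 'second paragraph\n'
print(byte_span_to_char_span(text, *span))  # (13, 30)
print(repr(text[span[0]:span[1]]))          # 'econd paragraph\n' (off by one)
```

Indexing a Python `str` directly with byte offsets silently returns the wrong text as soon as a multi-byte character precedes the span, which is why the convention is worth documenting.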
Hi @dirkgr! Here is a feature that would be very much desirable for decontamination, but I'm not sure how difficult it would be to implement into BFF: The essential part...