Matt Jordan

6 issues by Matt Jordan

I found it helpful to pass the config object to my logger, which is easier when the configs can be stashed as a dict. Added tests, too.
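A minimal sketch of the idea, not the code from this PR: it assumes the configs are dataclasses, and the names `RunConfig`, `FilterConfig`, `to_dict`, and `log_config` are hypothetical, chosen only for illustration. It shows a nested config flattened to a plain dict and handed to a standard `logging` logger as JSON.

```python
import json
import logging
from dataclasses import asdict, dataclass, field


@dataclass
class FilterConfig:
    # Hypothetical nested config section.
    threshold: float = 0.5


@dataclass
class RunConfig:
    # Hypothetical top-level config.
    name: str = "demo"
    workers: int = 4
    filter: FilterConfig = field(default_factory=FilterConfig)

    def to_dict(self) -> dict:
        # asdict() recurses into nested dataclasses, lists, and dicts,
        # so the result contains only plain Python containers.
        return asdict(self)


logger = logging.getLogger("config-demo")


def log_config(config: RunConfig) -> str:
    # Serialize with sorted keys so repeated runs log identical text.
    payload = json.dumps(config.to_dict(), sort_keys=True)
    logger.info("config: %s", payload)
    return payload
```

Because the dict holds only built-in types, the same value can be compared directly in a test or passed to any logger that accepts a mapping.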

Lots of changes here (arguably more a refactor than a PR; it will still need heavy code review and discussion about which changes to keep or fold in). Summary...

Added `bff_v0.py`, a simple Python script that: 1) downloads all `.jsonl.gz` files from a specified S3 directory, 2) runs BFF on them, 3) uploads the outputs back to S3...
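The three steps above could be sketched roughly as follows. This is not the contents of `bff_v0.py`: the function names, the `input`/`output` working directories, and the `make_command` callback are all assumptions made for illustration. The `s3` argument is assumed to be a boto3 S3 client (for example `boto3.client("s3")`), and the BFF command line is deliberately left to the caller rather than guessed here.

```python
import subprocess
from pathlib import Path


def list_jsonl_gz_keys(s3, bucket, prefix):
    """Return every key under the prefix that ends in .jsonl.gz."""
    keys = []
    # list_objects_v2 returns at most 1000 keys per call, so paginate.
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if obj["Key"].endswith(".jsonl.gz"):
                keys.append(obj["Key"])
    return keys


def run_pipeline(s3, bucket, in_prefix, out_prefix, workdir, make_command):
    """Download inputs, run a caller-supplied command, upload outputs.

    make_command(in_dir, out_dir) is a hypothetical hook that must return
    the argv list for the deduplication tool; no flags are assumed here.
    """
    in_dir = Path(workdir) / "input"
    out_dir = Path(workdir) / "output"
    in_dir.mkdir(parents=True, exist_ok=True)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Step 1: download. Files are stored by basename, so keys that share
    # a basename in different subdirectories would overwrite each other.
    for key in list_jsonl_gz_keys(s3, bucket, in_prefix):
        s3.download_file(bucket, key, str(in_dir / Path(key).name))

    # Step 2: run the tool; check=True raises if it exits non-zero.
    subprocess.run(make_command(in_dir, out_dir), check=True)

    # Step 3: upload whatever .jsonl.gz files the tool wrote.
    uploaded = []
    for path in sorted(out_dir.glob("*.jsonl.gz")):
        out_key = f"{out_prefix.rstrip('/')}/{path.name}"
        s3.upload_file(str(path), bucket, out_key)
        uploaded.append(out_key)
    return uploaded
```

Passing the client and the command in as arguments keeps the sketch independent of any particular credentials setup or tool invocation, and makes it possible to exercise the flow with a stand-in client.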

Several changes to `main.rs`: 1. Added a progress bar in place of a printout for each filename (tried to use formatting similar to `wimbd`) 2. Added directory support for inputs (can pass...

General updates to the `dedupe` command so that it deduplicates using a joint paragraph/document flow, the same way DCLM does. Nuanced update list: Bloom Filter updates: - Used better...

Added a `requirements.txt` file to make local installs easier. I generated it by running uv in a clean environment, calling `pip install dolma`, and freezing the result.