bff
Big OpenLM/DCLM <-> AI2 PR # 1
Lots of changes here (this may be more of a refactor than a PR, but it will still require heavy code review and discussion about which changes to keep/fold in).
Summary of changes:
- Added commands for `bff` and `sysreq` to get a sense of how much memory a given BFF run will require
- Changed some defaults of arguments:
- min-ngram/max-ngram now default to [20,20]
- by default the bloom filter file is not saved (this can be specified)
- annotations have been merged into a single argument
- progress bar present (but a `no-progress-bar` arg is also present)
- some more abstraction/functions to break things up and eventually not repeat code when I push the S3 PR
- added BOTH level removal type (some discussion about what this does in the RemoveType enum)
- Added some printouts with BFF sparsity, removal rates, time
- misc performance-y things, like parallel iteration in some places
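
For reviewers who want a rough feel for the kind of estimate a memory-sizing command can produce, here is a minimal sketch using the textbook Bloom filter sizing formulas. It is not taken from this PR: the function names (`bloom_filter_bits`, `optimal_num_hashes`, `bits_to_gib`) and the example inputs are hypothetical, and the actual `sysreq` logic may differ.

```rust
use std::f64::consts::LN_2;

/// Textbook optimal bit count for a Bloom filter that holds `expected_items`
/// at a target false-positive rate `fp_rate`: m = -n * ln(p) / (ln 2)^2.
/// (Hypothetical helper, not the PR's implementation.)
fn bloom_filter_bits(expected_items: u64, fp_rate: f64) -> u64 {
    assert!(fp_rate > 0.0 && fp_rate < 1.0, "fp_rate must be in (0, 1)");
    let bits = -(expected_items as f64) * fp_rate.ln() / (LN_2 * LN_2);
    bits.ceil() as u64
}

/// Textbook optimal hash count: k = (m / n) * ln 2, with a floor of 1.
fn optimal_num_hashes(bits: u64, expected_items: u64) -> u32 {
    if expected_items == 0 {
        return 1;
    }
    let k = (bits as f64 / expected_items as f64) * LN_2;
    k.round().max(1.0) as u32
}

/// Convert a bit count to GiB for display.
fn bits_to_gib(bits: u64) -> f64 {
    bits as f64 / 8.0 / (1u64 << 30) as f64
}

fn main() {
    // Assumed inputs, for illustration only.
    let expected_ngrams: u64 = 1_000_000_000;
    let fp_rate = 0.01;

    let bits = bloom_filter_bits(expected_ngrams, fp_rate);
    let hashes = optimal_num_hashes(bits, expected_ngrams);
    println!(
        "filter size: {:.2} GiB, hash functions: {}",
        bits_to_gib(bits),
        hashes
    );
}
```

The estimate covers only the filter's bit array; any per-thread buffers or document data held in memory during a run would come on top of it.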