dolma
Text modification config
Add mixer configuration to trim leading/trailing whitespace from document text and to enforce a minimum document text length. Place these in a new text_modification config object, and move the span_replacements config into it.
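For reviewers, here is a rough Python sketch of the intended behavior and of the config reshaping. It is not the mixer's actual code. The field names trim_whitespace and min_text_length, the helper names, the trim-before-length-check ordering, and the assumption that span_replacements previously sat at the top level of a stream config are all hypothetical and used only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TextModificationConfig:
    # Hypothetical field names; the real config keys may differ.
    trim_whitespace: bool = False
    min_text_length: int = 0
    # Carried along only to show where span_replacements would live;
    # this sketch does not apply the replacements.
    span_replacements: List[dict] = field(default_factory=list)


def apply_text_modification(text: str, config: TextModificationConfig) -> Optional[str]:
    """Return the modified text, or None if the document should be dropped."""
    if config.trim_whitespace:
        # Remove leading and trailing whitespace.
        text = text.strip()
    # Assumption: the length check runs after trimming.
    if len(text) < config.min_text_length:
        return None
    return text


def migrate_stream_config(old: dict) -> dict:
    """Move a top-level span_replacements entry under text_modification."""
    new = {k: v for k, v in old.items() if k != "span_replacements"}
    if "span_replacements" in old:
        text_mod = dict(new.get("text_modification", {}))
        text_mod["span_replacements"] = old["span_replacements"]
        new["text_modification"] = text_mod
    return new
```

Under these assumptions, a document that is shorter than the minimum length after trimming is dropped, and an existing config needs only its span_replacements entry relocated to match the new structure.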
@soldni any objections to this backward-incompatible change to config structure?
Not sure what's happening with automated tests. Maybe timing out?
make test passes locally, except for the test_download_file Rust test, which also fails on the main branch.