[feat] Add data cleaning in `fast-llm prepare`
🎯 Goal (What & Why)
Add a comprehensive data cleaning stage to the `fast-llm prepare` command.
`fast-llm prepare` currently downloads and tokenizes HuggingFace datasets into Fast-LLM's `.bin`/`.idx` format using a distributed torchrun setup. However, it performs no data cleaning, which limits training quality and poses risks around PII and malicious content.
This ticket adds a required data cleaning phase that is configurable, fast, and integrated into the distributed preprocessing loop. The goal is to improve model quality, reduce noise, follow best practices (see OLMo-2), and meet responsible AI standards by removing PII and malware from the training corpus.
🚀 Execution Plan
Step 1: What is the smallest working version?
- Extend `fast-llm prepare` to apply a modular and configurable cleaning pipeline during preprocessing (see the sketch below).
- All cleaning steps must be integrated into the existing `torchrun` CPU-only distributed setup, preserving parallelism.
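As a rough illustration, the pipeline could be structured around a small filter interface. All names below (`DocumentFilter`, `CleaningPipeline`, `keep`) are hypothetical, not existing Fast-LLM APIs:

```python
from abc import ABC, abstractmethod


class DocumentFilter(ABC):
    """One configurable cleaning step; decides whether a document is kept."""

    name: str = "filter"

    @abstractmethod
    def keep(self, text: str) -> bool:
        ...


class CleaningPipeline:
    """Applies filters in order, counting removals per filter for logging."""

    def __init__(self, filters: list[DocumentFilter]) -> None:
        self.filters = filters
        self.removed = {f.name: 0 for f in filters}

    def keep(self, text: str) -> bool:
        for f in self.filters:
            if not f.keep(text):
                self.removed[f.name] += 1
                return False
        return True
```

Each torchrun rank would build one pipeline and apply it to its own shard of documents before tokenization, so the existing parallelism is untouched.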
Step 2: Required cleaning filters (all must be implemented):
- Length filtering
- Remove documents exceeding a configurable max length (in characters or tokens).
- n-gram repetition
- Remove documents with ≥N repeated n-grams (default: N = 32), as in OLMo-2; a sketch of this and the frequency filter follows the list.
- Frequency-based filtering
- Remove documents where:
- The most frequent word exceeds X% of total tokens (default: 30%).
- The top-2 most frequent words together exceed Y% of total tokens (default: 50%).
- Binary content filtering
- Remove documents that contain mostly binary data.
- Numerical content filtering
- Remove documents with a high fraction of numeric tokens (configurable threshold, default: 50%).
- PII redaction
- Integrate Microsoft Presidio to detect sensitive personal information and either redact it or remove the affected documents.
- Malware removal
- Integrate ClamAV to scan documents and remove any that trigger detections.
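To make the thresholds above concrete, here is a minimal sketch of the n-gram repetition and frequency-based filters. It assumes whitespace-tokenized documents and reads the ticket's defaults one plausible way; the exact semantics would need to be pinned down during implementation:

```python
from collections import Counter


def repeated_ngram_count(tokens: list[str], n: int = 3) -> int:
    """Count distinct n-grams that occur more than once in the document."""
    counts = Counter(tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1))
    return sum(1 for c in counts.values() if c > 1)


def keep_by_word_frequency(
    tokens: list[str], top1_max: float = 0.30, top2_max: float = 0.50
) -> bool:
    """Drop documents dominated by one or two words (ticket defaults: 30% / 50%)."""
    if not tokens:
        return False  # treat empty documents as removable
    top = Counter(tokens).most_common(2)
    top1_frac = top[0][1] / len(tokens)
    top2_frac = sum(c for _, c in top) / len(tokens)
    return top1_frac <= top1_max and top2_frac <= top2_max
```

Under this reading, a document is dropped when `repeated_ngram_count(tokens) >= 32` or when `keep_by_word_frequency(tokens)` returns `False`.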
All thresholds and filter behaviors must be exposed via the CLI and config files. Document-level logs or counters should be maintained for each filter to aid debugging and analysis.
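For illustration only, the YAML surface might look like the following; the key names are hypothetical and would need to match Fast-LLM's actual config schema:

```yaml
# Hypothetical schema; key names are illustrative, not final.
data_cleaning:
  max_length: 100000            # characters (or tokens)
  ngram_repetition:
    n: 3
    max_repeated: 32
  word_frequency:
    top1_max_fraction: 0.30
    top2_max_fraction: 0.50
  binary: true                  # drop mostly-binary documents
  numeric_fraction_max: 0.50
  pii:
    engine: presidio
    action: redact              # or: remove
  malware:
    engine: clamav
    action: remove
```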
📌 Acceptance Criteria
- All listed filters are implemented and integrated into `fast-llm prepare`.
- Cleaning is fully configurable, both via CLI and YAML config files.
- The implementation works with the existing distributed CPU setup (torchrun + Gloo).
- Performance remains acceptable.
- Logs report how many documents each filter removed (aggregated across ranks; see the sketch after this list).
- Code is tested and documented.
- PR includes a performance/impact summary and example CLI usage.
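Since each rank only sees its own shard, the per-filter counters need a reduction before logging. A minimal sketch, assuming the Gloo process group from `fast-llm prepare`'s existing torchrun setup is already initialized (Gloo supports `all_reduce` on CPU tensors):

```python
import torch
import torch.distributed as dist


def aggregate_filter_counts(local_counts: dict[str, int]) -> dict[str, int]:
    """Sum per-filter removal counters across all ranks for final logging."""
    names = sorted(local_counts)  # fixed ordering so all ranks agree
    counts = torch.tensor([local_counts[n] for n in names], dtype=torch.int64)
    dist.all_reduce(counts, op=dist.ReduceOp.SUM)
    return dict(zip(names, counts.tolist()))
```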
🛠️ Project Management
- [x] Assign the project to the Fast-LLM project.
- [ ] Set the `Estimate` field (in days) in the GitHub project.
- [x] Use the `Size` field to categorize the PR size (Small/Medium/Large).
- [x] Assign an owner when opening the issue.
@tscholak Is this still relevant? Let's describe or close
@jlamypoirier it's more relevant than ever.
How should we manage model downloading and loading for Presidio, and virus database handling for ClamAV?
More details are available here.
The likely approach is to specify a cache folder and download the necessary files if they are not already present. However, we need to determine whether this should happen per run or persist across multiple runs. If the latter, we must address potential conflicts between different parallel executions.
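One possible shape for the persistent-cache approach, using the third-party `filelock` package to guard against concurrent downloads; all names here (`ensure_cached`, `download_fn`) are hypothetical:

```python
import os
from pathlib import Path

from filelock import FileLock  # third-party: pip install filelock


def ensure_cached(cache_dir: Path, name: str, download_fn) -> Path:
    """Fetch an artifact (e.g. a Presidio model or the ClamAV database) once,
    even when several ranks or runs share a persistent cache directory."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / name
    with FileLock(str(target) + ".lock"):  # serializes concurrent downloaders
        if not target.exists():
            tmp = target.with_name(name + ".tmp")
            download_fn(tmp)         # caller-supplied download logic
            os.replace(tmp, target)  # atomic rename publishes the file
    return target
```

A per-artifact lock plus an atomic rename would let the cache persist across runs while preventing parallel executions from clobbering each other; how often to refresh the artifacts (per run vs. per job) would still need to be decided.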