[feat] Add data cleaning in `fast-llm prepare`
🎯 Goal (What & Why)
Add a comprehensive data cleaning stage to the `fast-llm prepare` command.
`fast-llm prepare` currently downloads and tokenizes HuggingFace datasets into Fast-LLM's `.bin`/`.idx` format using a distributed torchrun setup. However, it performs no data cleaning, which limits training quality and poses risks around PII and malicious content.
This ticket adds a required data cleaning phase that is configurable, fast, and integrated into the distributed preprocessing loop. The goal is to improve model quality, reduce noise, follow best practices (see OLMo-2), and meet responsible AI standards by removing PII and malware from the training corpus.
🚀 Execution Plan
Step 1: What is the smallest working version?
- Extend `fast-llm prepare` to apply a modular and configurable cleaning pipeline during preprocessing (see the sketch below).
- All cleaning steps must be integrated into the existing `torchrun` CPU-only distributed setup, preserving parallelism.
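As a rough illustration, the pipeline could be structured around a small filter interface. All names below (`DocumentFilter`, `CleaningPipeline`, `keep`) are hypothetical, not existing Fast-LLM APIs:

```python
from abc import ABC, abstractmethod


class DocumentFilter(ABC):
    """One configurable cleaning step; decides whether a document is kept."""

    name: str = "filter"

    @abstractmethod
    def keep(self, text: str) -> bool:
        ...


class CleaningPipeline:
    """Applies filters in order, counting removals per filter for logging."""

    def __init__(self, filters: list[DocumentFilter]) -> None:
        self.filters = filters
        self.removed = {f.name: 0 for f in filters}

    def keep(self, text: str) -> bool:
        for f in self.filters:
            if not f.keep(text):
                self.removed[f.name] += 1
                return False
        return True
```

Each torchrun rank would build one pipeline and apply it to its own shard of documents before tokenization, so the existing parallelism is untouched.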
Step 2: Required cleaning filters (all must be implemented):
- Length filtering
- Remove documents exceeding a configurable max length (in characters or tokens).
- n-gram repetition
- Remove documents with ≥N repeated n-grams (default: N = 32), as in OLMo-2; a sketch of this and the frequency filter follows the list.
- Frequency-based filtering
- Remove documents where:
- The most frequent word exceeds X% of total tokens (default: 30%).
- The top-2 most frequent words together exceed Y% of total tokens (default: 50%).
- Binary content filtering
- Remove documents that contain mostly binary data.
- Numerical content filtering
- Remove documents with a high fraction of numeric tokens (configurable threshold, default: 50%).
- PII redaction
- Integrate Microsoft Presidio to detect sensitive personal information and either redact it or remove the affected documents.
- Malware removal
- Integrate ClamAV to scan documents and remove any that trigger detections.
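To make the thresholds above concrete, here is a minimal sketch of the n-gram repetition and frequency-based filters. It assumes whitespace-tokenized documents and reads the ticket's defaults one plausible way; the exact semantics would need to be pinned down during implementation:

```python
from collections import Counter


def repeated_ngram_count(tokens: list[str], n: int = 3) -> int:
    """Count distinct n-grams that occur more than once in the document."""
    counts = Counter(tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1))
    return sum(1 for c in counts.values() if c > 1)


def keep_by_word_frequency(
    tokens: list[str], top1_max: float = 0.30, top2_max: float = 0.50
) -> bool:
    """Drop documents dominated by one or two words (ticket defaults: 30% / 50%)."""
    if not tokens:
        return False  # treat empty documents as removable
    top = Counter(tokens).most_common(2)
    top1_frac = top[0][1] / len(tokens)
    top2_frac = sum(c for _, c in top) / len(tokens)
    return top1_frac <= top1_max and top2_frac <= top2_max
```

Under this reading, a document is dropped when `repeated_ngram_count(tokens) >= 32` or when `keep_by_word_frequency(tokens)` returns `False`.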
All thresholds and filter behaviors must be exposed via the CLI and config files. Document-level logs or counters should be maintained for each filter to aid debugging and analysis.
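For illustration only, the YAML surface might look like the following; the key names are hypothetical and would need to match Fast-LLM's actual config schema:

```yaml
# Hypothetical schema; key names are illustrative, not final.
data_cleaning:
  max_length: 100000            # characters (or tokens)
  ngram_repetition:
    n: 3
    max_repeated: 32
  word_frequency:
    top1_max_fraction: 0.30
    top2_max_fraction: 0.50
  binary: true                  # drop mostly-binary documents
  numeric_fraction_max: 0.50
  pii:
    engine: presidio
    action: redact              # or: remove
  malware:
    engine: clamav
    action: remove
```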
📌 Acceptance Criteria
- All listed filters are implemented and integrated into `fast-llm prepare`.
- Cleaning is fully configurable, both via CLI and YAML config files.
- The implementation works with the existing distributed CPU setup (torchrun + Gloo).
- Performance remains acceptable.
- Logs report how many documents each filter removed (aggregated across ranks; see the sketch after this list).
- Code is tested and documented.
- PR includes a performance/impact summary and example CLI usage.
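Since each rank only sees its own shard, the per-filter counters need a reduction before logging. A minimal sketch, assuming the Gloo process group from `fast-llm prepare`'s existing torchrun setup is already initialized (Gloo supports `all_reduce` on CPU tensors):

```python
import torch
import torch.distributed as dist


def aggregate_filter_counts(local_counts: dict[str, int]) -> dict[str, int]:
    """Sum per-filter removal counters across all ranks for final logging."""
    names = sorted(local_counts)  # fixed ordering so all ranks agree
    counts = torch.tensor([local_counts[n] for n in names], dtype=torch.int64)
    dist.all_reduce(counts, op=dist.ReduceOp.SUM)
    return dict(zip(names, counts.tolist()))
```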
🛠️ Project Management
- [x] Assign the project to the Fast-LLM project.
- [ ] Set the `Estimate` field (in days) in the GitHub project.
- [x] Use the `Size` field to categorize the PR size (Small/Medium/Large).
- [x] Assign an owner when opening the issue.
@tscholak Is this still relevant? Let's describe or close
@jlamypoirier it's more relevant than ever.
How should we manage model downloading and loading for Presidio, and virus database handling for ClamAV?
More details are available here.
The likely approach is to specify a cache folder and download the necessary files if they are not already present. However, we need to determine whether this should happen per run or persist across multiple runs. If the latter, we must address potential conflicts between different parallel executions.
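One possible shape for the persistent-cache approach, using the third-party `filelock` package to guard against concurrent downloads; all names here (`ensure_cached`, `download_fn`) are hypothetical:

```python
import os
from pathlib import Path

from filelock import FileLock  # third-party: pip install filelock


def ensure_cached(cache_dir: Path, name: str, download_fn) -> Path:
    """Fetch an artifact (e.g. a Presidio model or the ClamAV database) once,
    even when several ranks or runs share a persistent cache directory."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / name
    with FileLock(str(target) + ".lock"):  # serializes concurrent downloaders
        if not target.exists():
            tmp = target.with_name(name + ".tmp")
            download_fn(tmp)         # caller-supplied download logic
            os.replace(tmp, target)  # atomic rename publishes the file
    return target
```

A per-artifact lock plus an atomic rename would let the cache persist across runs while preventing parallel executions from clobbering each other; how often to refresh the artifacts (per run vs. per job) would still need to be decided.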