datatrove
Summary stats
This PR adds blocks for computing summary statistics.
- Added a base block for creating summary statistics.
- Added a merging block for merging summary statistics computed in a distributed manner.
- Added blocks for line, document, word, token, and contamination level stats.
- Added tests checking that the base block and the merger work together, plus tests for the individual summary stats blocks.
- Example of how to use the blocks
- Readme update
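To illustrate why a separate merging step is needed when statistics are computed in a distributed manner, here is a minimal sketch. It is not the merging block from this PR: the function name `merge_stats` and the `(total, count)` layout of the partial results are assumptions made only for this example.

```python
def merge_stats(partials):
    """Combine per-worker partial results into one result.

    Hypothetical layout (an assumption, not datatrove's format):
    each partial maps a group key to a (total, count) pair.
    """
    merged = {}
    for partial in partials:
        for key, (total, count) in partial.items():
            prev_total, prev_count = merged.get(key, (0.0, 0))
            # Sums and counts are additive, so they can be merged in any order.
            merged[key] = (prev_total + total, prev_count + count)
    return merged


# Two workers saw overlapping groups; merging adds their totals and counts.
worker_a = {"a.com": (3.0, 2)}
worker_b = {"a.com": (1.0, 1), "b.org": (4.0, 1)}
merged = merge_stats([worker_a, worker_b])
```

Keeping totals and counts rather than means is a design choice of this sketch: a mean can be derived after merging, but two means cannot be combined without their counts.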
How stats work
- Each stat block has to implement `extract_stats`, which defines what statistics to track.
- Statistics are then grouped based on:
- summary -> All documents together
- fqdn -> documents grouped by fqdn
- suffix -> documents grouped by suffix
- histogram -> grouping is done based on values of statistics
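The flow described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code added by this PR: the class `LineCountStats`, the helpers `group_keys` and `compute`, the `(url, text)` document shape, and the `[total, count]` accumulator are all hypothetical. The suffix is approximated here as the last label of the host name, which may differ from how the PR derives it.

```python
from collections import defaultdict
from urllib.parse import urlparse


class LineCountStats:
    """Hypothetical stat block; only the method name extract_stats comes from the PR."""

    def extract_stats(self, text):
        # Defines which statistics to track for one document.
        return {"n_lines": len(text.splitlines())}


def group_keys(url, value):
    """Return the key a document falls under for each grouping."""
    host = urlparse(url).hostname or ""
    return {
        "summary": "summary",                # all documents together
        "fqdn": host,                        # grouped by fully qualified domain name
        "suffix": host.rsplit(".", 1)[-1],   # assumption: last label of the host
        "histogram": str(value),             # grouped by the statistic's own value
    }


def compute(block, docs):
    """Accumulate [total, count] per grouping, statistic name, and group key."""
    totals = defaultdict(lambda: defaultdict(lambda: defaultdict(lambda: [0.0, 0])))
    for url, text in docs:
        for name, value in block.extract_stats(text).items():
            for grouping, key in group_keys(url, value).items():
                cell = totals[grouping][name][key]
                cell[0] += value
                cell[1] += 1
    return totals


docs = [
    ("https://a.example.com/x", "one\ntwo"),
    ("https://b.example.org/y", "one"),
    ("https://a.example.com/z", "l1\nl2\nl3"),
]
totals = compute(LineCountStats(), docs)
```

With this layout, `totals["fqdn"]["n_lines"]` holds one accumulator per host, while `totals["histogram"]["n_lines"]` holds one per distinct line count.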
Misc
- Fixes a bug in `from_dict` at `src/datatrove/utils/stats.py`