datatrove



Summary stats

Open hynky1999 opened this issue 10 months ago • 0 comments

This PR adds blocks for computing summary statistics:

Added a base block for creating summary statistics.
Added a merging block for combining summary statistics computed in a distributed manner.
Added blocks for line, document, word, token, and contamination-level stats.
Added tests covering the base block and the merger, plus tests for the individual summary stats blocks.
Added an example showing how to use the blocks.
Updated the Readme.
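
To illustrate why a separate merging step is enough for the distributed case, here is a minimal sketch of a mergeable running statistic. It is not the class added in this PR: RunningStat and its fields are hypothetical names chosen for illustration. Each worker updates its own instance, and the partial results are combined afterwards.

```python
import math
from dataclasses import dataclass


@dataclass
class RunningStat:
    """Hypothetical mergeable summary of one numeric statistic."""

    n: int = 0
    total: float = 0.0
    minimum: float = math.inf
    maximum: float = -math.inf

    def update(self, x: float) -> None:
        # Fold a single observation into the running summary.
        self.n += 1
        self.total += x
        self.minimum = min(self.minimum, x)
        self.maximum = max(self.maximum, x)

    def merge(self, other: "RunningStat") -> "RunningStat":
        # Combine two partial summaries without revisiting the documents.
        return RunningStat(
            n=self.n + other.n,
            total=self.total + other.total,
            minimum=min(self.minimum, other.minimum),
            maximum=max(self.maximum, other.maximum),
        )

    @property
    def mean(self) -> float:
        return self.total / self.n if self.n else 0.0


# Two workers each see a different shard of the data.
worker_a, worker_b = RunningStat(), RunningStat()
for value in (1, 2, 3):
    worker_a.update(value)
worker_b.update(10)

merged = worker_a.merge(worker_b)
```

Because count, sum, minimum, and maximum can all be combined pairwise, the merged result is the same as if one worker had processed every document.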

How stats work

Each stat block has to implement extract_stats, which defines which statistics to track.
The statistics are then grouped by:

summary -> all documents together
fqdn -> documents grouped by fqdn
suffix -> documents grouped by suffix
histogram -> documents grouped by the value of the statistic
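
The four groupings above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from this PR: only the name extract_stats comes from the description, while its signature here, group_keys, aggregate, the naive suffix rule (last dot-separated label of the host), and the choice to count documents per value in the histogram are all assumptions.

```python
from collections import defaultdict
from urllib.parse import urlparse


def extract_stats(text: str) -> dict[str, float]:
    """Hypothetical per-document statistics (stat names are illustrative)."""
    return {"n_lines": len(text.splitlines()), "n_chars": len(text)}


def group_keys(url: str, value: float) -> dict[str, tuple[str, float]]:
    """Return (key, amount to add) for each grouping of one statistic."""
    fqdn = urlparse(url).hostname or ""
    # Assumption: suffix is simply the last label, with no public-suffix list.
    suffix = fqdn.rsplit(".", 1)[-1]
    return {
        "summary": ("summary", value),   # every document shares one key
        "fqdn": (fqdn, value),
        "suffix": (suffix, value),
        "histogram": (str(value), 1),    # assumption: count documents per value
    }


def aggregate(docs):
    # totals[grouping][stat_name][key] -> accumulated amount
    totals = defaultdict(lambda: defaultdict(lambda: defaultdict(float)))
    for url, text in docs:
        for stat_name, value in extract_stats(text).items():
            for grouping, (key, amount) in group_keys(url, value).items():
                totals[grouping][stat_name][key] += amount
    return totals


docs = [
    ("https://a.example.com/x", "one\ntwo"),
    ("https://b.example.org/y", "abc"),
]
totals = aggregate(docs)
```

In this sketch a stat block only has to supply extract_stats; the grouping and accumulation logic is shared, which is the role the description gives to the base block.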

Misc

Fixes a bug in from_dict in src/datatrove/utils/stats.py

Apr 20 '24 13:04 hynky1999