datatrove

Summary stats

Open hynky1999 opened this issue 10 months ago • 0 comments

This PR adds blocks for computing summary statistics

  • Added a base block for computing summary statistics.
  • Added a merging block that combines summary statistics computed in a distributed manner.
  • Added blocks for line, document, word, token, and contamination-level stats.
  • Added tests covering the base block together with the merger, plus tests for the individual summary stats blocks.
  • Added a usage example.
  • Updated the README.
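
To illustrate what merging distributed results means here, a minimal sketch, assuming each worker emits a plain dict mapping a group key to a running total. The function name merge_partial_stats and the dict layout are hypothetical and are not the merger block's actual API.

```python
from collections import Counter


def merge_partial_stats(partials):
    """Sum per-key totals from several workers into one dict (hypothetical helper)."""
    merged = Counter()
    for partial in partials:
        # Counter.update adds values for keys that already exist.
        merged.update(partial)
    return dict(merged)


# Two workers each saw part of the data.
worker_a = {"a.example.com": 5, "b.example.org": 3}
worker_b = {"a.example.com": 2}
merged = merge_partial_stats([worker_a, worker_b])
```

Summing works because totals and counts are additive; a statistic such as a mean would have to be derived from the merged total and count afterwards rather than merged directly.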

How stats work

  • Each stat block has to implement extract_stats, which defines which statistics to track.
  • Statistics are then grouped by:
      • summary -> all documents together
      • fqdn -> documents grouped by FQDN
      • suffix -> documents grouped by suffix
      • histogram -> documents grouped by the value of the statistic
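
The grouping described above can be sketched in plain Python, without importing datatrove. Everything below is illustrative: the standalone extract_stats, key_and_amount and aggregate are hypothetical names, the suffix is taken naively as the last label of the hostname, and the histogram is assumed to count documents per statistic value.

```python
from collections import defaultdict
from urllib.parse import urlparse

GROUPINGS = ("summary", "fqdn", "suffix", "histogram")


def extract_stats(text):
    """Hypothetical stand-in for a stat block's extract_stats."""
    return {"length": len(text), "n_lines": len(text.splitlines())}


def key_and_amount(grouping, url, value):
    """Return the bucket key and the amount to add for one grouping."""
    fqdn = urlparse(url).hostname or ""
    if grouping == "summary":
        return "summary", value
    if grouping == "fqdn":
        return fqdn, value
    if grouping == "suffix":
        # Naive assumption: last dot-separated label of the hostname.
        return fqdn.rsplit(".", 1)[-1], value
    if grouping == "histogram":
        # Assumption: bucket by the statistic's value and count documents.
        return str(value), 1
    raise ValueError(f"unknown grouping: {grouping}")


def aggregate(docs, groupings=GROUPINGS):
    """docs is an iterable of (url, text); returns totals[grouping][stat][key]."""
    totals = {g: defaultdict(lambda: defaultdict(int)) for g in groupings}
    for url, text in docs:
        for stat, value in extract_stats(text).items():
            for grouping in groupings:
                key, amount = key_and_amount(grouping, url, value)
                totals[grouping][stat][key] += amount
    return totals


docs = [
    ("https://a.example.com/x", "ab\ncd"),
    ("https://b.example.org/y", "abc"),
]
totals = aggregate(docs)
```

With these two documents, totals["fqdn"]["length"] holds one total per hostname, while totals["histogram"]["length"] holds a document count per distinct length.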

Misc

  • Fixed a bug in from_dict in src/datatrove/utils/stats.py.

hynky1999, Apr 20 '24 13:04