firefox-translations-training
Display corpus size in W&B
We should display things we look at often in W&B. The final merged corpus size after deduplication is something I check periodically to understand how aggressive the cleaning is overall. We could also display the corpus size after each cleaning stage, as discussed with @gregtatum; that should probably be part of the analysis job.
As discussed today, we would do one of the following:
- expose the list of TSV files to parse_tc_logs so it can count their lines and publish that number
- count the number of lines directly in train.sh and provide it to parse_tc_logs so it publishes that number
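For the first option, the counting step could look roughly like the sketch below. It is only an illustration: `count_lines` and `corpus_sizes` are hypothetical helper names, not existing functions in parse_tc_logs, and it assumes the TSV files are plain text or gzip-compressed with one sentence pair per line.

```python
import gzip
from pathlib import Path
from typing import Dict, Iterable


def count_lines(path: Path) -> int:
    """Count the lines of a plain or gzip-compressed text file."""
    # Assumption: only ".gz" files are compressed; other formats would
    # need their own opener.
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rb") as f:
        # Iterating a binary file yields one chunk per line, so this
        # streams the file without loading it into memory.
        return sum(1 for _ in f)


def corpus_sizes(paths: Iterable[Path]) -> Dict[str, int]:
    """Map each file name to its line count (hypothetical helper)."""
    return {p.name: count_lines(p) for p in paths}
```

The resulting mapping could then be handed to whatever publishing step the parser already uses. Keys are file names only, so files with the same name in different directories would collide; using full paths as keys would avoid that.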
We could also provide the final OpusTrainer config to the parser. It includes paths to the training datasets.
I think the idea of this ticket was also to display the size of the corpus after the different cleaning steps, but we can start by uploading only the size of the final corpus.