firefox-translations-training Display corpus size in W&B


Open eu9ene opened this issue 10 months ago • 3 comments

We should display the things we look at often in W&B. The final merged corpus size after deduplication is something I check periodically to understand how aggressive the cleaning is overall. We can also display the corpus size after each cleaning stage, as we discussed with @gregtatum, which should probably be part of the analysis job.

Apr 16 '24 18:04 eu9ene

As discussed today, we would do one of the following:

- expose the list of TSV files to parse_tc_logs so it can count their lines and publish that number
- count the number of lines directly in train.sh and provide it to parse_tc_logs so it publishes that number
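
For illustration, the first option could look roughly like the sketch below. This is not the actual parse_tc_logs code: the helper names `count_lines` and `corpus_sizes` are hypothetical, and it assumes one sentence pair per line in each TSV file, optionally gzip-compressed.

```python
import gzip
from pathlib import Path


def count_lines(path):
    """Count the lines in a plain or gzip-compressed text file."""
    # Assumption: a ".gz" suffix means the file is gzip-compressed.
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rb") as f:
        # Iterating in binary mode avoids decoding errors and keeps memory flat.
        return sum(1 for _ in f)


def corpus_sizes(paths):
    """Map each file name to its line count (hypothetical helper)."""
    return {Path(p).name: count_lines(p) for p in paths}
```

The resulting mapping is what the parser would then publish; how it gets attached to the W&B run is left out here on purpose.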

Jul 01 '24 16:07 La0

We could also provide the final OpusTrainer config to the parser. It includes paths to the training datasets.
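
A minimal sketch of that idea, assuming the config has already been loaded into a dict (for example with a YAML loader) and that it has a top-level `datasets` mapping of dataset name to path. The function name `dataset_paths` is hypothetical.

```python
def dataset_paths(config):
    """Return {dataset name: path} from an already-parsed OpusTrainer config.

    Assumption: the config carries a top-level "datasets" mapping.
    A missing or empty section yields an empty dict.
    """
    datasets = config.get("datasets") or {}
    return {name: str(path) for name, path in datasets.items()}
```

The parser could then count the lines of each returned path and publish one size per dataset name.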

Jul 01 '24 16:07 eu9ene

I think the idea of this ticket was also to display the size of the corpus after the different cleaning steps, but we can start by uploading only the size of the final corpus.

Jul 01 '24 16:07 eu9ene
