Greg Tatum comments

Results: 375 comments by Greg Tatum

[meta] Ship 30 languages

This is a meta bug, so I don't think it needs to be assigned to me.

In the cleaning task, output statistics about filtered out/kept sentences and maybe attach list of filtered out sentences as an artifact

It would be nice to make this consistent, with a `.stats.json` for every step, and then finally show all of this information in W&B. For instance in bicleaner.sh: ```diff diff...
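
To make the per-step `.stats.json` idea concrete, here is a minimal sketch in Python. It assumes a step that knows how many sentences it read and how many it kept. The function name `write_step_stats`, the JSON keys, and the sidecar file naming are all hypothetical and are not taken from the pipeline.

```python
import json
from pathlib import Path


def write_step_stats(output_path: Path, total: int, kept: int) -> Path:
    """Write a sidecar `<output>.stats.json` next to a step's output.

    The key names and file layout here are assumptions for illustration only.
    """
    stats = {
        "total": total,
        "kept": kept,
        "filtered": total - kept,
    }
    # Append the suffix to the full file name, e.g. corpus.en -> corpus.en.stats.json
    stats_path = output_path.with_name(output_path.name + ".stats.json")
    stats_path.write_text(json.dumps(stats, indent=2) + "\n", encoding="utf-8")
    return stats_path
```

If every step wrote a file of the same shape, a later reporting step could collect them all without step-specific parsing.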

Add Hugging Face data importer

We'll need to be careful that we don't open the pipeline up to arbitrary Python execution. I noticed a few warnings about this while navigating Hugging Face, but I haven't looked into...

Getting a quick sample of data from the artifacts

I would like to have more samples of data in the artifacts just in general. It would be nice to have statistics about corpus size, how much was filtered, samples...
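
One possible way to pull a small sample from a large corpus in a single pass is reservoir sampling. The sketch below is only an illustration of that technique, not something the pipeline is known to do. The function name `sample_lines` and its parameters are hypothetical.

```python
import random
from typing import Iterable, List


def sample_lines(lines: Iterable[str], k: int, seed: int = 0) -> List[str]:
    """Reservoir-sample up to k lines in one pass (Algorithm R)."""
    rng = random.Random(seed)  # seeded so repeated runs give the same sample
    reservoir: List[str] = []
    for i, line in enumerate(lines):
        if i < k:
            # Fill the reservoir with the first k lines.
            reservoir.append(line)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = line
    return reservoir
```

Because the input is any iterable of strings, this works on a file handle without loading the whole corpus into memory.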

Support training without a monolingual corpus

This is probably blocked on #417.

TaskCluster not via-CI

I'm not a Taskcluster expert, and maybe others can chime in here. This has information on the taskgraph that is generated: https://taskcluster-taskgraph.readthedocs.io/en/latest/ If you run `utils/preflight_check.py`, it will generate...

Support training continuation for the failed or preempted tasks

(Copying over my thoughts from #315.) For reference, this is the definition of a [preemptible instance](https://cloud.google.com/compute/docs/instances/preemptible). During the Catalan run the teacher training would often take 2 or 3 times...

Support training continuation for the failed or preempted tasks

The next step here is to load in the previous artifacts and restart the training.

fix Dataset importer problems

Oh wait, the tests aren't passing. We should investigate that before merging.
