Results: 18 issues by Timothee Mickus

A model component (e.g. the Swahili encoder) is likely to exist on multiple devices. Because each device samples its own task sequence, it is possible that when a gradient synchronization...
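A minimal PyTorch sketch of one way such a situation could be handled (a hypothetical helper, not MAMMOTH's actual synchronisation code): all-reduce the component's gradients only within the process group of ranks holding a replica, with ranks whose sampled tasks skipped the component contributing zeros so the collective does not deadlock.

```python
import torch
import torch.distributed as dist

def sync_component_grads(component: torch.nn.Module,
                         group: dist.ProcessGroup,
                         used_this_step: bool) -> None:
    """All-reduce the gradients of `component` across the ranks in `group`.

    `group` is assumed to contain exactly the ranks that hold a replica of this
    component; `used_this_step` says whether this rank's sampled tasks touched it.
    """
    for param in component.parameters():
        if param.grad is None or not used_this_step:
            # Contribute zeros so every rank in the group enters the collective
            # and the all-reduce does not hang.
            param.grad = torch.zeros_like(param)
        dist.all_reduce(param.grad, op=dist.ReduceOp.SUM, group=group)
        # Plain averaging over the group; weighting by the number of ranks that
        # actually produced a gradient would be a possible refinement.
        param.grad /= dist.get_world_size(group=group)
```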

enhancement

Closes #63. Same idea as v2; didn't bother porting from it. Does not include bucket states, although this could maybe be done by picking the line indices from all...

Currently, the `--train_from` option does not provide a means of restoring corpus states, so training resumes from the beginning of the bitexts. This entails that resumed models are training on a subset...
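As a hedged sketch of what restoring corpus state could look like (helper names are hypothetical, not the actual `--train_from` machinery): record how many lines of each bitext have been consumed in the checkpoint, and fast-forward the corpus iterators on resume.

```python
import itertools
from typing import Iterator

def save_corpus_states(checkpoint: dict, lines_consumed: dict) -> None:
    """Store per-corpus progress in the checkpoint, e.g. {"en-sw": 123456, "en-fi": 98765}."""
    checkpoint["corpus_states"] = dict(lines_consumed)

def restore_corpus_iterator(path: str, lines_consumed: int) -> Iterator[str]:
    """Reopen a bitext and skip the lines that were already seen before checkpointing."""
    fh = open(path, encoding="utf-8")
    return itertools.islice(fh, lines_consumed, None)
```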

enhancement

Closes #60

Going through the existing catalogue of options listed in our docs, a number of them seem not to be plugged in anywhere. The list below is most likely not exhaustive. ###...

bug
documentation
enhancement

Currently, we only support training encoder-decoder models. We might want to support encoder-only models (e.g. BERT) and decoder-only models (e.g. GPTs). This could be inferred automatically from the types of sharing...
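A rough sketch of the automatic inference, assuming tasks declare their sharing groups under config fields like `enc_sharing_group` / `dec_sharing_group` (field names assumed here, not checked against the actual config schema):

```python
def infer_architecture(task_configs: list) -> str:
    """Guess the architecture from which layer stacks the tasks actually define."""
    has_encoder = any(task.get("enc_sharing_group") for task in task_configs)
    has_decoder = any(task.get("dec_sharing_group") for task in task_configs)
    if has_encoder and has_decoder:
        return "encoder-decoder"
    if has_encoder:
        return "encoder-only"   # BERT-style
    if has_decoder:
        return "decoder-only"   # GPT-style
    raise ValueError("No encoder or decoder stacks defined in the task configs.")
```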

enhancement
good first issue

Add a feature to learn virtual embeddings for prompt/prefix learning on a pretrained model. This would depend on #24 being implemented first.
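A minimal sketch of what the feature could look like, assuming a frozen pretrained model whose input is a `(batch, seq_len, d_model)` tensor of token embeddings; class and parameter names are illustrative only:

```python
import torch
import torch.nn as nn

class PrefixEmbedding(nn.Module):
    """Learns `n_virtual` prefix embeddings prepended to the (frozen) model's input."""

    def __init__(self, n_virtual: int, d_model: int):
        super().__init__()
        self.virtual = nn.Parameter(torch.randn(n_virtual, d_model) * 0.02)

    def forward(self, token_embeddings: torch.Tensor) -> torch.Tensor:
        batch_size = token_embeddings.size(0)
        prefix = self.virtual.unsqueeze(0).expand(batch_size, -1, -1)
        return torch.cat([prefix, token_embeddings], dim=1)

# Usage sketch: freeze the pretrained model and optimise only the prefix parameters.
# prefix = PrefixEmbedding(n_virtual=20, d_model=512)
# optimizer = torch.optim.Adam(prefix.parameters(), lr=1e-4)
```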

enhancement

Currently, we rely on custom-made layer / encoder definitions for our modules. Cf. for instance this class: https://github.com/Helsinki-NLP/mammoth/blob/c6a193b1cc16bf7140520c44712bcf82701ec87d/mammoth/modules/transformer_encoder.py#L13 This entails that any architectural variant we wish to test has to...

enhancement
good first issue

Freezing some of the modules would allow training adapters as actual adapters. Ideally, this would entail introducing some mechanism in the config to mark specific layerstacks/adapters as not requiring gradients...
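A hedged sketch of the mechanism, with a hypothetical config-driven helper (not existing MAMMOTH code): modules whose names are marked as frozen simply get `requires_grad` switched off.

```python
import torch.nn as nn

def apply_freeze_config(modules: dict, frozen_names: set) -> None:
    """Disable gradients for every module whose name is listed as frozen in the config."""
    for name, module in modules.items():
        if name in frozen_names:
            module.requires_grad_(False)
            module.eval()  # also disable dropout / running-stat updates in the frozen parts
```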

enhancement
good first issue

Load imbalance is a very likely candidate for the scaling issues we faced. This PR introduces a couple of new flags to enforce equal load across nodes, which seems to result...
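The general idea behind enforcing equal load can be illustrated with a simple round-robin assignment (the actual flag names and balancing strategy are not spelled out in this excerpt, so the helper below is only an assumption-laden sketch):

```python
def assign_tasks_round_robin(task_ids: list, n_nodes: int) -> dict:
    """Spread tasks evenly over nodes; node loads differ by at most one task."""
    assignment = {node: [] for node in range(n_nodes)}
    for i, task in enumerate(task_ids):
        assignment[i % n_nodes].append(task)
    return assignment

# assign_tasks_round_robin(["en-sw", "en-fi", "en-et", "en-de"], n_nodes=2)
# -> {0: ["en-sw", "en-et"], 1: ["en-fi", "en-de"]}
```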