A model component (e.g. the Swahili encoder) is likely to exist on multiple devices. Because each device samples its own task sequence, it is possible that when a gradient synchronization...
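To make the synchronization concern concrete, here is a minimal sketch of per-component gradient averaging restricted to the devices that hold that component, assuming PyTorch distributed; the component and process-group names below are illustrative, not mammoth's actual implementation:

```python
import torch
import torch.distributed as dist

# Assumes dist.init_process_group(...) has already been called on every process.
# Hypothetical: ranks 0, 2 and 3 hold a replica of the Swahili encoder.
swahili_group_ranks = [0, 2, 3]
swahili_group = dist.new_group(ranks=swahili_group_ranks)

def sync_component_grads(component: torch.nn.Module, group) -> None:
    """All-reduce (average) the gradients of one component within its process group."""
    group_size = dist.get_world_size(group=group)
    for param in component.parameters():
        if param.grad is None:
            # This device did not use the component in the current step;
            # contribute zeros so the collective call does not deadlock.
            param.grad = torch.zeros_like(param)
        dist.all_reduce(param.grad, op=dist.ReduceOp.SUM, group=group)
        param.grad /= group_size
```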
Closes #63. Same idea as v2, didn't bother porting from it. Does not include bucket states, although this could maybe be done by picking the line indices from all...
Currently, the `--train_from` option does not include any means of restoring corpora states, hence training resumes from the beginning of the bitexts. This entails that resumed models train on a subset...
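A rough sketch of what restoring corpora states could look like, assuming each corpus iterator can report and re-seek a line offset; all names here are hypothetical, not the actual checkpoint format:

```python
# Hypothetical sketch: persist per-corpus line offsets in the checkpoint so
# that `--train_from` can resume mid-bitext instead of from the beginning.
def corpora_state(iterators):
    """Collect the current line index of every corpus iterator."""
    return {corpus_id: it.line_index for corpus_id, it in iterators.items()}

def restore_corpora_state(iterators, state):
    """Fast-forward each corpus iterator to its saved line index."""
    for corpus_id, line_index in state.items():
        iterators[corpus_id].seek_line(line_index)  # hypothetical method

# checkpoint["corpora_state"] = corpora_state(iterators)            # at save time
# restore_corpora_state(iterators, checkpoint["corpora_state"])     # under --train_from
```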
closes #60
Going through the existing catalogue of options listed in our docs, a number of them do not seem to be plugged in. The list below is most likely not exhaustive. ###...
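One way to audit this could be a small script that checks, for every documented option, whether the corresponding attribute is ever read in the source tree; a rough sketch (the paths and naming pattern are assumptions about the repo layout):

```python
import pathlib
import re

# Hypothetical audit: for every `--option` mentioned in the docs, check whether
# an `opt.<option>` / `opts.<option>` attribute access appears anywhere in the code.
DOCS = pathlib.Path("docs")
SRC = pathlib.Path("mammoth")

doc_text = "\n".join(p.read_text() for p in DOCS.rglob("*.md"))
documented = set(re.findall(r"--([a-z0-9_]+)", doc_text))

src_text = "\n".join(p.read_text() for p in SRC.rglob("*.py"))
unused = [name for name in sorted(documented)
          if not re.search(rf"\bopts?\.{name}\b", src_text)]

print("Possibly unplugged options:")
print("\n".join(f"  --{name}" for name in unused))
```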
Currently, we only support training encoder-decoder models. We might want to support encoder-only (e.g. BERT) and decoder-only models (e.g. GPTs). This could be inferred automatically from the types of sharing...
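For illustration, the inference could look something like the sketch below (the task fields `enc_layers`/`dec_layers` are assumptions, not the actual config schema):

```python
# Hypothetical sketch: infer the architecture from which layer stacks the tasks define.
def infer_model_type(tasks):
    has_encoder = any(task.get("enc_layers") for task in tasks)
    has_decoder = any(task.get("dec_layers") for task in tasks)
    if has_encoder and has_decoder:
        return "encoder-decoder"
    if has_encoder:
        return "encoder-only"   # e.g. BERT-style masked LM
    if has_decoder:
        return "decoder-only"   # e.g. GPT-style causal LM
    raise ValueError("tasks define neither encoder nor decoder layers")
```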
Add a feature to learn virtual embeddings for prompt/prefix learning on a pretrained model. This would depend on #24 being implemented first.
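A minimal sketch of the idea, assuming a frozen pretrained model whose input embeddings get learned "virtual" vectors prepended to them (names and shapes are illustrative):

```python
import torch
import torch.nn as nn

class PrefixEmbedding(nn.Module):
    """Learn `n_virtual` embeddings that are prepended to the token embeddings."""

    def __init__(self, n_virtual: int, d_model: int):
        super().__init__()
        self.virtual = nn.Parameter(torch.randn(n_virtual, d_model) * 0.02)

    def forward(self, token_embeddings: torch.Tensor) -> torch.Tensor:
        # token_embeddings: (batch, seq_len, d_model)
        batch = token_embeddings.size(0)
        prefix = self.virtual.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prefix, token_embeddings], dim=1)

# Usage sketch: freeze the pretrained model, train only the prefix parameters.
# for p in pretrained_model.parameters():
#     p.requires_grad_(False)
# prefix = PrefixEmbedding(n_virtual=20, d_model=512)
```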
Currently, we rely on custom-made layer / encoder definitions for our modules. Cf. for instance this class: https://github.com/Helsinki-NLP/mammoth/blob/c6a193b1cc16bf7140520c44712bcf82701ec87d/mammoth/modules/transformer_encoder.py#L13 This entails that any architectural variant we wish to test has to...
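For instance, an encoder stack could in principle be assembled from standard building blocks rather than custom classes; a sketch using stock `torch.nn` modules (not mammoth's actual interface):

```python
import torch.nn as nn

# Hypothetical alternative to the custom TransformerEncoder: reuse the stock
# torch.nn implementation so that architectural variants only change arguments,
# not hand-written layer code.
def build_encoder(d_model=512, n_heads=8, ff_dim=2048, n_layers=6, dropout=0.1):
    layer = nn.TransformerEncoderLayer(
        d_model=d_model,
        nhead=n_heads,
        dim_feedforward=ff_dim,
        dropout=dropout,
        batch_first=True,
    )
    return nn.TransformerEncoder(layer, num_layers=n_layers)
```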
Freezing some of the modules would allow training adapters as actual adapters. Ideally, this would entail introducing some mechanism to mark specific layer stacks/adapters in the config as not requiring gradients....
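A minimal sketch of such a mechanism, assuming a config field (hypothetically called `frozen_modules` here) listing the layer stacks/adapters to freeze by name:

```python
# Hypothetical sketch: disable gradients on the modules named in the config.
# `frozen_modules` and the lookup by attribute name are assumptions, not the
# actual config schema.
def freeze_modules(model, frozen_module_names):
    for name in frozen_module_names:
        module = getattr(model, name)
        for param in module.parameters():
            param.requires_grad_(False)
        module.eval()  # also disable dropout / batch-norm updates while frozen

# freeze_modules(model, config.get("frozen_modules", []))
```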
Load imbalance is a very likely candidate for the scaling issues we faced. This PR introduces a couple of new flags to enforce equal load across nodes, which seems to result...
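As an illustration of what "equal load" could mean in practice, here is a greedy assignment of tasks to nodes by estimated cost; this is a sketch only, and the actual flags and heuristics in the PR may differ:

```python
import heapq

# Hypothetical sketch: greedily assign each task to the currently least-loaded
# node so that per-node load stays roughly equal. Task costs (e.g. corpus
# sizes) are assumed to be known up front.
def balance_tasks(task_costs: dict, n_nodes: int) -> dict:
    heap = [(0.0, node) for node in range(n_nodes)]   # (load, node_id)
    heapq.heapify(heap)
    assignment = {node: [] for node in range(n_nodes)}
    for task, cost in sorted(task_costs.items(), key=lambda kv: -kv[1]):
        load, node = heapq.heappop(heap)
        assignment[node].append(task)
        heapq.heappush(heap, (load + cost, node))
    return assignment

# balance_tasks({"en-sw": 3.1e6, "en-fi": 1.2e6, "en-et": 0.8e6}, n_nodes=2)
```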