Evgeny Pavlov
### Proposal

I expect new and updated models to pop up more often. There are our own retraining efforts, the consortium is training new models, and some third-party organizations do as well...
For a 300M dataset with 128 GB RAM, the workaround is to shuffle the dataset after the merge step, disable `--shuffle-in-ram`, and use `--shuffle batches`
I'm continuing to test the pipeline and I see that almost all teacher models don't continue training, even after I increased patience by setting `early-stopping: 20`. Currently, continuation happens by training...
We need this to prevent further training if there is a bug. We can add an assert to the evaluation script that checks that the metrics are higher than some...
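A minimal sketch of such an assert, assuming the evaluation step produces a BLEU score; the threshold value, function name, and dataset label below are illustrative assumptions, not the pipeline's actual code:

```python
# Sketch of the proposed sanity check in the evaluation script.
MIN_BLEU = 5.0  # assumed threshold: scores below this almost certainly indicate a bug


def assert_sane_metrics(bleu: float, dataset: str) -> None:
    # Abort the pipeline early instead of wasting GPU time on a broken model
    assert bleu >= MIN_BLEU, (
        f"BLEU {bleu:.1f} on {dataset} is below the sanity threshold "
        f"{MIN_BLEU}; stopping to prevent further training on a buggy model"
    )


assert_sane_metrics(bleu=32.4, dataset="dev-set")  # passes; low scores raise AssertionError
```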
I see that bicleaner-ai takes more than 36 hours for some large datasets even on a pretty good GPU. This really depends on the GPU model on the HPC. Maybe if it's an A100...
This issue is important only for HPC training, where we don't want jobs to be too small, so we have to group them. It is even beneficial to have smaller...
1. Better integrate with the pipeline settings
2. Automatically discover models in MODELS_DIR (see the sketch after this list)
3. Remove the intermediate file
4. Do not require restarting the script when a new model was...
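A minimal sketch of items 2 and 4, assuming models live as subdirectories of MODELS_DIR containing a Marian `*.npz` checkpoint; the polling interval and helper names are assumptions, not the script's actual implementation:

```python
import os
import time
from pathlib import Path

# Assumed layout: MODELS_DIR/<model-name>/<checkpoint>.npz
MODELS_DIR = Path(os.environ.get("MODELS_DIR", "models"))


def discover_models() -> dict[str, Path]:
    # A directory counts as a model if it contains at least one .npz checkpoint
    return {
        d.name: d
        for d in MODELS_DIR.iterdir()
        if d.is_dir() and any(d.glob("*.npz"))
    }


known: dict[str, Path] = {}
while True:
    current = discover_models()
    for name in current.keys() - known.keys():
        print(f"New model discovered: {name}")  # load it here instead of restarting
    known = current
    time.sleep(30)  # re-scan periodically so new models are picked up automatically
```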
Ulrich:

> The SentencePiece tokenizer should probably be trained with a custom normalization table (see the SentencePiece documentation) that removes soft hyphens in addition to the existing normalization steps. It requires...
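A hedged sketch of how this could look, assuming the `nmt_nfkc.tsv` rule table is copied from the SentencePiece source tree (`data/nmt_nfkc.tsv`); the file names and vocab size below are assumptions:

```python
import sentencepiece as spm

# Start from the existing nmt_nfkc normalization rules (assumed to be
# copied from the SentencePiece repo) so we extend rather than replace them.
with open("nmt_nfkc.tsv", encoding="utf-8") as f:
    rules = f.read()

# TSV format: space-separated source codepoints (hex), tab, target codepoints.
# An empty target deletes the character, so this removes U+00AD (soft hyphen)
# on top of the existing normalization steps.
rules += "00AD\t\n"

with open("nmt_nfkc_no_soft_hyphen.tsv", "w", encoding="utf-8") as f:
    f.write(rules)

spm.SentencePieceTrainer.train(
    input="corpus.txt",        # assumed training corpus
    model_prefix="vocab",
    vocab_size=32000,          # assumed vocab size
    normalization_rule_tsv="nmt_nfkc_no_soft_hyphen.tsv",  # custom table replaces the default
)
```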
## Current Behavior

I'm uploading a large number of vectors to an mmap collection with indexing disabled and getting `timeout: The read operation timed out`. I managed to upload 25,667,264 160-dimensional...
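A sketch of a workaround to try with the Python `qdrant-client`: raise the client-side timeout and upload in smaller, parallel batches. The host, file name, and collection name are assumptions for illustration:

```python
import numpy as np
from qdrant_client import QdrantClient

# Larger client-side timeout (seconds) so slow requests are not cut off
client = QdrantClient(host="localhost", port=6333, timeout=300)

# 25,667,264 x 160 float32 vectors, memory-mapped instead of loaded into RAM
vectors = np.memmap("vectors.f32", dtype=np.float32, mode="r",
                    shape=(25_667_264, 160))

client.upload_collection(
    collection_name="my_collection",
    vectors=vectors,
    batch_size=256,  # smaller batches keep each request well under the timeout
    parallel=4,      # parallel workers compensate for the smaller batch size
)
```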