Evgeny Pavlov

Results 185 issues of Evgeny Pavlov

### Proposal

I expect new and updated models to pop up more often. There are our retraining efforts, the consortium is training new models and some third-party organizations also do...

enhancement

300M dataset, 128 GB RAM. The workaround is to shuffle the dataset after the merge step, disable `--shuffle-in-ram`, and use `--shuffle batches`.

bug
optimization

I continue testing the pipeline and I see that almost all teacher models don't continue training even after I increased patience by setting `early-stopping: 20`. Currently, continuation happens by training...

bug
quality

We need this to prevent further training if there is a bug. We can add an assert to the evaluation script. It will check that metrics are higher than some...

enhancement
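
The assert proposed above could look roughly like the following. This is a minimal sketch only: the function name `check_metric`, the metric name, and the threshold value are hypothetical and do not come from the pipeline's actual evaluation script.

```python
def check_metric(name: str, value: float, minimum: float) -> None:
    """Stop the run if an evaluation metric is not above a sanity threshold.

    `minimum` is a hypothetical, user-chosen lower bound; a score at or
    below it is treated as a sign of a bug rather than a weak model.
    """
    if not value > minimum:
        raise AssertionError(
            f"{name} = {value:.2f} is not higher than the minimum {minimum:.2f}; "
            "stopping before further training"
        )


# Example: a healthy score passes silently.
check_metric("bleu", 25.0, 5.0)
```

Raising from the evaluation step makes that job fail, so any training steps that depend on it would not start.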

I see that bicleaner-ai takes more than 36 hours for some large datasets on a pretty good GPU. This really depends on the GPU model on the HPC. Maybe it's A100...

HPC

This issue is important only for HPC training where we don't want jobs to be too small, so we have to group them. It is even beneficial to have smaller...

HPC

1. Better integrate with the pipeline settings
2. Automatically discover models in MODELS_DIR
3. Remove intermediate file
4. Do not require restarting the script when a new model was...

enhancement

Ulrich:

> The SentencePiece tokenizer should probably be trained with a custom normalization table (see the SentencePiece documentation) that removes soft hyphens in addition to the existing normalization steps. It requires...

good first issue
quality

## Current Behavior

I'm uploading a large number of vectors to an mmap collection with indexing disabled and getting `timeout: The read operation timed out`. I managed to upload 25667264 160-dimensional...

bug