Evgeny Pavlov issues

Results: 185 issues of Evgeny Pavlov

Automatic updates of model registry

### Proposal

I expect new and updated models to pop up more often. There are our retraining efforts, the consortium is training new models and some third-party organizations also do...

enhancement

Out of memory on shuffling huge datasets

300M dataset, 128 GB RAM. The workaround is to shuffle the dataset after the merge step, disable `--shuffle-in-ram`, and use `--shuffle batches`.

bug

optimization
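For the shuffling issue above, here is a minimal sketch of how a line-based corpus could be shuffled once after the merge step without holding it all in RAM. It is a generic two-pass, disk-backed shuffle, not the pipeline's or the trainer's own implementation. The function name `shuffle_large_file` and the `num_buckets` parameter are hypothetical, and it assumes one training example per line (for a parallel corpus, source and target would have to share a line, for instance tab-separated).

```python
import os
import random
import tempfile


def shuffle_large_file(src_path, dst_path, num_buckets=16, seed=None):
    """Shuffle the lines of src_path into dst_path, one bucket in RAM at a time."""
    rng = random.Random(seed)
    with tempfile.TemporaryDirectory() as tmp_dir:
        bucket_paths = [
            os.path.join(tmp_dir, f"bucket_{i}") for i in range(num_buckets)
        ]
        buckets = [open(p, "w", encoding="utf-8") for p in bucket_paths]
        try:
            # Pass 1: scatter every line into a randomly chosen bucket file.
            with open(src_path, encoding="utf-8") as src:
                for line in src:
                    if not line.endswith("\n"):
                        line += "\n"  # keep the last line separable
                    rng.choice(buckets).write(line)
        finally:
            for bucket in buckets:
                bucket.close()

        # Pass 2: shuffle each bucket in memory and append it to the output.
        with open(dst_path, "w", encoding="utf-8") as dst:
            for path in bucket_paths:
                with open(path, encoding="utf-8") as bucket:
                    lines = bucket.readlines()
                rng.shuffle(lines)
                dst.writelines(lines)
```

Peak memory is roughly the size of the largest bucket, so `num_buckets` would be chosen so that one bucket fits comfortably in the available RAM.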

Teacher does not continue training after pretraining on augmented corpus

I continue testing the pipeline and I see that almost all teacher models don't continue training even after I increased patience by setting `early-stopping: 20`. Currently, continuation happens by training...

bug

quality

Do not continue training if evaluation quality is too low

We need this to prevent further training if there is a bug. We can add an assert to the evaluation script. It will check that metrics are higher than some...

enhancement
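A minimal sketch of the kind of check the evaluation issue above describes, assuming the evaluation script already has its metrics in a dictionary. The function name `check_min_quality`, the metric name `bleu`, and the numbers are placeholders; the issue text does not state the actual threshold.

```python
def check_min_quality(metrics, thresholds):
    """Fail with AssertionError if any metric is missing or below its minimum."""
    for name, minimum in thresholds.items():
        assert name in metrics, f"metric {name!r} missing from evaluation output"
        value = metrics[name]
        assert value >= minimum, (
            f"{name}={value:.2f} is below the minimum {minimum:.2f}; "
            "stopping before further training"
        )


if __name__ == "__main__":
    # Placeholder numbers for illustration only.
    check_min_quality({"bleu": 31.4}, {"bleu": 15.0})
```

Note that `assert` statements are skipped when Python runs with `-O`, so an explicit `if ...: raise` may be the safer form if the script can be run that way.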

Bicleaner won't work on HPC because of time limits

I see that bicleaner-ai takes more than 36 hours for some large datasets on a pretty good GPU. This really depends on the GPU model on the HPC. Maybe it's A100...

HPC

Group jobs request too many cores on slurm

This issue is important only for HPC training where we don't want jobs to be too small, so we have to group them. It is even beneficial to have smaller...

HPC

Improve tensorboard

1. Better integrate with the pipeline settings
2. Automatically discover models in MODELS_DIR
3. Remove the intermediate file
4. Do not require restarting the script when a new model was...

enhancement
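For the second item in the tensorboard issue above (automatic discovery of models in MODELS_DIR), a sketch of one possible approach is to scan the directory tree for training logs. The directory layout, the log file name `train.log`, and the function name `discover_train_logs` are assumptions made for illustration, not the pipeline's actual structure.

```python
from pathlib import Path


def discover_train_logs(models_dir, log_name="train.log"):
    """Map each model directory (relative to models_dir) to its training log."""
    root = Path(models_dir)
    return {
        str(log.parent.relative_to(root)): log
        for log in sorted(root.rglob(log_name))
    }
```

Calling this on every refresh, rather than once at startup, would let newly added models show up without restarting the script.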

Handle soft hyphens with custom normalization tables

Ulrich:

> The SentencePiece tokenizer should probably be trained with a custom normalization table (see the SentencePiece documentation) that removes soft hyphens in addition to the existing normalization steps. It requires...

good first issue

quality
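To illustrate the effect the soft-hyphen issue above is after, here is a small sketch that strips soft hyphens (U+00AD) from text. It does this as a plain preprocessing step, which is a different technique from the custom SentencePiece normalization table suggested in the quote; the function name `strip_soft_hyphens` is hypothetical.

```python
SOFT_HYPHEN = "\u00ad"  # U+00AD SOFT HYPHEN, usually invisible when rendered


def strip_soft_hyphens(text):
    """Remove every soft hyphen; ordinary hyphens are left untouched."""
    return text.replace(SOFT_HYPHEN, "")
```

A preprocessing step like this would have to be applied identically at training and inference time, which is one reason a normalization table built into the tokenizer model may be preferable.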

Move artifacts from S3 to GCS

enhancement

Read time out on uploading a large collection with mmap

## Current Behavior

I'm uploading a large number of vectors to an mmap collection with indexing disabled and getting `timeout: The read operation timed out`. I managed to upload 25667264 160-dimensional...

bug