yurakuratov
Hi! We followed the [Megatron-LM BERT data pipeline](https://github.com/NVIDIA/Megatron-LM#data-preprocessing) for pretraining. We trained the tokenizer with `BpeTrainer` from [HF Tokenizers](https://huggingface.co/docs/tokenizers/index). The actual data pre-processing code we used currently lives at [https://github.com/yurakuratov/t5-experiments/tree/expr/genomes/megatron](https://github.com/yurakuratov/t5-experiments/tree/expr/genomes/megatron)....
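For reference, training a BPE tokenizer with HF Tokenizers looks roughly like the sketch below. The vocab size, special tokens, pre-tokenizer, and file paths here are illustrative assumptions, not the exact GENA-LM settings; see the repo linked above for the actual code.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Build an empty BPE tokenizer with an unknown-token fallback.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
# Split on whitespace before BPE merges are learned (assumption: one sequence per line).
tokenizer.pre_tokenizer = Whitespace()

# Vocab size and BERT-style special tokens are placeholders here.
trainer = BpeTrainer(
    vocab_size=32_000,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)

# "sequences.txt" is a hypothetical plain-text corpus file.
tokenizer.train(files=["sequences.txt"], trainer=trainer)
tokenizer.save("tokenizer.json")
```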
> One thing I'm curious about in your dataloading pipeline in general for genomics, do you typically load data on the fly and tokenize, or do you tokenize the entire...
BTW, the BigBird model is public now: https://huggingface.co/AIRI-Institute/gena-lm-bigbird-base-t2t
Thank you for your interest! The [gena-lm-bert-base](https://huggingface.co/AIRI-Institute/gena-lm-bert-base) train set is about 500k documents. The [gena-lm-bigbird-base-t2t](https://huggingface.co/AIRI-Institute/gena-lm-bigbird-base-t2t) train set is about 8.5M documents, making ~425B base pairs / characters in total. We followed BigBird...