yurakuratov
Hi! We followed the [Megatron-LM BERT data pipeline](https://github.com/NVIDIA/Megatron-LM#data-preprocessing) for pretraining. We trained the tokenizer with `BpeTrainer` from [HF Tokenizers](https://huggingface.co/docs/tokenizers/index). The actual data pre-processing code we used currently lives at [https://github.com/yurakuratov/t5-experiments/tree/expr/genomes/megatron](https://github.com/yurakuratov/t5-experiments/tree/expr/genomes/megatron)....
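For reference, training a BPE tokenizer with HF Tokenizers looks roughly like the sketch below. The vocab size, special tokens, pre-tokenizer, and file paths here are illustrative assumptions, not the exact GENA-LM settings; see the repo linked above for the actual code.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Build an empty BPE tokenizer with an unknown-token fallback.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
# Split on whitespace before BPE merges are learned (assumption: one sequence per line).
tokenizer.pre_tokenizer = Whitespace()

# Vocab size and BERT-style special tokens are placeholders here.
trainer = BpeTrainer(
    vocab_size=32_000,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)

# "sequences.txt" is a hypothetical plain-text corpus file.
tokenizer.train(files=["sequences.txt"], trainer=trainer)
tokenizer.save("tokenizer.json")
```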
> One thing I'm curious about in your dataloading pipeline in general for genomics, do you typically load data on the fly and tokenize, or do you tokenize the entire...
BTW, the BigBird model is public now: https://huggingface.co/AIRI-Institute/gena-lm-bigbird-base-t2t
Thank you for your interest! The [gena-lm-bert-base](https://huggingface.co/AIRI-Institute/gena-lm-bert-base) train set is about 500k documents. The [gena-lm-bigbird-base-t2t](https://huggingface.co/AIRI-Institute/gena-lm-bigbird-base-t2t) train set is about 8.5M documents, making ~425B base pairs / characters in total. We followed BigBird...