HUSEIN ZOLKEPLI
I have my own pretrained Pegasus model, and now I want to finetune it using BigBird, so this is my mapping function,

```python
import re
import collections

def get_assignment_map_from_checkpoint(tvars, init_checkpoint):
    """Compute the...
```
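For context, a minimal sketch of how the full helper usually looks, following the well-known BERT-style implementation; it only maps variables whose names match the checkpoint exactly, so any Pegasus-to-BigBird renaming rule (an assumption here, not shown) would have to rewrite `name` before the lookup:

```python
import re
import collections
import tensorflow as tf

def get_assignment_map_from_checkpoint(tvars, init_checkpoint):
    """Compute the union of the current variables and checkpoint variables."""
    name_to_variable = collections.OrderedDict()
    for var in tvars:
        name = var.name
        m = re.match('^(.*):\\d+$', name)
        if m is not None:
            name = m.group(1)  # drop the ':0' suffix from the graph variable name
        name_to_variable[name] = var

    assignment_map = collections.OrderedDict()
    initialized_variable_names = {}
    # list_variables returns (name, shape) pairs stored in the checkpoint
    for name, _ in tf.train.list_variables(init_checkpoint):
        if name not in name_to_variable:
            continue  # checkpoint variable has no counterpart in the current graph
        assignment_map[name] = name
        initialized_variable_names[name] = 1
        initialized_variable_names[name + ':0'] = 1

    return assignment_map, initialized_variable_names
```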
Rules-based Malay normalization -> ms-en noisy trained translation -> standard en -> en-ms translation.

1. Rules-based Malay normalization from `malaya.normalize`.
2. ms-en noisy model, google translate is really...
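A rough sketch of how the three stages could be chained into one back-translation pass; the callables `rules_normalize`, `noisy_ms_en`, and `en_ms` are hypothetical placeholders standing in for the actual normalizer and translation models, not real API names:

```python
from typing import Callable

def back_translate(
    text_ms: str,
    rules_normalize: Callable[[str], str],  # e.g. a malaya.normalize rules-based normalizer
    noisy_ms_en: Callable[[str], str],      # noisy Malay -> standard English translation model
    en_ms: Callable[[str], str],            # standard English -> standard Malay translation model
) -> str:
    """Normalize noisy Malay, translate it to English, then translate back to standard Malay."""
    normalized = rules_normalize(text_ms)  # 1. rules-based cleanup of noisy spelling
    standard_en = noisy_ms_en(normalized)  # 2. noisy ms-en model produces standard English
    return en_ms(standard_en)              # 3. en-ms model produces standard Malay
```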
1. I tried a vanilla PyTorch training loop using bfloat16, but the loss overflowed, https://github.com/mesolitica/malaya/blob/5.1/pretrained-model/mamba/causallm-130m-bf16.ipynb
2. So I tried a vanilla PyTorch training loop using fp32, and the loss is ok, https://github.com/mesolitica/malaya/blob/5.1/pretrained-model/mamba/causallm-130m-fp32.ipynb
3. ...
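A minimal sketch of the precision difference between the two runs, assuming a HuggingFace-style causal LM whose forward pass returns `.loss`; the gradient-clipping value is an assumption, and only the autocast context changes between the bf16 and fp32 variants:

```python
import torch

def train_step(model, batch, optimizer, use_bf16: bool = True):
    """One optimizer step, either under bfloat16 autocast or in plain fp32."""
    optimizer.zero_grad(set_to_none=True)
    if use_bf16:
        # bf16 autocast: activations in bfloat16, master weights stay in fp32
        with torch.autocast(device_type='cuda', dtype=torch.bfloat16):
            loss = model(**batch).loss
    else:
        loss = model(**batch).loss  # plain fp32 forward pass
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # guard against loss spikes
    optimizer.step()
    return loss.item()
```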
- directory, https://github.com/huseinzol05/malaysian-dataset/tree/master/crawl/gov.my/website
- dataset, https://huggingface.co/datasets/mesolitica/crawl-gov.my/resolve/main/malaysia.travel.parsed.json
Notebook https://github.com/huseinzol05/malaysian-dataset/blob/master/prepare-llm/calculate-size.ipynb,

```python
from bs4 import BeautifulSoup
import requests
import re
import json

r = requests.get('https://github.com/users/huseinzol05/projects/1/views/1')
soup = BeautifulSoup(r.content, "lxml")
data = json.loads(soup.find('script', {'id': 'memex-items-data'}).contents[0])
len(data)

parsed = []
for...
```
Fasttext model trained on,

```
lang_labels_v2 = {
    0: 'standard-english',
    1: 'local-english',
    2: 'manglish',
    3: 'standard-indonesian',
    4: 'socialmedia-indonesian',
    5: 'standard-malay',
    6: 'local-malay',
    7: 'standard-mandarin',
    8: 'local-mandarin',
    9: 'other',
}
```
...
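A minimal sketch of how such a fastText classifier is typically trained and queried; the file names and hyperparameters below are assumptions, and the training file is expected in the usual `__label__<class> text` format:

```python
import fasttext

# each line of train.txt looks like: "__label__local-malay ko nak pergi mana ni"
model = fasttext.train_supervised(
    input='train.txt',  # assumed path to the prepared training file
    dim=16,             # assumed hyperparameters, not the actual ones used
    epoch=5,
    wordNgrams=2,
)

labels, probs = model.predict('weekend ni nak lepak mana sis')
print(labels[0], probs[0])  # illustrative output, e.g. ('__label__local-malay', 0.93)
model.save_model('fasttext-language-detection-v2.bin')  # assumed output file name
```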
- token counts based on 32k vocab size: 11,607,626,930
- data size: 49.5 GB
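A minimal sketch of how these numbers can be reproduced, assuming a 32k-vocab tokenizer loadable via Hugging Face `transformers` and a directory of `.jsonl` files with a `text` field (both are assumptions about the layout):

```python
import os
import json
from glob import glob
from transformers import AutoTokenizer

# 'tokenizer-path' is a placeholder for whichever 32k-vocab tokenizer was used
tokenizer = AutoTokenizer.from_pretrained('tokenizer-path')

total_tokens = 0
total_bytes = 0
for path in glob('data/*.jsonl'):  # assumed layout: one JSON object per line
    total_bytes += os.path.getsize(path)
    with open(path) as f:
        for line in f:
            text = json.loads(line)['text']
            total_tokens += len(tokenizer(text)['input_ids'])

print(f'token count: {total_tokens:,}')
print(f'data size: {total_bytes / 1e9:.1f} GB')
```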
Based on this training script https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/zipformer/train.py