metadata
Experiments on including metadata such as URLs, timestamps, website descriptions and HTML tags during pretraining.
- Add full perplexity evaluation into eval loop.
- Minor training code changes I have been using.
- Make sure tokenizer is saved.
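For readers unfamiliar with the metric, here is a minimal sketch of how perplexity is commonly computed in an evaluation loop: the exponential of the mean per-token negative log-likelihood. The function names and the `(sum_nll, n_tokens)` batch format are assumptions for illustration, not the repository's actual code.

```python
import math


def perplexity(token_nlls):
    """Perplexity from per-token negative log-likelihoods (in nats)."""
    if not token_nlls:
        raise ValueError("need at least one token")
    return math.exp(sum(token_nlls) / len(token_nlls))


def corpus_perplexity(batches):
    """Perplexity over a whole evaluation set.

    `batches` is an iterable of (sum_nll, n_tokens) pairs, one per batch
    (a hypothetical format). Summing before dividing weights every token
    equally, regardless of batch size.
    """
    total_nll = 0.0
    total_tokens = 0
    for sum_nll, n_tokens in batches:
        total_nll += sum_nll
        total_tokens += n_tokens
    if total_tokens == 0:
        raise ValueError("need at least one token")
    return math.exp(total_nll / total_tokens)
```

Accumulating token-weighted sums across batches, rather than averaging per-batch perplexities, avoids giving short batches more influence than long ones.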
fairer without metadata. If you keep the metadata_probability the same, the idea is to give the same context in each example in a without_metadata run.
This PR is for visibility only; it's not intended to be merged.
Hey @cccntu, I checked the stuff. I don't know if I have to add test results somewhere or if this commit is enough.
At the moment, only one option is available: install all dependencies for preprocessing or don't install them. A user might want to install only the dependencies for one type of...
As discussed a long time ago in a meeting, it would be really great if we had a feature to save the model and stop training after a certain time...
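One way such a feature could look, as a rough sketch only: check a wall-clock budget before each step, and save a checkpoint before exiting. The names `train_with_time_limit`, `train_step`, `save_checkpoint`, and `max_seconds` are hypothetical and do not come from the repository.

```python
import time


def train_with_time_limit(train_step, save_checkpoint, max_steps, max_seconds,
                          clock=time.monotonic):
    """Run up to `max_steps` steps, stopping early once `max_seconds` elapse.

    `train_step(step)` and `save_checkpoint(steps_done)` are caller-supplied
    callables (assumed interfaces). Returns (steps_done, timed_out).
    """
    start = clock()
    steps_done = 0
    for step in range(max_steps):
        # Check the budget before starting a step so a step is never cut short.
        if clock() - start >= max_seconds:
            save_checkpoint(steps_done)
            return steps_done, True
        train_step(step)
        steps_done += 1
    # Normal completion: still save the final state.
    save_checkpoint(steps_done)
    return steps_done, False
```

A monotonic clock is used so that system clock adjustments cannot shorten or extend the budget. In practice the budget would be set somewhat below the job's hard time limit to leave room for the final step and the save itself.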
Steps:
- [x] pseudo crawl ~10% of C4 web pages from Common Crawl @tianjianjiang
- [x] import pseudo crawled dataset on JZ @SaulLu
- [x] run 1st step of extraction:...