metadata icon indicating copy to clipboard operation
metadata copied to clipboard

Experiments on including metadata such as URLs, timestamps, website descriptions and HTML tags during pretraining.

Results 38 metadata issues
Sort by recently updated
recently updated
newest added

- Add full perplexity evaluation into eval loop. - Minor Training code changes I have been using. - Make sure tokenizer is saved.

fairer without metadata. If you keep the metadata_probability the same, the idea is to give the same context in each example in a without_metadata run.

this PR is for visibility only, it's not intended to be merged.

Hey @cccntu, I checked the stuff. I don't know if I have to add test results somewhere or if this commit is enough.

At the moment, only one option is available: install all dependencies for preprocessing or don't install them. A user might want to install only the dependencies for one type of...

enhancement

As discussed a long time ago in a meeting it would be really great if we had a feature to save the model and stop training after a certain time...

Steps: - [x] pseudo crawl ~10% of C4 web page from Common Crawl @tianjianjiang - [x] import pseudo crawled dataset on JZ @SaulLu - [x] run 1st step of extraction:...

#dataset
Epic