metadata
Experiments on including metadata such as URLs, timestamps, website descriptions and HTML tags during pretraining.
When finished, share with Christopher and Shanya.
Perplexity (ppl) on a website-specific test set. Contact @cccntu and Christopher.
Imitating https://huggingface.co/docs/accelerate/quicktour.html#training-on-tpu. I also see that there is a padding procedure in https://github.com/bigscience-workshop/metadata/blob/master/bsmetadata/experiments/sample.py#L16. Not really sure: - what we want here for both TPU and GPU (which can be longest...
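To make the padding question concrete, here is a minimal sketch of the two usual options: padding each batch to its longest sequence (typically fine on GPU) versus padding every batch to one fixed length (generally preferred on TPU, since changing tensor shapes trigger XLA recompilation). The helper `pad_batch` and its parameters are hypothetical and written in plain Python for illustration; this is not the procedure in `sample.py`.

```python
from typing import Dict, List, Optional


def pad_batch(
    sequences: List[List[int]],
    pad_token_id: int = 0,
    max_length: Optional[int] = None,
) -> Dict[str, List[List[int]]]:
    """Pad token-id lists to a common length (hypothetical helper).

    max_length=None -> pad to the longest sequence in this batch (dynamic shape).
    max_length=N    -> truncate/pad every sequence to exactly N (static shape).
    """
    target = max_length if max_length is not None else max(len(s) for s in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        seq = seq[:target]  # truncate anything longer than the target length
        n_pad = target - len(seq)
        input_ids.append(seq + [pad_token_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 0 marks padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = [[5, 6, 7], [8]]
gpu_style = pad_batch(batch)                 # padded to longest, length 3
tpu_style = pad_batch(batch, max_length=4)   # padded to fixed length 4
```

A single code path could then switch between the two modes by passing `max_length` only when running on TPU.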
When done, e-mail @VictorSanh. Find the template here: https://github.com/bigscience-workshop/metadata/tree/master/experiments/jz/templates/SLURM
While testing the real data extraction, I encountered a new problem: website descriptions are rarely present in the `metadata_website_desc` column. Therefore, the datasets library cannot load such a dataset...
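One possible workaround, sketched under assumptions: if the failure comes from records where `metadata_website_desc` is missing or null, each JSON line could be normalized before loading so that the column always exists with the same container type. The helper `normalize_records` is hypothetical, and the assumption that the column holds a list of entries is mine. Whether this is enough depends on the actual loading error, which is not spelled out here.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def normalize_records(
    lines: Iterable[str],
    column: str = "metadata_website_desc",
) -> Iterator[Dict[str, Any]]:
    """Yield parsed JSON records with `column` always present (hypothetical helper).

    Assumption: the column normally holds a list of metadata entries, so a
    missing or null value is replaced by an empty list.
    """
    for line in lines:
        record = json.loads(line)
        if record.get(column) is None:  # covers both absent key and explicit null
            record[column] = []
        yield record


raw = [
    '{"text": "a"}',
    '{"text": "b", "metadata_website_desc": null}',
    '{"text": "c", "metadata_website_desc": [{"value": "a news site"}]}',
]
rows = list(normalize_records(raw))
```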