
Experiments on including metadata such as URLs, timestamps, website descriptions and HTML tags during pretraining.

Results 38 metadata issues

When finished, share with Christopher and Shanya.

Perplexity (ppl) on a website-specific test set. Contact @cccntu and Christopher.

Imitating https://huggingface.co/docs/accelerate/quicktour.html#training-on-tpu
I also see that there is a padding procedure in https://github.com/bigscience-workshop/metadata/blob/master/bsmetadata/experiments/sample.py#L16
Not really sure:
- what we want here for both TPU and GPU (which can be longest...
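To make the padding question above concrete, here is a minimal sketch in plain Python, not the code from `sample.py`. It assumes the usual trade-off: on GPU, padding each batch to its longest sequence wastes the fewest tokens, while on TPU a fixed length keeps tensor shapes constant across batches. The helper name `pad_batch` and its parameters are hypothetical.

```python
from typing import List, Optional


def pad_batch(
    sequences: List[List[int]],
    pad_id: int = 0,
    max_length: Optional[int] = None,
) -> List[List[int]]:
    """Pad (and, if needed, truncate) token-id sequences to a common length.

    max_length=None pads to the longest sequence in the batch (GPU-style).
    max_length=N pads or truncates every sequence to N (TPU-style fixed shape).
    """
    # Assumption: the batch is non-empty; max() raises ValueError otherwise.
    target = max_length if max_length is not None else max(len(s) for s in sequences)
    padded = []
    for seq in sequences:
        truncated = seq[:target]  # no-op when the sequence already fits
        padded.append(truncated + [pad_id] * (target - len(truncated)))
    return padded


batch = [[5, 6, 7], [8, 9]]
gpu_batch = pad_batch(batch)                # padded to the longest sequence
tpu_batch = pad_batch(batch, max_length=6)  # padded to a fixed length
```

With a real tokenizer, the same choice is usually expressed through its padding options rather than a hand-written helper.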

enhancement

When done, e-mail @VictorSanh. Find the template here: https://github.com/bigscience-workshop/metadata/tree/master/experiments/jz/templates/SLURM

While testing the real data extraction, I encountered a new problem: the website descriptions are rarely present in the `metadata_website_desc` column. Therefore, the datasets library cannot load such a dataset...
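One possible workaround, sketched under assumptions: if the loading failure comes from the column being missing or null in most records, giving every record an explicit empty value before writing the files keeps the schema consistent. The function `normalize_records` and the shape of the description entries below are hypothetical and not taken from the repository.

```python
import json
from typing import Any, Dict, Iterable, List, Sequence

# Columns that may be absent from most records (assumption: list-valued).
SPARSE_COLUMNS = ("metadata_website_desc",)


def normalize_records(
    records: Iterable[Dict[str, Any]],
    columns: Sequence[str] = SPARSE_COLUMNS,
) -> List[Dict[str, Any]]:
    """Return copies of the records in which every sparse column is present.

    A missing or None value is replaced by an empty list, so all rows
    share the same set of keys.
    """
    normalized = []
    for record in records:
        fixed = dict(record)  # shallow copy; the input is left untouched
        for column in columns:
            if fixed.get(column) is None:
                fixed[column] = []
        normalized.append(fixed)
    return normalized


raw = [
    {"text": "a"},
    {
        "text": "b",
        # Hypothetical entry shape, for illustration only.
        "metadata_website_desc": [{"key": "website_description", "value": "A site"}],
    },
]
rows = normalize_records(raw)
jsonl = "\n".join(json.dumps(row) for row in rows)  # one JSON object per line
```

Whether an empty list is enough for type inference depends on how the dataset is loaded; declaring the schema explicitly at load time may still be needed.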