metadata
Experiments on including metadata such as URLs, timestamps, website descriptions and HTML tags during pretraining.
I propose in this PR to change the way in which local tags are added to the text.

## Motivation

HTML defines a tree. In this sense, it is important...
# Known Issues

~~I haven't really started debugging issues below.~~

## CPU/GPU

This is caused by a breaking change in `torch 1.9.0`, so downgrading to `torch 1.8.1` (or perhaps 1.8.2)...
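To make the version constraint above easy to check, here is a minimal sketch of a guard that tells whether a given torch version string predates 1.9.0. The helper names `parse_version` and `predates_torch_1_9` are hypothetical and not part of this repository; the sketch only compares version numbers and does not detect the CPU/GPU issue itself.

```python
import re
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Extract (major, minor, patch) from strings such as '1.9.0+cu111'."""
    match = re.match(r"^(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor, patch = match.groups()
    # A missing patch component (e.g. '1.9') is treated as 0.
    return int(major), int(minor), int(patch or 0)


def predates_torch_1_9(version: str) -> bool:
    """Return True when the version is older than 1.9.0."""
    return parse_version(version) < (1, 9, 0)
```

With torch installed, `predates_torch_1_9(torch.__version__)` could be checked at startup to warn when the environment is on 1.9.0 or later.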
The dataset can be downloaded here:

```sh
wget https://huggingface.co/datasets/ttj/metadata_arxiv/resolve/main/v0.jsonl
```

I haven't hooked it up with `input_pipeline`, so it's not runnable now. Because I can't decide right now whether to...
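Until the dataset is wired into `input_pipeline`, the downloaded file can be inspected on its own. The sketch below assumes only that `v0.jsonl` follows the JSON Lines convention (one JSON value per line); the record schema is not stated here, so no field names are assumed. The helpers `iter_jsonl` and `peek` are hypothetical and not part of this repository.

```python
import json
from itertools import islice
from pathlib import Path
from typing import Any, Iterator, List, Union


def iter_jsonl(path: Union[str, Path]) -> Iterator[Any]:
    """Yield one parsed JSON value per non-empty line of a JSON Lines file."""
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines instead of failing on them
                yield json.loads(line)


def peek(path: Union[str, Path], n: int = 3) -> List[Any]:
    """Return the first n records so the schema can be inspected by hand."""
    return list(islice(iter_jsonl(path), n))
```

After the download, `peek("v0.jsonl", n=1)` would return the first record, which is enough to see which metadata fields each example carries.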