Long-Context-Data-Engineering
Upsampling: Statistical biases in the dataset distribution
I think there are some statistical biases in this implementation of long-context data engineering.
Concern 1:
In upsample mode, some dataset groups get filtered out once their capacity is maxed out. E.g., for --down_sample_mode=upsample_code_arxiv_book, the code, arxiv, and book datasets will end up mostly at the end of the created synthetic dataset.
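A minimal toy simulation of this effect (all names and capacities here are hypothetical, not taken from the repo): if samples are drawn from a mixed stream and each group is skipped once its per-group capacity is exhausted, groups with small capacities fill up early and the tail of the resulting dataset is dominated by the large-capacity (upsampled) groups.

```python
import random

random.seed(0)

# Hypothetical per-group capacities: the "upsampled" groups get
# larger quotas, mimicking upsample_code_arxiv_book.
capacity = {"code": 50, "arxiv": 50, "book": 50, "web": 10}
remaining = dict(capacity)
out = []

# Source stream: a uniform shuffled mix over all groups.
stream = [g for g in capacity for _ in range(100)]
random.shuffle(stream)

for g in stream:
    if remaining[g] > 0:   # skip groups whose capacity is maxed out
        remaining[g] -= 1
        out.append(g)

tail = out[-30:]
print({g: tail.count(g) for g in capacity})
# "web" exhausts its small quota early, so the tail of the synthetic
# dataset consists almost entirely of code/arxiv/book samples.
```

This is only a sketch of the ordering bias, not the repo's actual sampling code, but it shows why the upsampled groups cluster at the end unless the final dataset is reshuffled.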
Concern 2:
Start token_id 1. With the LLaMA tokenizer, every passage that is tokenized on its own starts with <s>, i.e. token_id 1. Concatenating separately pre-tokenized texts is therefore not the same as concatenating the strings first and then tokenizing them together.
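To avoid a model download, here is a toy stand-in tokenizer (purely illustrative, not the real LlamaTokenizer API) that mimics only the relevant behaviour, prepending a BOS id to every encoded passage:

```python
BOS = 1  # <s> has token_id 1 in the LLaMA vocabulary

def encode(text: str) -> list[int]:
    # Hypothetical per-character "tokenizer" that prepends BOS,
    # as the LLaMA tokenizer does for each separately encoded passage.
    return [BOS] + [ord(c) for c in text]

a, b = "foo", "bar"

concat_then_encode = encode(a + b)          # one BOS at the start
encode_then_concat = encode(a) + encode(b)  # spurious BOS mid-sequence

print(concat_then_encode)
print(encode_then_concat)
# The concatenation of pre-tokenized passages carries an extra BOS
# (token_id 1) in the middle, so it differs from tokenizing the
# joined string once.
```

With the real tokenizer the mismatch can be worse, since merges at the passage boundary can also differ; the extra BOS shown here is the part this issue points at.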