NEKO Better concatenation and individual metrics when using multiple text datasets

Open bhavul opened this issue 1 year ago • 5 comments

For the text task, when we have multiple datasets, the concatenation strategy could move to more sophisticated logic based on Hugging Face's dataset concatenation.

Further, we may wish to change the evaluation loop to also report metrics for each individual dataset, in addition to the average.

The text task looks good so far. I am curious about the choice here, and about what you think is the best way to handle multiple datasets. Are there speed benefits to concatenating the datasets, as this process does? If we had separate tasks, we would also want to calculate the total tokens for each task and decide how much of each batch comes from each task in proportion to its number of tokens, but we don't have to worry about this if we follow your procedure. There seems to be an edge case where the concatenation will not work if the columns are not named the same: https://huggingface.co/docs/datasets/process#concatenate
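For reference, the proportional split mentioned above (for the separate-tasks alternative) could look roughly like the sketch below. This is only an illustration, not NEKO code: `allocate_batch` and the token counts are hypothetical, and it uses largest-remainder rounding so the per-dataset counts always sum to the batch size.

```python
def allocate_batch(token_counts, batch_size):
    """Split batch_size across datasets in proportion to their token counts.

    token_counts: dict mapping a (hypothetical) dataset name to its total tokens.
    Returns a dict mapping each name to the number of samples it gets per batch.
    """
    total = sum(token_counts.values())
    if total <= 0:
        raise ValueError("token_counts must contain at least one positive count")

    # Ideal (fractional) share of the batch for each dataset.
    exact = {name: batch_size * n / total for name, n in token_counts.items()}
    # Start from the rounded-down shares.
    alloc = {name: int(share) for name, share in exact.items()}

    # Hand the leftover slots to the datasets with the largest fractional parts.
    leftover = batch_size - sum(alloc.values())
    by_remainder = sorted(exact, key=lambda name: exact[name] - alloc[name], reverse=True)
    for name in by_remainder[:leftover]:
        alloc[name] += 1
    return alloc


# Made-up token counts, purely for illustration.
example = allocate_batch({"wikitext": 100, "pile": 300}, batch_size=8)
```

With concatenation none of this bookkeeping is needed, since uniform sampling over the concatenated dataset already weights each source by its size (by sample count rather than token count, though).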

One thing that may be useful when multiple datasets are concatenated is to compute, during evaluation, metrics specific to each separate dataset. E.g., we would want a separate perplexity score for wikitext vs. the pile, not just the average over both. Potentially, after concatenating, we can keep the start and end indices of each dataset, e.g. the pile is 0 to 200mil and the other dataset is (200mil + 1) to 400mil, so we can tell which samples belong to which dataset and aggregate their metrics separately.
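A minimal sketch of the index-range idea, in plain Python and not tied to the current NEKO code. All names here (`build_offsets`, `dataset_of`, `per_dataset_perplexity`) are hypothetical, and it assumes the evaluation loop can report, for each sample, its global index in the concatenated dataset, its summed negative log-likelihood, and its token count. Ranges are half-open (`[start, end)`) to avoid off-by-one issues.

```python
import math
from bisect import bisect_right
from itertools import accumulate


def build_offsets(sizes):
    """sizes: dict of dataset name -> number of samples, in concatenation order.

    Returns (names, ends), where ends[i] is the exclusive end index of dataset i.
    """
    names = list(sizes)
    ends = list(accumulate(sizes[name] for name in names))
    return names, ends


def dataset_of(index, names, ends):
    """Map a global sample index back to the dataset it came from."""
    if index < 0 or index >= ends[-1]:
        raise IndexError(f"index {index} is outside the concatenated dataset")
    return names[bisect_right(ends, index)]


def per_dataset_perplexity(records, names, ends):
    """records: iterable of (global_index, summed_nll, num_tokens) tuples.

    Returns a dict of dataset name -> perplexity, i.e. exp(total NLL / total tokens).
    """
    nll = {name: 0.0 for name in names}
    tokens = {name: 0 for name in names}
    for index, loss_sum, num_tokens in records:
        name = dataset_of(index, names, ends)
        nll[name] += loss_sum
        tokens[name] += num_tokens
    return {name: math.exp(nll[name] / tokens[name]) for name in names if tokens[name] > 0}


# Toy sizes and losses, purely for illustration.
names, ends = build_offsets({"pile": 3, "wikitext": 2})
scores = per_dataset_perplexity([(0, 2.0, 2), (3, 0.0, 4)], names, ends)
```

Note that this only works if the concatenated dataset is not shuffled before the indices are recorded, or if each sample carries its original index through the shuffle.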

Another strategy is to track only the average during training and, after training finishes, load the model (e.g. in eval.py) and run it over each task separately, with text_datasets={the specific dataset you want your eval metrics over}. This may be inconvenient, though.

Originally posted by @daniellawson9999 in https://github.com/ManifoldRG/NEKO/pull/1#discussion_r1299509872

Oct 09 '23 04:10 bhavul
