
Make eval_lm.py work with .jsonl formatted data

Open · danielsimig opened this issue on Mar 31, 2022 · 1 comment

Currently eval_lm.py requires data in the legacy format (.bin and .idx files plus a dict.txt). This is annoying because all of my data is in the jsonl format, and pre-processing it into the legacy format would be a lot of work; it would be much better if I could directly measure perplexity on any of my current validation sets as-is.
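
(For context, here's a quick check of the layout I'm assuming throughout: one JSON object per line, with the document text under a `"text"` key. The key name is an assumption on my side, so adjust it if your shards use a different field.)

```python
import json

# Minimal sanity check of a validation shard, assuming one JSON object per
# line with the raw document under a "text" key (adjust the key if needed).
def check_jsonl(path, text_key="text", max_lines=1000):
    docs = chars = 0
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i >= max_lines:
                break
            record = json.loads(line)
            assert text_key in record, f"line {i} has no '{text_key}' field"
            docs += 1
            chars += len(record[text_key])
    print(f"{path}: {docs} docs sampled, {chars} characters")

check_jsonl("my_validation_set.jsonl")  # placeholder path
```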

I dug into the code a bit, and it seems the issue is that eval_lm instantiates a language_modeling task, while it is the streaming_language_modeling task that consumes jsonl. So I went ahead and minimally hacked the code to see if I could fix this up: https://fburl.com/phabricator/a2rrxmvp

It turns out that if I set all the flags the same way I'd set extra flags in model_configs.py, this actually works! Here is an example run: https://fburl.com/phabricator/6fy7j7cp
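
In case it helps anyone reproduce this: the hack boils down to passing eval_lm the same task-level flags that model_configs.py sets for streaming training. A small helper along these lines makes it easy to spot anything that drifted between the two flag sets (this is just a sketch; the dict contents below are placeholders, not the real configs):

```python
# Sketch: diff the overrides passed to eval_lm against the extra flags that
# model_configs.py sets for the same model, so mismatches (tokens per sample,
# tokenizer files, etc.) are easy to spot.
def diff_flags(eval_flags: dict, train_flags: dict) -> None:
    for key in sorted(set(eval_flags) | set(train_flags)):
        a = eval_flags.get(key, "<missing>")
        b = train_flags.get(key, "<missing>")
        if a != b:
            print(f"{key}: eval={a!r} vs train={b!r}")

diff_flags(
    {"task": "streaming_language_modeling", "tokens_per_sample": 2048},
    {"task": "streaming_language_modeling", "tokens_per_sample": 1024},
)
```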

The resulting perplexity, however, is way off from what I saw during training (< 50) and closer to the numbers I saw during the gpt3_eval run that I'm debugging.

I am looking for help from people actually familiar with the codebase. Is this a legitimate way to use eval_lm? Are the resulting numbers credible? That is, are the language_modeling and streaming_language_modeling tasks compatible in this way, or am I missing something?
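
For reference, here is the quick sanity check I'm running on the raw numbers in the meantime, in case the discrepancy is just a units or normalization mismatch. This is my own sketch (not from eval_lm itself), assuming the logged loss could be in either nats or bits and that training and eval may normalize by different token counts:

```python
import math

# Convert an average per-token loss into perplexity under both common bases.
# If eval_lm and the training logs use different bases (ln vs. log2), the
# "same" loss produces wildly different perplexities.
def ppl_from_loss(avg_loss):
    return {
        "ppl_if_loss_in_nats": math.exp(avg_loss),
        "ppl_if_loss_in_bits": 2 ** avg_loss,
    }

# Re-normalizing by a different token count (e.g. BPE tokens vs. words, or
# counting vs. skipping padding/EOS) also shifts perplexity:
def renormalize_ppl(ppl, old_token_count, new_token_count):
    total_nats = math.log(ppl) * old_token_count
    return math.exp(total_nats / new_token_count)

print(ppl_from_loss(3.8))  # ~44.7 if nats, ~13.9 if bits
```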

danielsimig · Mar 31 '22 22:03

This seems right - here's a paper trail of the effort to move evals over to the streaming LM task: https://github.com/fairinternal/metaseq-internal/issues/54

I can't speak to the eval results themselves, but replacing the language_modeling task with streaming LM should be the right move here, so that we don't have separate codepaths for evals vs. training.

suchenzang · Apr 01 '22 01:04