
Perplexity per sentence implementation?

Open BigBorg opened this issue 5 years ago • 1 comment

I need to get the perplexity (PPL) per sentence for millions of lines. Splitting them into files, each containing one sentence, would be too time-consuming. Is it possible to achieve this by modifying the data loader? For example, give the model input of shape (num_sentences, num_tokens, max_characters_per_token). The problem is how to pad sentences that don't have enough tokens. If this would work, would such padding affect the state carried into the next batch? If not, are there any other suggestions? (A sketch of the masking idea follows below.)
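
A minimal sketch of the padding-plus-masking idea, not bilm-tf's actual API: it assumes you already have per-token negative log-likelihoods from the model, and the names `token_nll` and `lengths` are hypothetical.

```python
import numpy as np

def per_sentence_ppl(token_nll, lengths):
    """token_nll: (num_sentences, num_tokens) per-token negative
    log-likelihoods, zero-padded past each sentence's end.
    lengths: (num_sentences,) true token counts before padding."""
    num_tokens = token_nll.shape[1]
    # Mask out padded positions so they do not contribute to the loss.
    mask = np.arange(num_tokens)[None, :] < lengths[:, None]
    sent_nll = (token_nll * mask).sum(axis=1) / lengths
    # Perplexity is exp of the mean NLL over the real tokens only.
    return np.exp(sent_nll)

# Example: two sentences padded to four tokens each.
nll = np.array([[2.1, 1.3, 0.9, 0.0],
                [1.7, 2.4, 0.0, 0.0]])
print(per_sentence_ppl(nll, np.array([3, 2])))
```

The mask keeps padded tokens out of the perplexity itself. Whether padding also affects the next batch depends on whether the model carries LSTM state across batches; if it does, padded time steps will still update that state, so resetting the state between batches (or at sentence boundaries) is worth considering.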

BigBorg avatar Jul 19 '19 05:07 BigBorg

[screenshot of modified code]

Add a `batch_losses` list and append each batch's (un-reduced) losses to it; that way you can get the per-sentence PPL for every batch.
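
A rough, self-contained sketch of what this comment describes; the values and the two-step loop are toy stand-ins, since in bilm-tf the per-sentence mean NLLs would come from an un-reduced loss tensor evaluated at each step.

```python
import numpy as np

# Hypothetical per-sentence mean NLLs produced by two evaluation steps.
step1 = np.array([1.43, 2.05])   # batch 1: two sentences
step2 = np.array([1.88])         # batch 2: one sentence

batch_losses = []                 # the list the comment suggests adding
for step_losses in (step1, step2):
    batch_losses.append(step_losses)

# One perplexity value per sentence, across all batches.
ppl = np.exp(np.concatenate(batch_losses))
print(ppl)
```

The key change is keeping the loss per sentence instead of averaging it away over the whole batch; everything downstream is just accumulation and an `exp`.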

demeiyan avatar Sep 10 '20 01:09 demeiyan