bilm-tf
Perplexity per sentence implementation?
I need to get the perplexity per sentence for millions of lines. Splitting them into one file per sentence would be too time-consuming. Is it possible to achieve this by modifying the data loader, e.g. by giving the model input of shape (num_sentences, num_tokens, max_characters_per_token)? The problem is how to pad sentences that don't have enough tokens. If this would work, will such padding affect the state carried over to the next batch? If it wouldn't, are there any other suggestions?
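A minimal sketch of the padding idea, outside of bilm-tf itself: pad a list of tokenized sentences into a dense (num_sentences, max_tokens, max_characters_per_token) array and return a token mask so padded positions can later be excluded from the loss. The function name `pad_batch`, the toy character encoding, and the pad id of 0 are all assumptions for illustration, not bilm-tf's actual `Batcher` logic.

```python
import numpy as np

MAX_CHARS = 50  # assumed max_characters_per_token
PAD_CHAR = 0    # assumed padding id

def pad_batch(sentences, max_chars=MAX_CHARS, pad_char=PAD_CHAR):
    """Pad tokenized sentences into a dense array of shape
    (num_sentences, max_tokens, max_chars), plus a (num_sentences,
    max_tokens) mask that is 1.0 for real tokens and 0.0 for padding."""
    max_tokens = max(len(s) for s in sentences)
    batch = np.full((len(sentences), max_tokens, max_chars),
                    pad_char, dtype=np.int32)
    mask = np.zeros((len(sentences), max_tokens), dtype=np.float32)
    for i, sent in enumerate(sentences):
        for j, token in enumerate(sent):
            # Toy character encoding; bilm-tf uses its own char vocabulary.
            chars = [ord(c) % 256 for c in token][:max_chars]
            batch[i, j, :len(chars)] = chars
            mask[i, j] = 1.0
    return batch, mask
```

With the mask in hand, padded positions can be zeroed out of the per-token loss, so padding should not affect the per-sentence perplexity itself; whether it leaks into the recurrent state carried to the next batch is a separate question about how the model handles state resets.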
Adding a batch_losses list and appending the per-sentence losses to it lets you get the perplexity of every sentence in a batch.