
Generation script

Open chrisdonahue opened this issue 5 years ago • 16 comments

Can you include a simple script for generating text with a pretrained Transformer-XL language model? We are primarily using the PyTorch codebase, but I am sure TensorFlow users would also appreciate such an example.

If including this script is outside the scope of the project repository, could an informal example be provided in this issue thread?

chrisdonahue avatar Mar 11 '19 21:03 chrisdonahue

To be specific, we are looking for an example of true next-step autoregressive prediction (temperature=1) from the learned language model which properly aggregates the memory states across time.

chrisdonahue avatar Mar 11 '19 22:03 chrisdonahue

Perhaps a more productive way to go about this is for us to share the code we have written. Could you please tell us if there are any glaring issues? Any thoughts you have would be highly appreciated :)

First, we modified mem_transformer.py to add a (hacky) forward pass which returns logits rather than log likelihoods:

class MemTransformerLM(nn.Module):
...
    def forward_generate(self, data, *mems):
        if not mems: mems = self.init_mems()

        tgt_len = data.size(0)
        batch_size = data.size(1)
        hidden, new_mems = self._forward(data, mems=mems)
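        # hidden: states for the segment just fed in; new_mems: memory to pass back on the next call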

        pred_hid = hidden[-tgt_len:]

        assert self.crit.n_clusters == 0
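        # Only the single-cluster case is handled: the adaptive softmax then reduces to
        # one ordinary output layer, whose weight, bias, and projection give full-vocab logits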

        logits = self.crit._compute_logit(
            pred_hid.view(-1, pred_hid.size(-1)),
            self.crit.out_layers[0].weight,
            self.crit.out_layers[0].bias,
            self.crit.out_projs[0])
        logits = logits.view(tgt_len, batch_size, -1)

        if new_mems is None:
            return [logits]
        else:
            return [logits] + new_mems

Then, we wrote this generation script to use it in an autoregressive sampling loop:

import torch
import torch.nn.functional as F
import numpy as np

from mem_transformer import MemTransformerLM

MODEL_FP = 'pretrained/model.pt'
USE_CUDA = True
BATCH_SIZE = 1
TGT_LEN = 1
EXT_LEN = 0
MEM_LEN = 2000
CLAMP_LEN = 1000
GEN_LEN = 4000
SAME_LENGTH = True

device = torch.device("cuda" if USE_CUDA else "cpu")

# Load the best saved model
with open(MODEL_FP, 'rb') as f:
    model = torch.load(f)
model.backward_compatible()
model = model.to(device)

# Make sure model uses vanilla softmax
if model.sample_softmax > 0:
  raise NotImplementedError()
if model.crit.n_clusters != 0:
  raise NotImplementedError()

# Change training length/memory attrs
model.reset_length(TGT_LEN, EXT_LEN, MEM_LEN)
if CLAMP_LEN > 0:
  model.clamp_len = CLAMP_LEN
if SAME_LENGTH:
  model.same_length = True

# Turn on evaluation mode which disables dropout.
model.eval()

# Generate sequences of specified length and number
with torch.no_grad():
  # Create buffer for generated sequences
  samples = torch.zeros([0, BATCH_SIZE], dtype=torch.int64).to(device)

  # Initialize state
  prev_token = torch.zeros([TGT_LEN, BATCH_SIZE], dtype=torch.int64).to(device)
  mems = tuple()

  # Autoregressive sampling
  for i in range(GEN_LEN):
    ret = model.forward_generate(prev_token, *mems)

    # Retrieve logits and memory
    logits, mems = ret[0], ret[1:]

    # Ignore <S> (end of sequence) logit
    logits = logits[:, :, 1:]

    # Compute probabilities
    probs = F.softmax(logits, dim=-1)

    # Sample from probabilities
    sampler = torch.distributions.categorical.Categorical(probs=probs)
    token = sampler.sample()

    # Shift by one because we ignored <S> earlier
    token += 1

    # Add new token to buffer and update history
    samples = torch.cat([samples, token], dim=0)
    prev_token = token

# Should be [GEN_LEN, BATCH_SIZE]
print(samples.shape)

Some specific questions:

  • Are we handling the recurrent memory properly?
  • What are appropriate values for CLAMP_LEN?
  • What effect does increasing EXT_LEN have? Is this appropriate in the autoregressive generation context?

chrisdonahue avatar Mar 12 '19 00:03 chrisdonahue

@chrisdonahue I have generation code here https://github.com/lopuhin/transformer-xl/blob/fb11489ca4c6000573d27d5eaca3a641057c0a6a/pytorch/inference.py#L99 which I hope is correct (although I'm not 100% sure). Note that this branch also contains a bunch of other changes. I also observed that, for the models I trained, more than a few words of context are needed to get good-quality samples.

What effect does increasing EXT_LEN have? Is this appropriate in the autoregressive generation context?

I think it's fine to keep it at 0; in some other issues the authors explained that this was an experiment they didn't carry over even to the TF version.

lopuhin avatar Mar 12 '19 11:03 lopuhin

@lopuhin Thank you for replying and providing code.

After a cursory review, it appears you start with a priming sequence, feed a slice of length N = model.tgt_len, and sample from the last timestep of the model's log probs to generate the next token. You then feed another slice of length N consisting of N-1 real tokens and 1 generated token, then N-2 real tokens and 2 generated tokens, and so on. You also made a modification nearly identical to mine, altering ProjectedAdaptiveLogSoftmax to return log probabilities when no target is provided.
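To make sure I'm reading it right, here is a minimal sketch of that loop as I understand it (names like prime and num_steps are mine, not from your branch; it reuses the forward_generate hack from my earlier comment and re-processes the window each step rather than carrying mems, since passing memory along with overlapping windows would push the same tokens into the memory more than once):

import torch

# Sketch only: `model` is a MemTransformerLM with the forward_generate hack,
# `prime` is a 1-D LongTensor of context token ids, `num_steps` is how many
# tokens to generate.
N = model.tgt_len
window = prime[-N:].clone()
generated = []

with torch.no_grad():
  for _ in range(num_steps):
    ret = model.forward_generate(window.view(-1, 1))  # input shape [N, batch=1]
    logits = ret[0][-1, 0]                            # logits at the last timestep
    next_token = torch.distributions.Categorical(logits=logits).sample()
    generated.append(next_token.item())
    # Slide the window: drop the oldest token, append the new one
    window = torch.cat([window[1:], next_token.view(1)])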

Is my understanding correct? If not, can you explain how it deviates?

Have you been able to generate desirable samples using this method (in a qualitative sense)?

Have you tried generating without using a priming sequence? If so, did you pass in a buffer of N zeros as the priming sequence to keep the context size (tgt_len) constant? Or did you grow the model's context size dynamically as the sequence is being generated?

Very curious to also hear from the author (@kimiyoung ) on the ideal way to generate a truly random sample from the Transformer-XL language model.

chrisdonahue avatar Mar 12 '19 20:03 chrisdonahue

Actually, the XL paper was submitted to a conference that does not allow further public PR or updates to paper-related content, which includes XL's generation results. So, at this moment, an easy way of playing with XL generation is to rely on the third-party package pytorch-pretrained-bert.

After installing the third-party package, you can use a snippet from the wiki103 test set as the context and generate some novel text. Example code is as follows:

import sys, os
from io import open
import numpy as np
import torch
import argparse
from pytorch_pretrained_bert import TransfoXLTokenizer, TransfoXLModel, TransfoXLLMHeadModel

import logging
logging.basicConfig(level=logging.INFO)

parser = argparse.ArgumentParser()
parser.add_argument('--max_ctx_len', type=int, default=512, help='')
parser.add_argument('--max_gen_len', type=int, default=512, help='')
parser.add_argument('--topk', type=int, default=40, help='')
parser.add_argument('--start_idx', type=int, default=-1, help='')
parser.add_argument('--out_path', type=str, default='output.txt', help='')
parser.add_argument('--inp_path', type=str, required=True,
    help='path to the WT-103 test.txt file.')
args = parser.parse_args()

def format_text(tokens):
  line = ''
  for token in tokens:
    if token == '<eos>':
      line += '\n'
    else:
      line += token
      line += ' '

  # simple rules of detokenization
  line = line.replace(' @-@ ', '-')
  line = line.replace(' @,@ ', ',')
  line = line.replace(' @.@ ', '.')
  line = line.replace(' . ', '. ')
  line = line.replace(' , ', ', ')
  line = line.replace(' : ', ': ')
  line = line.replace(' ; ', '; ')
  line = line.replace(" 's ", "'s ")
  line = line.replace(' ( ', ' (')
  line = line.replace(' ) ', ') ')

  return line

# Load pre-trained model tokenizer (vocabulary from wikitext 103)
tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103')
for idx, sym in enumerate(tokenizer.idx2sym):
  tokenizer.idx2sym[idx] = sym.encode('latin1').decode('utf-8')

with open(args.inp_path, 'r', encoding='utf-8') as f:
  lines = [l.strip().split() + ['<eos>'] for l in f.readlines()]

# Randomly choose some lines
num_lines = len(lines)

context, reference = [], []

if args.start_idx < 0:
  args.start_idx = np.random.randint(0, num_lines - 40)

idx = args.start_idx
while idx < num_lines:
  context += lines[idx]
  idx += 1
  if len(context) >= args.max_ctx_len:
    break

while idx < num_lines:
  reference += lines[idx]
  idx += 1
  if len(reference) >= args.max_gen_len:
    break

while len(context) > args.max_ctx_len:
  reference.insert(0, context.pop())

# Convert token to vocabulary indices
ctx_tensor = torch.tensor([tokenizer.convert_tokens_to_ids(context)])

# Load pre-trained model (weights)
model = TransfoXLLMHeadModel.from_pretrained('transfo-xl-wt103')
model.eval()

# If you have a GPU, put everything on cuda
ctx_tensor = ctx_tensor.to('cuda')
model.to('cuda')

unk_id = tokenizer.convert_tokens_to_ids(['<unk>'])[0]

with torch.no_grad():
  # Predict all tokens
  tensor = ctx_tensor
  generation = []
  for i in range(args.max_gen_len):
    if i == 0:
      log_prob, mems = model(tensor)
    else:
      log_prob, mems = model(tensor, mems=mems)

    prob = torch.exp(log_prob[0, -1, :])
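    # Zero out the <unk> probability so it is never sampled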
    prob[unk_id].data.fill_(0.)

    # sample from the top-k tokens
    top_prob, top_index = torch.topk(prob, args.topk)
    token = torch.multinomial(top_prob, 1)
    token = top_index[token]

    tensor = token.detach().view(1, 1)

    symbol = tokenizer.get_sym(token.item())

    generation.append(symbol)

with open(args.out_path, 'w', encoding='utf-8') as f:
  f.write('Start line: {}'.format(args.start_idx) + '\n')
  f.write('Context len: {}'.format(len(context)) + '\n')
  f.write('-' * 80 + '\n')
  f.write(format_text(context) + '\n')
  f.write('-' * 80 + '\n')
  f.write(format_text(generation) + '\n')
  f.write('-' * 80 + '\n')
  f.write(format_text(reference[:args.max_gen_len]) + '\n')
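For reference, --inp_path should point at the WikiText-103 test.txt file; the remaining flags (--max_ctx_len, --max_gen_len, --topk, --start_idx, --out_path) control the context length, the number of generated tokens, the top-k cutoff, the starting line in the file, and the output path, and all of them have defaults.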

zihangdai avatar Mar 12 '19 22:03 zihangdai

Is my understanding correct? If not, can you explain how it deviates?

@chrisdonahue Yes, it is correct; the only difference is that I'm growing the sequence that I'm feeding.

Have you been able to generate desirable samples using this method (in a qualitative sense)?

Yes, to a certain degree. I was only experimenting on my own models, trained on Russian news, literature, and subtitles. The model is able to generate long coherent text with a few slips here and there. But often the model quickly finishes the primed sentence and goes off to generate text that seems unrelated to the prime and more similar to the training corpus content; I'm not sure why (though I haven't spent much time on it yet).

Have you tried generating without using a priming sequence? If so, did you pass in a buffer of N zeros as the priming sequence to keep the context size (tgt_len) constant? Or did you grow the model's context size dynamically as the sequence is being generated?

I do grow the context size dynamically, but I didn't try generating without a priming sequence. I tried priming it with just a newline (<eos>), but then it went on to produce only <eos> :)

For me, text generation is not a primary goal, but if I do find something useful, I will share it here.

lopuhin avatar Mar 13 '19 06:03 lopuhin

@lopuhin Awesome many thanks for the clarification!

chrisdonahue avatar Mar 13 '19 06:03 chrisdonahue

Hello, thanks for the clarification and examples. Is there a TensorFlow example?

MoAbd avatar Mar 28 '19 11:03 MoAbd

@chrisdonahue Hello. I'm also trying to generate a sentence starting from only one character (word), but the results are not good. Did you run into the same problem? @zihangdai I noticed that your input is a sentence; does that mean this model cannot generate a correct sentence from only one character (word) of input?

77281900000 avatar Mar 29 '19 03:03 77281900000

@77281900000 I got pretty decent results with one-at-a-time generation. Sometimes, as with most sequence models, the outputs would collapse into very repetitive subsequences. I would recommend lowering the temperature slightly and using top-k sampling (I got decent results with temp=0.95 and k=32).
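The sampling step then looks roughly like the sketch below (not the exact code I used; logits is assumed to be a 1-D tensor over the vocabulary):

import torch
import torch.nn.functional as F

def sample_top_k(logits, temperature=0.95, k=32):
  # Sketch of temperature + top-k sampling over a 1-D logits tensor
  scaled = logits / temperature                # temperature < 1 sharpens the distribution
  top_logits, top_idx = torch.topk(scaled, k)  # keep the k most likely tokens
  probs = F.softmax(top_logits, dim=-1)        # renormalize over the top k
  choice = torch.multinomial(probs, 1)         # sample within the top k
  return top_idx[choice].item()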

chrisdonahue avatar Apr 17 '19 18:04 chrisdonahue

Is there an equivalent version for the TensorFlow implementation? I am looking for a script to generate text from a trained Transformer-XL model.

Thanks!

juggyj avatar Apr 28 '19 06:04 juggyj

Thank you! I did play with this PyTorch version and was able to generate text using the script @zihangdai posted above. However, the pretrained model format seems to be a bit different from a model created with the Transformer-XL training script. The model I am trying to load is "model.pt", which appears to be a different representation from the pretrained model downloaded in that script.

Appreciate any help.

juggyj avatar Jul 18 '19 18:07 juggyj

@chrisdonahue With regard to the script you wrote above, when MEM_LEN is set to 0 this error occurs: RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1533672544752/work/aten/src/THC/THCReduceAll.cuh:317

My text input and output are small, only 128 bytes, so there is no real need for memory here. Could you suggest what is going wrong?

mvedang avatar Aug 08 '19 12:08 mvedang

It doesn't make much sense to set MEM_LEN to 0: the model was trained with memory, and since the script feeds one token at a time (TGT_LEN = 1), disabling the memory leaves the model with no context at all.

chrisdonahue avatar Aug 11 '19 00:08 chrisdonahue

Hello, could you tell me what "temperature" means in your comments? @chrisdonahue

echoyes avatar Nov 05 '19 03:11 echoyes

Temperature refers to manually changing the entropy or "randomness" of the model's predictions. A temperature of 1 implies that you are directly sampling from the model's probability distribution. A temperature of 0 implies "greedy sampling", always taking the most likely token at each timestep (only one possible output from the model). In general, lower temperatures are less random and higher temperatures are more random.
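In code, the usual way to apply a temperature is to divide the logits by it before the softmax, for example (a sketch; logits is a 1-D tensor of the model's raw outputs over the vocabulary):

import torch
import torch.nn.functional as F

def sample_with_temperature(logits, temperature):
  # temperature == 1 samples from the model's own distribution;
  # as temperature approaches 0 this approaches greedy argmax decoding.
  if temperature == 0:
    return torch.argmax(logits).item()
  probs = F.softmax(logits / temperature, dim=-1)
  return torch.multinomial(probs, 1).item()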

chrisdonahue avatar Nov 05 '19 13:11 chrisdonahue