
Repeated content

Open nguyenvo09 opened this issue 7 years ago • 2 comments

I used your code and trained a model to generate new sentences. The problem is that there are so many repeated tokens in generated samples.

Any insight on how to deal with this?

For example, the <unk> token appears many times.

https://pastebin.com/caxz43CQ

nguyenvo09 avatar Sep 04 '18 18:09 nguyenvo09

For how long did you train? What was your final KL/NLL Loss? Also with what min_occ did you train?

Also, when looking at it, the samples actually don't look that bad. Certainly, there is a problem with <unk> tokens in that they might be repeated many times before an <eos> token is finally produced. However, I think that's expected, since the network really does not know what <unk> is, so there can actually be any number of <unk>'s. I think if you move on to another dataset, where the training and validation sets are more similar, you should have fewer <unk>'s produced.
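
If the repeated <unk>'s get in the way, one quick workaround is to mask the <unk> logit before sampling, so the decoder can never pick it. This is only a rough sketch, not something built into the repo; the helper name and the assumption that logits has shape (batch, vocab) with a known unk_idx are mine:

    import torch

    def sample_without_unk(logits, unk_idx, greedy=False):
        # logits: (batch, vocab) unnormalized scores; unk_idx: vocabulary index of <unk>.
        logits = logits.clone()
        logits[:, unk_idx] = float('-inf')   # <unk> gets zero probability after softmax
        if greedy:
            return logits.argmax(dim=-1)
        probs = torch.softmax(logits, dim=-1)
        return torch.multinomial(probs, num_samples=1).squeeze(-1)

This doesn't fix the underlying issue (the model still puts probability mass on <unk>), but it keeps the samples readable.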

timbmg avatar Sep 05 '18 09:09 timbmg

Is this the seq2seq-like model you want to implement? I have run into the same problem. It seems that during training the decoder inputs also have to be sorted by length, while at inference time we have no prior knowledge of the lengths of the sentences we want to generate, so that information is kind of lost. Also, it seems a seq2seq-like decoder can only be implemented as an RNNLM; is that true? (See the code below:)

            # Step-by-step decoding: feed the previously sampled token back in at each step.
            t = 0
            while t < self.max_sequence_length - 1:
                if t == 0:
                    # Start every sequence with the <sos> token.
                    # (Variable/volatile is the old pre-0.4 PyTorch API; torch.no_grad() replaces it.)
                    input_sequence = Variable(torch.LongTensor([self.sos_idx] * batch_size), volatile=True)
                    if torch.cuda.is_available():
                        input_sequence = input_sequence.cuda()
                        outputs        = outputs.cuda()

                input_sequence  = input_sequence.unsqueeze(1)                            # b x 1
                input_embedding = self.embedding(input_sequence)                         # b x 1 x e
                output, hidden  = self.decoder_rnn(input_embedding, hidden)              # b x 1 x h
                logits          = self.outputs2vocab(output)                             # b x 1 x v
                outputs[:, t, :] = nn.functional.log_softmax(logits, dim=-1).squeeze(1)  # b x v
                input_sequence  = self._sample(logits)                                   # next input token
                t += 1

            outputs = outputs.view(batch_size, self.max_sequence_length, self.embedding.num_embeddings)
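
One thing I noticed: the loop above already sidesteps the unknown lengths at inference time by decoding one token at a time, so the missing piece would be stopping each sequence once it emits <eos>. A rough sketch of such a loop (hypothetical names mirroring the snippet above, not the repo's actual code):

    import torch

    # Greedy decoding that stops as soon as every sequence has produced <eos>,
    # so no sentence lengths are needed up front. All module/argument names
    # here are assumptions, not this repo's API.
    def greedy_decode(decoder_rnn, embedding, outputs2vocab, hidden,
                      sos_idx, eos_idx, batch_size, max_len, device='cpu'):
        input_sequence = torch.full((batch_size,), sos_idx, dtype=torch.long, device=device)
        finished = torch.zeros(batch_size, dtype=torch.bool, device=device)
        generated = []
        for _ in range(max_len):
            emb = embedding(input_sequence.unsqueeze(1))      # (b, 1, e)
            output, hidden = decoder_rnn(emb, hidden)         # (b, 1, h)
            logits = outputs2vocab(output.squeeze(1))         # (b, v)
            input_sequence = logits.argmax(dim=-1)            # most likely next token
            generated.append(input_sequence)
            finished |= input_sequence.eq(eos_idx)
            if finished.all():                                # every sequence hit <eos>
                break
        return torch.stack(generated, dim=1)                  # (b, steps_decoded)

And the sorting requirement seems to come from pack_padded_sequence during training; newer PyTorch versions accept enforce_sorted=False there, so it looks like a packing detail rather than something fundamental to the model.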

preke avatar Apr 05 '19 05:04 preke