texar-pytorch
Inconsistent behavior between hparams.dataset.bos_token/vocab.bos_token and hparams.dataset.eos_token/vocab.eos_token
Currently, we set vocab.bos_token and vocab.eos_token to their default values (<BOS> and <EOS>, respectively) even when bos_token and eos_token in hparams are set to the empty string. This creates an inconsistency between the two sets of variables, because the process method uses the special tokens from hparams, not from the vocab.
One use case where this matters is when the input files already contain BOS and EOS tokens and the user does not want any additional tokens added during processing.
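A minimal, self-contained sketch of the inconsistency described above (this mocks the behavior rather than using the actual texar-pytorch Vocab class or dataset API): the vocab silently falls back to the defaults when the hparams token is '', while processing consults hparams directly and therefore adds nothing.

```python
# Simplified illustration -- NOT the real texar-pytorch implementation.
DEFAULT_BOS, DEFAULT_EOS = "<BOS>", "<EOS>"

class Vocab:
    def __init__(self, bos_token: str, eos_token: str):
        # An empty-string token is silently replaced by the default,
        # so vocab.bos_token can disagree with hparams["bos_token"].
        self.bos_token = bos_token or DEFAULT_BOS
        self.eos_token = eos_token or DEFAULT_EOS

def process(tokens, hparams):
    # Processing uses the hparams tokens, so '' means "add nothing".
    bos = [hparams["bos_token"]] if hparams["bos_token"] else []
    eos = [hparams["eos_token"]] if hparams["eos_token"] else []
    return bos + tokens + eos

hparams = {"bos_token": "", "eos_token": ""}
vocab = Vocab(hparams["bos_token"], hparams["eos_token"])

print(vocab.bos_token)                       # "<BOS>" -- default restored
print(process(["hello", "world"], hparams))  # ['hello', 'world'] -- nothing added
```

The two print statements show the mismatch: the vocab reports <BOS> as its BOS token, yet the processed sequence contains no BOS at all.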
Originally posted by @huzecong in https://github.com/asyml/texar-pytorch/pull/53
Does Texar-TF have the same issue?
Are vocab.bos_token and vocab.eos_token used anywhere internally? In Texar-TF, I guess they are only exposed for users to use.
@ZhitingHu This part of the implementation is the same as in Texar-TF. The dataset processing methods only use the tokens from hparams; I don't think vocab.bos_token and vocab.eos_token are used anywhere internally.
But this inconsistency might confuse users. What I would suggest is adding additional fields to hparams, for example boolean flags prepend_bos and append_eos. This would be clearer than setting bos_token to '' to disable prepending BOS.
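The suggestion above could look like the following sketch. Note that prepend_bos and append_eos are proposed names from this discussion, not existing texar-pytorch hparams, and the process function here is a hypothetical stand-in for the real dataset processing method:

```python
# Hypothetical sketch of the proposed fix: explicit boolean flags
# control whether tokens are added, so the token strings themselves
# never need to be overloaded ('' no longer means "disabled").

def process(tokens, hparams, vocab):
    out = list(tokens)
    if hparams.get("prepend_bos", True):   # proposed flag, defaults on
        out.insert(0, vocab["bos_token"])
    if hparams.get("append_eos", True):    # proposed flag, defaults on
        out.append(vocab["eos_token"])
    return out

vocab = {"bos_token": "<BOS>", "eos_token": "<EOS>"}

# Input files already contain BOS/EOS: disable both flags explicitly.
already_tagged = process(["<BOS>", "hi", "<EOS>"],
                         {"prepend_bos": False, "append_eos": False},
                         vocab)
print(already_tagged)  # ['<BOS>', 'hi', '<EOS>'] -- unchanged

# Default behavior: both tokens are added.
print(process(["hi"], {}, vocab))  # ['<BOS>', 'hi', '<EOS>']
```

With this design the vocab's bos_token and eos_token can always keep their real values, and the flags alone express whether they are applied during processing.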