texar-pytorch
Inconsistent behavior between hparams.dataset.bos_token/vocab.bos_token and hparams.dataset.eos_token/vocab.eos_token
Currently, we set vocab.bos_token and vocab.eos_token to their default values (<BOS> and <EOS>, respectively) even when bos_token and eos_token in hparams are set to the empty string. This creates an inconsistency between the two sets of variables, because the process method uses the special tokens from hparams, not from the vocab.
One use case where this matters is when the input files already contain BOS and EOS tokens and the user does not want any additional tokens added during processing.
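A minimal, self-contained sketch of the inconsistency described above (this mocks the behavior rather than using the actual texar-pytorch Vocab class or dataset API): the vocab silently falls back to the defaults when the hparams token is '', while processing consults hparams directly and therefore adds nothing.

```python
# Simplified illustration -- NOT the real texar-pytorch implementation.
DEFAULT_BOS, DEFAULT_EOS = "<BOS>", "<EOS>"

class Vocab:
    def __init__(self, bos_token: str, eos_token: str):
        # An empty-string token is silently replaced by the default,
        # so vocab.bos_token can disagree with hparams["bos_token"].
        self.bos_token = bos_token or DEFAULT_BOS
        self.eos_token = eos_token or DEFAULT_EOS

def process(tokens, hparams):
    # Processing uses the hparams tokens, so '' means "add nothing".
    bos = [hparams["bos_token"]] if hparams["bos_token"] else []
    eos = [hparams["eos_token"]] if hparams["eos_token"] else []
    return bos + tokens + eos

hparams = {"bos_token": "", "eos_token": ""}
vocab = Vocab(hparams["bos_token"], hparams["eos_token"])

print(vocab.bos_token)                       # "<BOS>" -- default restored
print(process(["hello", "world"], hparams))  # ['hello', 'world'] -- nothing added
```

The two print statements show the mismatch: the vocab reports <BOS> as its BOS token, yet the processed sequence contains no BOS at all.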
Originally posted by @huzecong in https://github.com/asyml/texar-pytorch/pull/53
Does Texar-TF have the same issue?
Are vocab.bos_token and vocab.eos_token used anywhere internally? In Texar-TF, I guess they are only exposed for users to use.
@ZhitingHu This part of the implementation is the same as in Texar-TF. The dataset processing methods only use the tokens from hparams; I don't think vocab.bos_token and vocab.eos_token are used anywhere internally.
But this inconsistency might confuse users. What I would suggest is adding additional fields to hparams, for example boolean flags prepend_bos and append_eos. This would be clearer than setting bos_token to '' to disable prepending BOS.
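The suggestion above could look like the following sketch. Note that prepend_bos and append_eos are proposed names from this discussion, not existing texar-pytorch hparams, and the process function here is a hypothetical stand-in for the real dataset processing method:

```python
# Hypothetical sketch of the proposed fix: explicit boolean flags
# control whether tokens are added, so the token strings themselves
# never need to be overloaded ('' no longer means "disabled").

def process(tokens, hparams, vocab):
    out = list(tokens)
    if hparams.get("prepend_bos", True):   # proposed flag, defaults on
        out.insert(0, vocab["bos_token"])
    if hparams.get("append_eos", True):    # proposed flag, defaults on
        out.append(vocab["eos_token"])
    return out

vocab = {"bos_token": "<BOS>", "eos_token": "<EOS>"}

# Input files already contain BOS/EOS: disable both flags explicitly.
already_tagged = process(["<BOS>", "hi", "<EOS>"],
                         {"prepend_bos": False, "append_eos": False},
                         vocab)
print(already_tagged)  # ['<BOS>', 'hi', '<EOS>'] -- unchanged

# Default behavior: both tokens are added.
print(process(["hi"], {}, vocab))  # ['<BOS>', 'hi', '<EOS>']
```

With this design the vocab's bos_token and eos_token can always keep their real values, and the flags alone express whether they are applied during processing.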