nn4nlp-code
Computing the number of words
Most files share similar data reading code, like
https://github.com/neubig/nn4nlp-code/blob/a9e8be5b101cc1de50f27d918187d6271fc26c8d/01-intro/cbow.py#L18-L22
In most of the examples, the variable nwords is used as the effective vocabulary size, for instance when allocating parameters for the embedding matrix.
https://github.com/neubig/nn4nlp-code/blob/a9e8be5b101cc1de50f27d918187d6271fc26c8d/01-intro/cbow.py#L30
However, the dev/test set likely contains many new words, and each of them is added to w2i as a new key when it is looked up. Their values are mapped to UNK, but they are still counted in len(w2i), which is likely not intended. Often this overcounting does not change the results, but it can be problematic in some cases.
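
The overcounting can be reproduced with a minimal sketch like the one below. It assumes the defaultdict pattern used in the linked snippet; the toy sentences and the names frozen_w2i, vocab, and dev_ids_fixed are hypothetical and only serve the illustration. The last part shows one possible workaround (a plain dict snapshot queried with .get()), not necessarily the fix the repository should adopt.

```python
from collections import defaultdict

# Vocabulary that assigns the next free index to every new word on lookup.
w2i = defaultdict(lambda: len(w2i))
UNK = w2i["<unk>"]

# Toy data (hypothetical, not taken from the repository).
train_words = "i love this movie".split()
dev_words = "i hate this film".split()  # "hate" and "film" are unseen

train_ids = [w2i[w] for w in train_words]
nwords_train = len(w2i)  # 5: "<unk>" plus four training words

# Plain-dict snapshot of the training vocabulary, used for the workaround.
vocab = dict(w2i)

# Freezing with a defaultdict maps unseen words to UNK,
# but every such lookup still inserts a new key.
frozen_w2i = defaultdict(lambda: UNK, w2i)
dev_ids = [frozen_w2i[w] for w in dev_words]
nwords_overcounted = len(frozen_w2i)  # 7: two extra keys that both map to UNK

# Possible workaround: dict.get() returns UNK without inserting anything.
dev_ids_fixed = [vocab.get(w, UNK) for w in dev_words]
nwords_fixed = len(vocab)  # 5

print(nwords_train, nwords_overcounted, nwords_fixed)
```

An alternative with the same effect would be to keep the existing code and simply compute nwords = len(w2i) before reading the dev/test data.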