nn4nlp-code Computing the number of words

Open danishpruthi opened this issue 5 years ago • 0 comments

Most files share similar data-reading code, for example:

https://github.com/neubig/nn4nlp-code/blob/a9e8be5b101cc1de50f27d918187d6271fc26c8d/01-intro/cbow.py#L18-L22

In most of the examples, the variable nwords is used as the effective vocabulary size, for instance when allocating the parameters of the embedding matrix:

https://github.com/neubig/nn4nlp-code/blob/a9e8be5b101cc1de50f27d918187d6271fc26c8d/01-intro/cbow.py#L30

However, the dev/test set likely contains many new words, and looking them up may add them to w2i. Their values are mapped to UNK, but they are still counted in len(w2i), which is likely not intended. This overcounting often does not change the results, but it can be problematic in some cases.
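Here is a minimal sketch of the behaviour, assuming the reading code follows the usual defaultdict pattern (build w2i on the training data, then re-wrap it so unseen words default to UNK). The toy sentences and the names train_vocab, nwords_before, nwords_after and dev_safe are made up for illustration and are not taken from the repository.

```python
from collections import defaultdict

# Toy stand-ins for the train and dev/test corpora (hypothetical data).
train_sentences = [["i", "like", "cats"], ["i", "like", "dogs"]]
dev_sentences = [["i", "like", "birds"], ["you", "like", "cats"]]

# Build the vocabulary on the training data: each new word gets the next free index.
w2i = defaultdict(lambda: len(w2i))
UNK = w2i["<unk>"]
train = [[w2i[w] for w in sent] for sent in train_sentences]

# Plain-dict snapshot of the training vocabulary (used for the possible fix below).
train_vocab = dict(w2i)

# "Freeze" the vocabulary: unseen words now default to UNK.
w2i = defaultdict(lambda: UNK, w2i)
nwords_before = len(w2i)

# Indexing a defaultdict with a missing key still inserts that key,
# so every unseen dev word becomes a new entry whose value is UNK.
dev = [[w2i[w] for w in sent] for sent in dev_sentences]
nwords_after = len(w2i)

print(nwords_before, nwords_after)  # the second number is larger

# One possible fix: look words up with dict.get, which never inserts keys.
dev_safe = [[train_vocab.get(w, UNK) for w in sent] for sent in dev_sentences]
print(len(train_vocab))  # unchanged by the dev lookups
```

Another option under the same assumption would be to compute nwords right after reading the training set, before any dev/test lookups.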

Sep 05 '19 17:09 danishpruthi
