[FEA] DLLabelEncoder kwarg for raising errors on out-of-vocabulary

Open benfred opened this issue 5 years ago • 0 comments

Issue by alecgunny Tuesday Mar 24, 2020 at 19:44 GMT Originally opened as https://github.com/rapidsai/recsys/issues/29

Is your feature request related to a problem? Please describe.

By default, DLLabelEncoder reserves 0 for missing or out-of-vocabulary entries. While this is a sensible default, I can imagine scenarios where you know all the categories explicitly beforehand, and any sample with a value outside of these categories is problematic and should raise an error. In this case, the categories would map to [0, num_categories-1].

Describe the solution you'd like

Add a kwarg to DLLabelEncoder that can toggle this behavior. One possibility, modeled on the num_oov_buckets kwarg of TensorFlow's tf.feature_column.categorical_column_with_vocabulary_list, is a num_oov_buckets kwarg that defaults to 1 (preserving the current behavior) but can be set to 0 to indicate that no out-of-vocabulary inputs should be tolerated.
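To make the requested toggle concrete, here is a minimal pure-Python sketch. The function name `encode_labels` and its signature are hypothetical and are not part of DLLabelEncoder's actual API; the sketch only illustrates the two behaviors described above (reserve 0 for out-of-vocabulary values, or raise).

```python
def encode_labels(values, vocabulary, num_oov_buckets=1):
    """Hypothetical stand-in for the proposed kwarg; not the DLLabelEncoder API."""
    if num_oov_buckets not in (0, 1):
        raise ValueError("this sketch only handles num_oov_buckets of 0 or 1")

    # With one OOV bucket, index 0 is reserved and categories start at 1.
    # With none, categories map to [0, num_categories - 1].
    offset = num_oov_buckets
    index = {category: i + offset for i, category in enumerate(vocabulary)}

    encoded = []
    for value in values:
        if value in index:
            encoded.append(index[value])
        elif num_oov_buckets == 0:
            # No OOV inputs are tolerated: fail loudly instead of silently.
            raise ValueError(f"out-of-vocabulary value: {value!r}")
        else:
            encoded.append(0)
    return encoded
```

With the default, `encode_labels(["a", "c", "z"], ["a", "b", "c"])` sends the unknown `"z"` to the reserved index 0, while passing `num_oov_buckets=0` with the same inputs raises a `ValueError`.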

As a possible, but not strictly necessary, addition, values greater than 1 could be used to hash OOV inputs into multiple bins. In that case, it is unclear whether OOV inputs should be assigned to the first num_oov_buckets integers, i.e. [0, num_oov_buckets-1], or to the range [num_categories, num_categories+num_oov_buckets-1].
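The two candidate index layouts can be compared with a short sketch. Everything here is an assumption made for illustration: the names `oov_bucket` and `encode_with_oov_buckets`, the `oov_first` flag, and the choice of CRC32 as the hash (picked only because it is stable across runs, unlike Python's built-in `hash` on strings).

```python
import zlib


def oov_bucket(value, num_oov_buckets):
    # Deterministic hash of the value's string form, reduced to a bucket id.
    return zlib.crc32(str(value).encode("utf-8")) % num_oov_buckets


def encode_with_oov_buckets(values, vocabulary, num_oov_buckets, oov_first=True):
    """Hypothetical sketch of hashing OOV inputs into several buckets."""
    if num_oov_buckets < 1:
        raise ValueError("this sketch expects at least one OOV bucket")

    num_categories = len(vocabulary)
    if oov_first:
        # Layout A: OOV in [0, num_oov_buckets - 1], categories follow.
        category_offset, oov_offset = num_oov_buckets, 0
    else:
        # Layout B: categories in [0, num_categories - 1], OOV follow.
        category_offset, oov_offset = 0, num_categories

    index = {category: i + category_offset for i, category in enumerate(vocabulary)}
    return [
        index[value] if value in index
        else oov_offset + oov_bucket(value, num_oov_buckets)
        for value in values
    ]
```

In both layouts the total number of indices is num_categories + num_oov_buckets; the only difference is whether known categories keep stable indices starting at 0 (layout B) or are shifted whenever num_oov_buckets changes (layout A).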

Describe alternatives you've considered

I'm open to the argument that when out-of-vocabulary values are unacceptable, the onus is on the data scientist to make sure none appear in the data being fed in. But mapping them to 0 without complaint feels like a silent failure, which isn't desirable. It also forces users to reserve category 0 for a value that will never occur, which can be a minor inconvenience.

Jun 04 '20 23:06 benfred