
how to generate wide sparse features

Open kiminh opened this issue 3 years ago • 2 comments

Hi, I'm confused about how to generate the wide sparse features. Here is my understanding: the categorical features from multiple fields are combined to form one multi-hot sparse feature. Is the index then generated from a hash value, or in a similar way to label encoding?

kiminh avatar Nov 29 '21 15:11 kiminh

To be concrete: every categorical field has its own vocabulary, so multiple fields have multiple vocabularies. Is the vocabulary of the multi-hot sparse feature then the union of those vocabularies, with each field's values indexed against that union? Or should I just hash a string such as "field_name:categorical feature value" to get the index? Hashing may produce some collisions, but it avoids having to maintain the whole vocabulary.
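The two indexing options described above can be sketched roughly as follows. This is only an illustration of the general idea, not DeText code: `build_union_vocab`, `hash_index`, the example fields, and the bucket count are all hypothetical, and MD5 is used just to get a hash that is stable across runs.

```python
import hashlib


def build_union_vocab(field_values):
    """Option 1 (hypothetical helper): one shared vocabulary over all fields.

    Each (field, value) pair gets its own index, so equal values in
    different fields do not clash.
    """
    vocab = {}
    for field in sorted(field_values):
        for value in sorted(field_values[field]):
            vocab[(field, value)] = len(vocab)
    return vocab


def hash_index(field, value, num_buckets):
    """Option 2 (hypothetical helper): hash "field:value" into a fixed range.

    No vocabulary has to be stored, but distinct keys can collide.
    """
    key = f"{field}:{value}".encode("utf-8")
    digest = hashlib.md5(key).hexdigest()
    return int(digest, 16) % num_buckets


# Made-up example fields, for illustration only.
field_values = {"country": {"us", "cn"}, "device": {"ios", "android", "web"}}
vocab = build_union_vocab(field_values)
hashed = hash_index("country", "us", num_buckets=1000)
```

With the union vocabulary the index space is exactly as large as the number of distinct (field, value) pairs, while with hashing the size is chosen up front and collisions become more likely as the bucket count shrinks.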

kiminh avatar Nov 29 '21 15:11 kiminh

Hi @kiminh, I assume that your question is about DeText-TF2. In DeText-TF2, each sparse feature field (the wide part) is a multi-hot vector. This vector should be generated by the user beforehand (e.g. by hashing). The vocab size can be passed to DeText through nums_sparse_ftrs.

The vocabs of the fields are independent of each other. There's no correlation between them.
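As a rough sketch of what "one multi-hot vector per field, each with its own vocab" could look like when the vectors are produced by hashing: the helper names, field names, and sizes below are assumptions made for illustration, and the snippet does not show the input format DeText actually reads.

```python
import hashlib


def bucket(value, num_buckets):
    """Hypothetical helper: stable hash of a string into [0, num_buckets)."""
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


def multi_hot(values, num_buckets):
    """Build a dense multi-hot vector of length num_buckets for one field."""
    vec = [0] * num_buckets
    for v in values:
        vec[bucket(v, num_buckets)] = 1
    return vec


# Assumed per-field vocab sizes; each field is hashed into its own index space.
field_sizes = {"skills": 16, "industries": 8}

# One made-up example with several values per field.
example = {"skills": ["python", "sql"], "industries": ["software"]}

vectors = {field: multi_hot(example[field], size)
           for field, size in field_sizes.items()}
```

Because every field has its own index space, the same raw value appearing in two fields is indexed separately in each, which matches the point that the vocabs are independent.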

StarWang avatar Dec 01 '21 06:12 StarWang