Create a token/embedding creation preprocessing pipeline using tf-transform

Open: iislucas opened this issue 6 years ago • 1 comment

Issue: We currently depend on vocabularies, like GloVe embeddings, that:

  1. Are weirdly biased (although once you backprop into the embeddings, their initial bias is no longer very relevant),
  2. Have to stay consistent with the tokenizer we use,
  3. Don't necessarily contain the same words as our actual text.

Proposed solution project: Use https://github.com/tensorflow/transform to develop text preprocessing pipelines, e.g. to select the tokens that occur sufficiently frequently in our own data and create either random or smarter word embeddings for them.
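To make the proposal concrete, here is a minimal plain-Python sketch of the idea: count tokens over the corpus, keep those above a frequency threshold, and give each kept token a randomly initialized embedding row. It does not use tf-transform; all function names, the whitespace tokenizer, the `min_count` threshold, and the embedding range are assumptions made for illustration. In tf-transform the vocabulary step would probably map onto `tft.compute_and_apply_vocabulary` with its `frequency_threshold` argument, but that wiring is not shown here.

```python
import random
from collections import Counter


def build_vocab(texts, min_count=2, tokenize=str.split):
    """Map each token seen at least `min_count` times to an integer id."""
    counts = Counter(tok for text in texts for tok in tokenize(text.lower()))
    # Sort by descending count, then alphabetically, so ids are deterministic.
    kept = sorted(
        (tok for tok, c in counts.items() if c >= min_count),
        key=lambda tok: (-counts[tok], tok),
    )
    return {tok: i for i, tok in enumerate(kept)}


def random_embeddings(vocab, dim=8, seed=0):
    """Return one small random vector per vocab entry, plus one out-of-vocabulary row."""
    rng = random.Random(seed)
    return [
        [rng.uniform(-0.05, 0.05) for _ in range(dim)]
        for _ in range(len(vocab) + 1)  # last row is shared by all OOV tokens
    ]


def encode(text, vocab, tokenize=str.split):
    """Turn text into token ids, using len(vocab) as the OOV id."""
    oov_id = len(vocab)
    return [vocab.get(tok, oov_id) for tok in tokenize(text.lower())]


# Hypothetical toy corpus, only to show the flow end to end.
texts = ["the cat sat", "the dog sat", "a bird flew"]
vocab = build_vocab(texts, min_count=2)
table = random_embeddings(vocab, dim=4)
ids = encode("the bird sat", vocab)
```

Because the same `tokenize` function builds the vocabulary and encodes new text, the tokenizer and vocabulary cannot drift apart, and every kept word is guaranteed to occur in our own data. The random rows are only a starting point that training would then update.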

iislucas, Jul 02 '18 21:07

FYI: Not sure if this helps, but here is a basic example using tf-transform (tft): https://github.com/tensorflow/transform/blob/master/examples/sentiment_example.py

fprost, Jul 17 '18 16:07