Create a token/embedding preprocessing pipeline using tf-transform
Issue: We currently depend on pretrained vocabularies, like GloVe embeddings, that:
- Carry odd biases (although once we backprop into the embeddings, their initial bias matters much less),
- Must stay consistent with the tokenizer we use,
- Don't necessarily cover the same words as our actual text.
Proposed solution: Use https://github.com/tensorflow/transform (tf.Transform) to develop text preprocessing pipelines, e.g. to select only tokens that occur sufficiently frequently in our own data, and to create either random or smarter word embeddings for them. A sketch of what the pipeline side could look like is below.
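A minimal sketch of the tf.Transform side, assuming a raw feature named `text`, naive whitespace tokenization, and an illustrative frequency threshold; the feature name, the threshold, and the `text_vocab` filename are all hypothetical, not anything this repo defines:

```python
import tensorflow as tf
import tensorflow_transform as tft

def preprocessing_fn(inputs):
  """Tokenizes raw text and maps tokens to integer ids using a
  vocabulary computed over the full dataset in the analyze pass."""
  tokens = tf.strings.split(inputs['text'])  # naive whitespace tokenizer
  token_ids = tft.compute_and_apply_vocabulary(
      tokens,
      frequency_threshold=5,     # keep only sufficiently frequent tokens
      num_oov_buckets=1,         # rare/unseen tokens share one OOV id
      vocab_filename='text_vocab')
  return {'token_ids': token_ids}
```

Because the vocabulary is computed from our own training data with the same tokenizer the model uses, the tokenizer-consistency and coverage problems above go away by construction.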
FYI: not sure if it helps, but here is a basic tf.Transform example: https://github.com/tensorflow/transform/blob/master/examples/sentiment_example.py
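On the model side, one hedged sketch of the "random embeddings" option: size a freshly initialized embedding table from the vocabulary file that tf.Transform wrote (`working_dir` and the `text_vocab` filename just match the assumptions in the sketch above):

```python
import tensorflow as tf
import tensorflow_transform as tft

def build_embedding_table(working_dir, embedding_dim=128):
  """Randomly initialized embeddings sized to the learned vocabulary."""
  tft_output = tft.TFTransformOutput(working_dir)
  vocab_size = tft_output.vocabulary_size_by_name('text_vocab')
  # +1 row for the single OOV bucket used in the preprocessing sketch.
  return tf.Variable(
      tf.random.truncated_normal([vocab_size + 1, embedding_dim], stddev=0.1),
      name='token_embeddings')
```

The ids produced by `preprocessing_fn` can then be looked up with `tf.nn.embedding_lookup`; a "smarter" initialization (e.g. seeding rows from GloVe where a token exists there) could replace the random init without changing the pipeline.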