
Feature engineering

Open fsx950223 opened this issue 2 years ago • 4 comments

I have a question about feature engineering. Why do you use chars as inputs instead of words? For example,

Hello world!
<tf.Tensor: shape=(12,), dtype=string, numpy=
array([b'H', b'e', b'l', b'l', b'o', b' ', b'w', b'o', b'r', b'l', b'd',
       b'!'], dtype=object)>
ngrams: <tf.Tensor: shape=(11,), dtype=string, numpy=
array([b'H e', b'e l', b'l l', b'l o', b'o  ', b'  w', b'w o', b'o r',
       b'r l', b'l d', b'd !'], dtype=object)>

is better than

<tf.Tensor: shape=(3,), dtype=string, numpy=array([b'Hello', b'world', b'!'], dtype=object)>
ngrams: <tf.Tensor: shape=(2,), dtype=string, numpy=array([b'Hello world', b'world !'], dtype=object)>

?
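For reference, the two tokenizations compared above can be reproduced without TensorFlow. This is a minimal sketch, not guesslang's actual preprocessing: the helper name `ngrams` and the word-splitting regular expression are assumptions made for illustration, and it splits on Python characters rather than bytes (the two agree for ASCII input such as this one).

```python
import re


def ngrams(tokens, n=2, sep=" "):
    """Join every run of n consecutive tokens with sep (hypothetical helper)."""
    return [sep.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


text = "Hello world!"

# Character-level: every character, including the space, is a token.
char_tokens = list(text)
char_bigrams = ngrams(char_tokens)

# Word-level: runs of word characters, with each punctuation mark kept
# as its own token (assumed splitting rule, chosen to match the example).
word_tokens = re.findall(r"\w+|[^\w\s]", text)
word_bigrams = ngrams(word_tokens)

print(char_bigrams)
print(word_bigrams)
```

The character-level split yields 11 short bigrams drawn from a small alphabet, while the word-level split yields only 2 bigrams, each of which is far less likely to reappear in another input.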

fsx950223 avatar Sep 10 '21 12:09 fsx950223

In order to use a tflite model, you have to convert strings to token ids, such as 'He' -> 1332. @yoeo

fsx950223 avatar Sep 25 '21 08:09 fsx950223

Hi @fsx950223

Why do you use chars as inputs instead of words?

In fact, I tested both chars and words with various preprocessing tricks and chose the option that gave the best predictions with the current model and training dataset. If one day I switch to a new machine learning model or change the way I build the training dataset, I'll have to test the different preprocessing options again and choose the best one, and it could be "words" that time.

By the way, if you know any general rule about when to use chars or words for feature engineering, I'll be happy to learn and test it :slightly_smiling_face:

yoeo avatar Sep 27 '21 22:09 yoeo

In order to use a tflite model, you have to convert strings to token ids, such as 'He' -> 1332.

In theory yes. You probably could use tflite by:

  1. hacking the trained model to take integer inputs instead of string ones
  2. extracting the string -> integer mappings from the model
  3. converting the hacked trained model (without the mappings) to tflite
  4. using the extracted mappings to convert your input strings into integer inputs
  5. sending the integer inputs to the new tflite model to generate predictions

I don't know if it will actually work, but if you find a way to make it work, please share the details here: https://github.com/yoeo/guesslang/issues/26
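As a rough illustration of the "extract the mappings" and "convert strings into integers" steps listed above, here is a minimal pure-Python sketch. Everything in it is an assumption: the mapping is built from a sample string instead of being extracted from the trained model, the names `build_mapping`, `encode` and `UNKNOWN_ID` are hypothetical, and the fixed output length is just a guess at what a converted model might expect. Running the tflite model itself is left out.

```python
from typing import Dict, List

UNKNOWN_ID = 0  # hypothetical id, used both for unseen n-grams and for padding


def char_bigrams(text: str) -> List[str]:
    """Split text into characters and join each adjacent pair with a space."""
    chars = list(text)
    return [f"{a} {b}" for a, b in zip(chars, chars[1:])]


def build_mapping(samples: List[str]) -> Dict[str, int]:
    """Stand-in for the extraction step: give each bigram seen an id from 1."""
    mapping: Dict[str, int] = {}
    for sample in samples:
        for gram in char_bigrams(sample):
            mapping.setdefault(gram, len(mapping) + 1)
    return mapping


def encode(text: str, mapping: Dict[str, int], length: int) -> List[int]:
    """Turn a string into a list of exactly `length` integer ids."""
    ids = [mapping.get(gram, UNKNOWN_ID) for gram in char_bigrams(text)]
    ids = ids[:length]  # truncate long inputs
    return ids + [UNKNOWN_ID] * (length - len(ids))  # pad short inputs


mapping = build_mapping(["Hello world!"])
print(encode("Hello", mapping, length=8))
```

In a real setup the mapping would have to reproduce exactly what the trained model does internally (vocabulary lookup or hashing), otherwise the integer ids would not line up with the learned weights.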

yoeo avatar Sep 27 '21 23:09 yoeo

To improve model performance, I recommend tf.keras.layers.TextVectorization + a FastText model, which is similar to the current model. For more details, take a look at https://www.tensorflow.org/text/guide/word_embeddings

fsx950223 avatar Sep 28 '21 12:09 fsx950223