build_vocab for custom data
❓ Questions and Help
Description
Hello, I want to train custom data with an LSTM. The data looks like:
text, value
a a a a b a ...    0.2
a b a a b a ...    0.24
...
b a a a b a ...    0.512
The value is a probability and can be any float between 0 and 1.
TEXT.build_vocab(train_data)
VALUE.build_vocab(<***>)
What should I insert in place of <***> so that VALUE can be any float?
Thanks
@pberko would you be able to clarify what you are asking here? If you want to build a vocab we expose 2 factory methods:
- `vocab()` - creates a vocab object from an `ordered_dict`
- `build_vocab_from_iterator()` - creates a vocab object from an iterator -> likely what you need
If you want to see usage examples of how to build the vocab from a file, take a look at the example code snippet in the docstring.
The output of the vocab will just be integer indices mapping your tokens to values based on frequency. During the vocab-building step, you won't be passing in the float value from your dataset. Instead, you will be utilizing that value in your training step once you've finished all the preprocessing steps (tokenizing, building the vocab, vectorization, etc.).
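For reference, something along these lines is roughly what that looks like in practice. This is just a sketch: the `train.tsv` file name, the tab-separated "sequence<TAB>value" format, and the whitespace tokenizer are assumptions about your data, not anything the library prescribes.

```python
from torchtext.vocab import build_vocab_from_iterator

# Hypothetical input file: one sample per line, e.g. "a a a a b a<TAB>0.2".
# Only the token part feeds the vocab; the float label is ignored here.
def yield_tokens(path):
    with open(path, encoding="utf-8") as f:
        for line in f:
            text, _value = line.rstrip("\n").rsplit("\t", 1)
            yield text.split()  # whitespace tokenization

vocab = build_vocab_from_iterator(
    yield_tokens("train.tsv"), specials=["<unk>", "<pad>"]
)
vocab.set_default_index(vocab["<unk>"])

# The vocab just maps tokens to integer indices based on frequency,
# e.g. vocab(["a", "b", "a"]) -> something like [2, 3, 2].
```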
Let me know if that helps answer your question!
Hi @Nayef211,
Thanks for the answer. I'm afraid I did not make myself clear:
My data contains sentences in an unknown language (just "a a a b b ...") and a probability between 0 and 1 for each sentence. I want to train on this data.
Since the value is a probability, I thought it should be able to take any float value, not just the probabilities that appear in the training set file.
Is it possible to use `build_vocab_from_iterator()` for any value between 0 and 1?
Thanks
Hi @pberko, I am not sure why you want to add the sentence probability to the vocabulary. Could you please provide additional details on what exactly you are trying to build?
Hello @parmeet, I have a machine which prints output made of a few characters, e.g. only a/b/c.
The output looks like "a b b b ...". Sometimes the machine ends with a failure, and I want to predict, when the output starts with "a b b", what the probability of failure is.
For that reason I want to train a model that predicts, for a given input, the probability of a failure. Thanks
Thanks @pberko for the additional details here. I think you might be mixing labels with your input. If I understand correctly, your data is a collection of token sequences ("a b b b ...") and associated probabilities of failure (float values between 0 and 1). Is that right?
If so, as far as the vocabulary is concerned, you could simply add all the possible unique tokens that could occur in the sequences (a, b, c, etc.). The probability is essentially your label (something you want to predict), which need not be part of the vocabulary, right?
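To make the split concrete, here is a rough sketch under that framing (the toy samples, the special tokens, and the tensor shapes are purely illustrative): the vocab only ever sees the tokens, while the float stays outside it as the target you hand to your model during training.

```python
import torch
from torchtext.vocab import build_vocab_from_iterator

# Toy samples in the shape described above: (sequence of tokens, failure probability).
samples = [
    ("a a a a b a".split(), 0.2),
    ("a b a a b a".split(), 0.24),
    ("b a a a b a".split(), 0.512),
]

# The vocabulary is built only from the tokens; the probabilities never enter it.
vocab = build_vocab_from_iterator(
    (tokens for tokens, _prob in samples), specials=["<pad>", "<unk>"]
)
vocab.set_default_index(vocab["<unk>"])

# Vectorizing one sample: tokens become integer indices, while the probability
# stays a plain float and is used as the (regression) target for the model.
tokens, prob = samples[0]
token_ids = torch.tensor(vocab(tokens))  # e.g. tensor([2, 2, 2, 2, 3, 2])
target = torch.tensor([prob])            # tensor([0.2000])
```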
Exactly @parmeet. Do you have an example of how to train an LSTM for such a problem?
Thanks