text build_vocab for custom data

❓ Questions and Help

Description

Hello, I want to train an LSTM on custom data. The data looks like:

text, value
a a a a b a ... 0.2
a b a a b a ... 0.24
....
b a a a b a ... 0.512

The value is a probability and can be any float between 0 and 1.

TEXT.build_vocab(train_data)
VALUE.build_vocab(<***>)

What should I put in place of <***> so that VALUE can be any float?

Thanks

Dec 10 '21 07:12 pberko

@pberko would you be able to clarify what you are asking here? If you want to build a vocab we expose 2 factory methods:

vocab() - creates a vocab object from an ordered_dict
build_vocab_from_iterator() - creates a vocab object from an iterator -> likely what you need

If you want to see usage examples of how to build the vocab from a file, take a look at the example code snippet in the doc string

The vocab only maps your tokens to integer indices, assigned based on token frequency. During the vocab building step, you won't be passing in the float values from your dataset. Instead, you will use those values in your training step, once you've finished all the preprocessing steps (tokenizing, building the vocab, vectorization, etc.).
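To illustrate that separation, here is a minimal sketch in plain Python. It is a stand-in for what a vocab factory does, not torchtext's API: the `rows`, `tokenize`, `stoi`, and `encode` names are hypothetical, and the rows reuse the shortened sample lines from the question. The point is only that the vocab is built from the text column, while the float column is kept aside as labels.

```python
from collections import Counter

# Hypothetical rows in the "text, value" layout from the question.
rows = [
    ("a a a a b a", 0.2),
    ("a b a a b a", 0.24),
    ("b a a a b a", 0.512),
]


def tokenize(text):
    # Whitespace tokenizer; enough for space-separated symbols like "a b a".
    return text.split()


# Count tokens over the text column only; the float column is never touched here.
counter = Counter(tok for text, _ in rows for tok in tokenize(text))

# Index 0 is reserved for unknown tokens; the rest are ordered by frequency.
itos = ["<unk>"] + [tok for tok, _ in counter.most_common()]
stoi = {tok: idx for idx, tok in enumerate(itos)}


def encode(text):
    # Map each token to its integer index, falling back to <unk>.
    return [stoi.get(tok, stoi["<unk>"]) for tok in tokenize(text)]


inputs = [encode(text) for text, _ in rows]   # integer sequences for the model
labels = [float(value) for _, value in rows]  # regression targets, used only in training
```

Because the labels never pass through the vocab, any float between 0 and 1 is a valid target, including values that do not appear in the training file.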

Let me know if that helps answer your question!

Dec 10 '21 16:12 Nayef211

Hi @Nayef211,

Thanks for the answer. I'm afraid I did not make myself clear:

My data contains sentences in an unknown language (just "a a a b b ...") and a probability between 0 and 1 for each sentence. I want to train a model on this data.

Since the value is a probability, I thought it must be allowed to be any float, not just the probabilities that appear in the training set file.

Is it possible to use build_vocab_from_iterator() for any value between 0 and 1?

Thanks

Dec 11 '21 16:12 pberko

Hi @pberko, I am not sure why you want to add the sentence probability to the vocabulary. Could you please provide additional details on what exactly you are trying to build?

Dec 13 '21 22:12 parmeet

Hello @parmeet, I have a machine that prints output made of a few characters, e.g. only a/b/c.

The output looks like "a b b b ...". Sometimes the machine ends with a failure, and I want to predict, when the output starts with "a b b", what the probability of failure is.

For that reason, I want to train a model that predicts the probability of failure for a given input. Thanks

Dec 14 '21 07:12 pberko

Thanks @pberko for the additional details here. I think you might be mixing labels with your input. If I understand correctly, your data is a collection of token sequences ("a b b b ...") and associated failure probabilities (float values between 0 and 1). Is that right?

If so, as far as the vocabulary is concerned, you could simply add all the possible unique tokens that could occur in the sequence (a, b, c, etc.). The probability is essentially your label (something you want to predict), which need not be part of the vocabulary, right?
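As a rough sketch of that setup in PyTorch, assuming a three-token vocabulary and mean squared error on a sigmoid output: the `FailureProbLSTM` class, the layer sizes, and the hyperparameters below are all illustrative choices, not something prescribed in this thread, and the three rows are the shortened sample lines from the question. The sequences here all have the same length, so no padding is handled.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Assumed vocabulary: only the tokens themselves; the probabilities are not part of it.
stoi = {"a": 0, "b": 1, "c": 2}

# Sample rows in the "text, value" layout from the question (equal length, so no padding).
rows = [
    ("a a a a b a", 0.2),
    ("a b a a b a", 0.24),
    ("b a a a b a", 0.512),
]


def encode(text):
    return [stoi[tok] for tok in text.split()]


class FailureProbLSTM(nn.Module):
    def __init__(self, vocab_size, embed_dim=8, hidden_dim=16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, token_ids):
        embedded = self.embed(token_ids)           # (batch, seq, embed_dim)
        _, (h_n, _) = self.lstm(embedded)          # h_n: (1, batch, hidden_dim)
        logits = self.head(h_n[-1]).squeeze(-1)    # (batch,)
        return torch.sigmoid(logits)               # squash into the 0-1 range


x = torch.tensor([encode(text) for text, _ in rows])  # integer token indices
y = torch.tensor([value for _, value in rows])        # float labels

model = FailureProbLSTM(vocab_size=len(stoi))
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

losses = []
for _ in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

# Predict a failure probability for a prefix such as "a b b".
with torch.no_grad():
    pred = model(torch.tensor([encode("a b b")]))
```

The sigmoid keeps every prediction between 0 and 1, so the model can output values that never appeared as labels in the training file. With variable-length sequences you would additionally need padding and, ideally, packed sequences so that the final hidden state ignores the padded steps.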

Dec 16 '21 23:12 parmeet

Exactly, @parmeet. Do you have an example of how to train an LSTM for such a problem?

Thanks

Dec 17 '21 04:12 pberko
