
build_vocab for custom data

Open pberko opened this issue 4 years ago • 6 comments

❓ Questions and Help

Description

Hello, I want to train an LSTM on custom data. The data looks like this:

text, value
a a a a b a ... 0.2
a b a a b a ... 0.24
....
b a a a b a ... 0.512

The value is a probability and can be any float between 0 and 1.

TEXT.build_vocab(train_data)
VALUE.build_vocab(<***>)

What should I put in place of <***> so that VALUE can be any float?

Thanks

pberko avatar Dec 10 '21 07:12 pberko

@pberko would you be able to clarify what you are asking here? If you want to build a vocab we expose 2 factory methods:

  • vocab() - creates a vocab object from an ordered_dict
  • build_vocab_from_iterator() - creates a vocab object from an iterator -> likely what you need

If you want to see usage examples of how to build the vocab from a file, take a look at the example code snippet in the docstring.

The vocab only maps your tokens to integer indices, ordered by token frequency. During the vocab building step, you won't be passing in the float value from your dataset. Instead, you will be utilizing that value in your training step, once you've finished all the preprocessing steps (tokenizing, building the vocab, vectorization, etc.).
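To make the separation concrete, here is a minimal sketch in plain Python rather than the torchtext API. The toy rows, the `<unk>` entry, and the `encode` helper are assumptions made for illustration; the point is only that the vocab is built from the text column while the float column stays untouched.

```python
from collections import Counter

# Hypothetical toy rows in the (text, value) shape described in this thread.
rows = [
    ("a a a a b a", 0.2),
    ("a b a a b a", 0.24),
    ("b a a a b a", 0.512),
]

# Count tokens from the text column only; the float column is never used here.
counts = Counter(tok for text, _ in rows for tok in text.split())

# Index 0 is reserved for unknown tokens; more frequent tokens get smaller indices.
itos = ["<unk>"] + [tok for tok, _ in counts.most_common()]
stoi = {tok: i for i, tok in enumerate(itos)}


def encode(text):
    """Map a whitespace-separated string to a list of integer indices."""
    return [stoi.get(tok, stoi["<unk>"]) for tok in text.split()]


encoded = [encode(text) for text, _ in rows]  # model inputs (integers)
targets = [value for _, value in rows]        # labels stay plain floats
```

Because the targets never pass through the vocab, a value that does not appear in the training file is not a problem: the model output is a continuous number, not a lookup.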

Let me know if that helps answer your question!

Nayef211 avatar Dec 10 '21 16:12 Nayef211

Hi @Nayef211,

Thanks for the answer. I'm afraid I was not clear:

My data contains sentences in an unknown language (just "a a a b b ...") and a probability between 0 and 1 for each sentence. I want to train a model on this data.

Since the value is a probability, I thought it must be allowed to be any float, not just the probabilities that appear in the training set file.

Is it possible to use build_vocab_from_iterator() for any value between 0 and 1?

Thanks

pberko avatar Dec 11 '21 16:12 pberko

Hi @pberko, I am not sure why you want to add the sentence probability to the vocabulary. Could you please provide additional details on what exactly you are trying to build?

parmeet avatar Dec 13 '21 22:12 parmeet

Hello @parmeet, I have a machine that prints output made of a few characters, e.g. only a/b/c.

The output looks like "a b b b ...". Sometimes the machine ends with a failure, and I want to predict, when the output starts with "a b b", what the probability of failure is.

For that reason I want to train a model that predicts the probability of failure for a given input. Thanks

pberko avatar Dec 14 '21 07:12 pberko

Thanks @pberko for the additional details here. I think you might be mixing labels with your input. If I understand correctly, your data is a collection of token sequences ("a b b b ...") and associated failure probabilities (float values between 0 and 1). Is that right?

If so, as far as the vocabulary is concerned, you could simply add all the possible unique tokens that could occur in a sequence (a, b, c, etc.). The probability is essentially your label (something you want to predict), which need not be part of the vocabulary, right?

parmeet avatar Dec 16 '21 23:12 parmeet

Exactly, @parmeet. Do you have an example of how to train an LSTM for such a problem?

Thanks

pberko avatar Dec 17 '21 04:12 pberko
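
The thread does not include such an example. What follows is a minimal sketch of one way the setup discussed above could look, assuming PyTorch: the vocabulary holds only the tokens, and the probability is used as a float regression target. The class name `FailureRegressor`, the layer sizes, the learning rate, the choice of MSE loss, and the toy data are all hypothetical.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical vocabulary: only the tokens, never the probabilities.
stoi = {"<pad>": 0, "a": 1, "b": 2, "c": 3}

# Hypothetical toy data; all sequences have equal length, so no padding is needed.
texts = ["a a a a b a", "a b a a b a", "b a a a b a"]
labels = [0.2, 0.24, 0.512]

inputs = torch.tensor([[stoi[t] for t in s.split()] for s in texts])  # shape (3, 6)
targets = torch.tensor(labels)                                        # shape (3,), float


class FailureRegressor(nn.Module):
    """Embeds token indices, runs an LSTM, and outputs one value in (0, 1)."""

    def __init__(self, vocab_size, embed_dim=8, hidden_dim=16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        # h_n holds the final hidden state: (num_layers, batch, hidden_dim).
        _, (h_n, _) = self.lstm(self.embed(x))
        # Sigmoid keeps each prediction strictly between 0 and 1.
        return torch.sigmoid(self.head(h_n[-1])).squeeze(-1)


model = FailureRegressor(len(stoi))
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()  # regression loss against the float labels

losses = []
for _ in range(50):
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

with torch.no_grad():
    preds = model(inputs)  # one predicted probability per sequence
```

With real data the sequences would usually differ in length, so batches would need padding (index 0 is set aside for that here) and a held-out set for evaluation. To estimate the failure probability from a prefix such as "a b b", the same model can be called on the encoded prefix alone.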