pytorch-crf

Tagging a sequence of sentences

preetraga opened this issue · 4 comments

Is it possible to use a sentence as the unit of labeling (tagging a sequence of sentences as opposed to words)? Would the code consider the embedding of each word and construct a sentence embedding?

preetraga avatar Jan 02 '19 20:01 preetraga

Hi @preetraga, though I haven't tested that, I'm pretty sure that would work without any changes to the code. You could just format the data file so that the first tab-separated item is a sentence instead of a token.
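A hypothetical data file in that format might look like this (sentences and tags are made up; the only point is that the first tab-separated column holds a whole sentence rather than a single token):

```
I really enjoyed this movie .	POSITIVE
The plot made no sense at all .	NEGATIVE
```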

epwalsh avatar Jan 03 '19 18:01 epwalsh

@epwalsh thanks! I did try that with the sample data in the test folder. Although the code runs without errors, it seems to map anything longer than a single word to "unk" in the embedding lookup in vocab.py. I tested this with hi and hello as tokens on separate lines, which get mapped to 1843 and 6328 in the word_idx lookup. However, if "hi hello" is the input sentence, it gets mapped to the "unk" id. So I am assuming that to process a sentence, one would have to read each word's embedding and combine them in some manner before creating the tensor?
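A minimal sketch of the behavior described above, assuming a token-level vocabulary like the one in vocab.py (the ids and embedding values here are made up for illustration): a multi-word string is treated as one unknown "token", so the sentence must be split and its word vectors combined, e.g. by mean pooling.

```python
# A single-token vocabulary maps any multi-word string to the <unk> id.
# Ids mirror the ones mentioned above; embeddings are fabricated.
word_idx = {"<unk>": 0, "hi": 1843, "hello": 6328}
embeddings = {0: [0.0, 0.0], 1843: [1.0, 3.0], 6328: [5.0, 1.0]}

def lookup(item):
    """Look up a string as one vocabulary entry, falling back to <unk>."""
    return word_idx.get(item, word_idx["<unk>"])

def sentence_vector(sentence):
    """Split into words, embed each, and mean-pool into one vector."""
    ids = [lookup(tok) for tok in sentence.split()]
    vecs = [embeddings[i] for i in ids]
    dim = len(vecs[0])
    return [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]

print(lookup("hi"))                  # 1843
print(lookup("hi hello"))            # 0 -- whole string is "unknown"
print(sentence_vector("hi hello"))   # [3.0, 2.0] -- mean of the two vectors
```

Mean pooling is only the simplest way to combine word vectors; any sequence encoder could replace it.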

preetraga avatar Jan 03 '19 18:01 preetraga

@preetraga I think you're right, at least if you're using the GloVe (or any other) pretrained word embeddings.

epwalsh avatar Jan 03 '19 20:01 epwalsh

On a side note, I'm no longer maintaining this repository in favor of AllenNLP. AllenNLP has a great CRF module which is easy to use out of the box. It also has various "seq2vec" encoders which you could use to create a representation of a sentence.
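Conceptually, a "seq2vec" encoder is anything that maps a sequence of token vectors to one fixed-size vector. The toy recurrence below is a pedagogical stand-in for the LSTM/CNN/bag-of-embeddings encoders AllenNLP provides, not AllenNLP code:

```python
import math

def toy_seq2vec(token_vectors):
    """Fold a sequence of equal-length token vectors into one vector
    by running the simplest possible recurrence: state = tanh(state + x)."""
    state = [0.0] * len(token_vectors[0])
    for vec in token_vectors:
        state = [math.tanh(s + x) for s, x in zip(state, vec)]
    return state  # final state serves as the sentence representation

sentence = [[1.0, 0.0], [0.0, 1.0]]  # two 2-d token embeddings
print(toy_seq2vec(sentence))         # one 2-d sentence embedding
```

In AllenNLP the same idea is packaged behind a common `Seq2VecEncoder` interface, so you can swap pooling strategies without touching the rest of the model.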

epwalsh avatar Jan 03 '19 20:01 epwalsh