pytorch-crf
Tagging a sequence of sentences
Is it possible to use a sentence as the unit of labeling (tagging a sequence of sentences as opposed to words)? Would the code consider the embedding of each word and construct a sentence embedding?
Hi @preetraga, though I haven't tested that, I'm pretty sure that would work without any changes to the code. You could just format the data file so that the first tab-separated item is a sentence instead of a token.
@epwalsh thanks! I did try that with the sample data in the test folder. Although the code runs without errors, it seems that anything longer than a single word gets mapped to "unk" in the embedding lookup in vocab.py. I tested this with "hi" and "hello" as tokens on separate lines: they get mapped to 1843 and 6328 in the word_idx lookup. However, if "hi hello" is the input sentence, it gets mapped to the "unk" id. So I am assuming that to process a sentence, one would have to read each word's embedding in the sentence and combine them in some manner before creating the tensor?
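For reference, a minimal sketch of the combining step described above: look up each token separately (falling back to the unk id for out-of-vocabulary strings like the whole sentence) and average the word vectors into a sentence vector. The names `word_idx` and `embeddings` here are hypothetical stand-ins for the repo's vocab lookup and pretrained embedding matrix, with toy values:

```python
UNK = 0  # hypothetical id reserved for unknown tokens

word_idx = {"hi": 1, "hello": 2}  # toy vocab standing in for vocab.py's lookup
embeddings = {
    0: [0.0, 0.0],  # unk
    1: [1.0, 0.0],  # "hi"
    2: [0.0, 1.0],  # "hello"
}

def sentence_embedding(sentence):
    """Look up each token individually and average the word vectors,
    instead of looking up the whole sentence string (which is not in
    the vocab and would map to unk)."""
    ids = [word_idx.get(tok, UNK) for tok in sentence.split()]
    vecs = [embeddings[i] for i in ids]
    dim = len(vecs[0])
    return [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]

print(sentence_embedding("hi hello"))  # averages the "hi" and "hello" vectors
```

Averaging is just one choice; summing, max-pooling, or an encoder over the word vectors would fit the same pattern.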
@preetraga I think you're right, at least if you're using GloVe (or any other) pretrained embeddings.
On a side note, I'm not maintaining this repository anymore in favor of using AllenNLP instead. AllenNLP has a great CRF module, which is easy to use out-of-the-box. It also has various "seq2vec" encoders which you could use to create a representation of a sentence.