
Problems with empty uid

Open bab2min opened this issue 2 years ago • 3 comments

@Jurgita-DS Aha, you created without uid param. I'll check it. Thank you!

I am having an issue when training an LDA model: I get 'uid' values of '' for all documents. I also don't see any option to provide document ids to the Corpus as you mention here. Is it possible to include user-defined document ids?

Originally posted by @MarkWClements in https://github.com/bab2min/tomotopy/issues/62#issuecomment-909785385

bab2min avatar Sep 03 '21 12:09 bab2min

@MarkWClements You can provide uid as an optional argument to Corpus.add_doc, as follows:

corpus = tp.utils.Corpus()
corpus.add_doc(some_words, uid="doc1")
corpus.add_doc(some_words, uid="doc2")
corpus.add_doc(some_words, uid="doc3")

I'll supplement the documentation about this.
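For a whole list of tokenized documents, the same call can be wrapped in a small helper. This is only a sketch: `add_docs_with_uids` and the `prefix` parameter are hypothetical names, not part of tomotopy, and the only thing assumed about `corpus` is that it accepts `add_doc(words, uid=...)` as in the snippet above.

```python
def add_docs_with_uids(corpus, docs, prefix="doc"):
    """Add each tokenized document to `corpus` with a generated uid.

    `corpus` only needs an add_doc(words, uid=...) method.
    Returns the generated uids in insertion order.
    """
    uids = []
    for i, words in enumerate(docs):
        # Hypothetical naming scheme: prefix + position in `docs`.
        uid = f"{prefix}{i}"
        corpus.add_doc(words, uid=uid)
        uids.append(uid)
    return uids
```

With `corpus = tp.utils.Corpus()`, calling `add_docs_with_uids(corpus, docs)` would label the documents "doc0", "doc1", and so on, and the returned list lets you look up which uid belongs to which entry of `docs`.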

bab2min avatar Sep 03 '21 12:09 bab2min

Is there a way to add a uid to the existing documents after the model has already been trained, or do I have to re-train the model with uids supplied? Also, do the documents keep the same order in which they are fed in here:

corpus = tp.utils.Corpus()
for doc in docs:
    corpus.add_doc(words=doc)

That is, when I call

trained_docs = lda.docs

is trained_docs[n] the same document as docs[n]? I can manually add labels later if this is the case, I just want to make sure the document order is preserved in training the model.

Thanks

MarkWClements-zz avatar Oct 19 '21 01:10 MarkWClements-zz

Hi @MarkWClements

  1. Currently, there is no way to modify a uid. I'll add this to the list of features for future development.

  2. Usually, trained_docs[n] is the same document as docs[n], except in a few cases where the corpus contains unsupported documents (e.g. documents with no words). You can check this by comparing their lengths: len(trained_docs) == len(docs). If len(trained_docs) differs from len(docs), some documents in docs failed to be added to the lda model and are missing from it.

In the current version, errors or warnings raised while inserting a corpus into a model are not clearly displayed, but I will improve this in a later patch.
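If the lengths do differ, the mapping back to the original list can be reconstructed by hand. The sketch below assumes that the only documents dropped are those with no words and that the remaining ones keep their order; both are assumptions based on the case described above, and other reasons for a missing document are not handled. The function names are hypothetical and not part of tomotopy.

```python
def surviving_indices(docs):
    """Indices of the documents in `docs` that contain at least one word."""
    return [i for i, words in enumerate(docs) if len(words) > 0]


def map_trained_to_original(docs, n_trained):
    """Return a list `kept` where kept[n] is the index in `docs` that
    trained_docs[n] is assumed to correspond to.

    Raises ValueError when the number of non-empty documents does not
    match `n_trained`, since the empty-document assumption then fails.
    """
    kept = surviving_indices(docs)
    if len(kept) != n_trained:
        raise ValueError(
            f"expected {n_trained} non-empty documents, found {len(kept)}"
        )
    return kept
```

For example, `kept = map_trained_to_original(docs, len(lda.docs))` would give `docs[kept[n]]` as the original input for `lda.docs[n]`, which is enough to attach labels manually after training.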

bab2min avatar Oct 20 '21 15:10 bab2min