tomotopy
Problems with empty uid
@Jurgita-DS Aha, you created it without the uid param. I'll check it. Thank you!
I am having an issue when training an LDA model: I get 'uid' values of '' for all documents. I also don't see any option to provide document ids to the Corpus as you mention here. Is there a way to include user-defined document ids?
Originally posted by @MarkWClements in https://github.com/bab2min/tomotopy/issues/62#issuecomment-909785385
@MarkWClements
You can provide uid as an optional argument to Corpus.add_doc, as follows:
import tomotopy as tp

# some_words is a placeholder for one document's list of tokens
corpus = tp.utils.Corpus()
corpus.add_doc(some_words, uid="doc1")
corpus.add_doc(some_words, uid="doc2")
corpus.add_doc(some_words, uid="doc3")
I'll supplement the documentation about this.
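When there are many documents, the uid values can be generated in a loop. The following is only a minimal sketch: the helper name add_docs_with_uids and the prefix parameter are hypothetical and not part of tomotopy. It assumes corpus is a tp.utils.Corpus, or any object whose add_doc accepts the words and uid keyword arguments used in this thread.

```python
def add_docs_with_uids(corpus, docs, prefix="doc"):
    """Add each tokenized document to `corpus` with a generated uid.

    Hypothetical helper: `corpus` is assumed to be a tomotopy
    `tp.utils.Corpus` (or anything with a compatible
    `add_doc(words=..., uid=...)` method).
    Returns the generated uids in insertion order.
    """
    uids = []
    for i, words in enumerate(docs):
        uid = f"{prefix}{i}"  # e.g. "doc0", "doc1", ...
        corpus.add_doc(words=words, uid=uid)
        uids.append(uid)
    return uids
```

With tomotopy installed, this could be called as add_docs_with_uids(tp.utils.Corpus(), docs), where docs is a list of token lists. Keeping the returned list makes it possible to look up which uid was assigned to which input document.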
Is there a way to add a uid to the existing documents in a model after it has been trained, or do I have to re-train the model with this feature? Also, do the documents persist in the same order in which they are fed into this:
corpus = tp.utils.Corpus()
for doc in docs:
    corpus.add_doc(words=doc)
That is, when I call trained_docs = lda.docs, is trained_docs[n] the same document as docs[n]? I can manually add labels later if this is the case; I just want to make sure the document order is preserved in training the model.
Thanks
Hi @MarkWClements
- Currently, there is no feature for modifying uid. I'll add it to the list of future development features.
- Usually, trained_docs[n] is the same document as docs[n], except in a few cases where corpus has unsupported documents (e.g. documents with no words). You can check this by comparing their lengths: len(trained_docs) == len(docs). If len(trained_docs) is different from len(docs), it means there were some errors in pushing the documents of docs into the lda model and some of them are missing.
In the current version, errors or warnings related to inserting a corpus into models are not clearly displayed, but I will improve this in a later patch.
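Until then, one way to keep manually added labels aligned is to filter out empty documents before building the corpus. This is a rough sketch rather than a tomotopy feature: the helper name drop_empty_docs is hypothetical, and it assumes that documents with no words are the only ones the model skips.

```python
def drop_empty_docs(docs, labels):
    """Return (kept_docs, kept_labels) with empty documents removed.

    Hypothetical helper. Assumption: a document with no words is the
    only kind the model would skip, so removing those up front keeps
    the remaining documents and labels in matching order.
    """
    if len(docs) != len(labels):
        raise ValueError("docs and labels must have the same length")
    kept_docs, kept_labels = [], []
    for words, label in zip(docs, labels):
        if len(words) == 0:  # would not make it into the model
            continue
        kept_docs.append(words)
        kept_labels.append(label)
    return kept_docs, kept_labels
```

After adding kept_docs to the corpus and training, len(lda.docs) == len(kept_labels) is the same length check described above. If it still fails, some other kind of document was dropped and the labels should not be trusted by position.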