PartiallyCollapsedLDA
State files, output format
Is it possible to get PCLDA to write its state output in the same format as the state files produced by Mallet, i.e. with the columns: doc source pos typeindex type topic?
The same state information can be recreated from the z_
I'll have a look. But I can't promise a quick turn around time I'm afraid... Life keeps getting in the way nowadays! :)
I now have a 9.2.0 release that (hopefully) supports this. I would be glad if you could test it and verify that it works as expected.
Thank you very much! It works as expected, with one small exception. With Mallet, I get a column named source as the second column. The contents of that column are not really important, but its absence may shift what a parser expects to find in the remaining columns if they are parsed by column order.
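To illustrate the column-order concern, here is a minimal Java sketch of a reader that looks fields up by the names in the header line rather than by fixed position, so it tolerates both the five-column and the six-column layout. The class `StateRowParser` and its methods are hypothetical names made up for this example; they are not part of MALLET or PCLDA, and the sketch assumes space-separated rows with a single `#`-prefixed header line.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of MALLET or PCLDA.
final class StateRowParser {

    /** Maps each column name in a header such as "#doc pos typeindex type topic" to its index. */
    static Map<String, Integer> columnIndex(String headerLine) {
        // Drop the leading '#' that marks the header line.
        String header = headerLine.startsWith("#") ? headerLine.substring(1) : headerLine;
        String[] names = header.trim().split("\\s+");
        Map<String, Integer> index = new HashMap<>();
        for (int i = 0; i < names.length; i++) {
            index.put(names[i], i);
        }
        return index;
    }

    /** Returns one field of a space-separated data row, selected by column name. */
    static String field(Map<String, Integer> index, String row, String column) {
        Integer i = index.get(column);
        if (i == null) {
            throw new IllegalArgumentException("No such column: " + column);
        }
        return row.split(" ")[i];
    }

    public static void main(String[] args) {
        // Made-up rows, one per layout; both lookups resolve to the topic column.
        Map<String, Integer> withoutSource = columnIndex("#doc pos typeindex type topic");
        Map<String, Integer> withSource = columnIndex("#doc source pos typeindex type topic");
        System.out.println(field(withoutSource, "0 3 17 model 4", "topic"));
        System.out.println(field(withSource, "0 NA 3 17 model 4", "topic"));
    }
}
```

A parser that indexes columns by position instead would read the type string where it expects the topic as soon as the source column is missing, which is the failure mode described above.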
Ok, thanks for that feedback. I must have been looking at some old spec. I'll have a look.
Hi, sorry for the delay on this. I have now checked the MALLET code and as far as I can see, I'm using the same format. Here is the relevant MALLET code:
    public void printState (PrintWriter pw)
    {
        Alphabet a = ilist.getDataAlphabet();
        pw.println ("#doc pos typeindex type topic");
        for (int di = 0; di < topics.length; di++) {
            FeatureSequence fs = (FeatureSequence) ilist.get(di).getData();
            for (int si = 0; si < topics[di].length; si++) {
                int type = fs.getIndexAtPosition(si);
                pw.print(di); pw.print(' ');
                pw.print(si); pw.print(' ');
                pw.print(type); pw.print(' ');
                pw.print(a.lookupObject(type)); pw.print(' ');
                pw.print(topics[di][si]); pw.println();
            }
        }
    }
Hmm, so I was tricked by the different implementations of printState in the different samplers: ParallelTopicModel implements printState in another way, which does indeed include the source...
Will use the version with source here also.
It seems that in MALLET almost always the source will be NA... What is the expected value of source?
Have a 9.2.1 version that adds source, but it will basically always be NA. If this field is used I can add the proper info there, but I'm a bit unclear what it is expected to contain.
Just having the "NA" will be very helpful. (It seems like this field could perhaps be used to preserve some extra information about the input. It is commented as /* The input in a reproducable form, e.g. enabling re-print of string w/ POS tags, usually without target information, e.g. an un-annotated RegionList. */)
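For reference, the six-column layout discussed here, with source fixed to NA, can be sketched in plain Java as follows. This is only an illustration under stated assumptions, not the MALLET or PCLDA implementation: the class `StateWriterSketch`, its methods, and the array-based inputs (`tokens` holding type indices per document, `vocabulary` mapping type index to string) are hypothetical stand-ins for the real data structures.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

// Hypothetical sketch, not the MALLET or PCLDA code.
final class StateWriterSketch {

    /** Writes one row per token: doc, source (always "NA"), pos, typeindex, type, topic. */
    static void writeState(PrintWriter pw, int[][] tokens, int[][] topics, String[] vocabulary) {
        pw.println("#doc source pos typeindex type topic");
        for (int doc = 0; doc < topics.length; doc++) {
            for (int pos = 0; pos < topics[doc].length; pos++) {
                int type = tokens[doc][pos];
                pw.println(doc + " NA " + pos + " " + type + " "
                        + vocabulary[type] + " " + topics[doc][pos]);
            }
        }
    }

    /** Convenience wrapper that returns the state as a string. */
    static String render(int[][] tokens, int[][] topics, String[] vocabulary) {
        StringWriter sw = new StringWriter();
        PrintWriter pw = new PrintWriter(sw);
        writeState(pw, tokens, topics, vocabulary);
        pw.flush();
        return sw.toString();
    }

    public static void main(String[] args) {
        // Made-up toy corpus: two documents over a two-word vocabulary.
        String[] vocabulary = {"lda", "topic"};
        int[][] tokens = {{1, 0}, {1}};
        int[][] topics = {{2, 0}, {1}};
        System.out.print(render(tokens, topics, vocabulary));
    }
}
```

Keeping the source column in place, even as a constant, means the remaining columns land at the positions that position-based readers of the six-column format expect.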
Hi! Måns here. How do we change the config file to get the state file from a PCLDA run?
If you mean the topic indicators (Z)… PartiallyCollapsedLDA/Configuration-README.md at master · lejon/PartiallyCollapsedLDA (github.com)
Cheers,
-Leif
Hi! No, I mean the state files (the ones you discuss above).