PartiallyCollapsedLDA

State files, output format

Open rebeckahw opened this issue 2 years ago • 12 comments

Is it possible to get state output from PCLDA formatted like the state files produced by MALLET, i.e. with the columns: doc source pos typeindex type topic?

The same state information can be recreated from the z_.csv files in combination with the corpus and vocabulary files, but it would be a nice-to-have.

rebeckahw, Oct 07 '22 13:10

I'll have a look. But I can't promise a quick turnaround time, I'm afraid... Life keeps getting in the way nowadays! :)

lejon, Oct 15 '22 09:10

I now have a 9.2.0 release that (hopefully) supports this. I would be glad if you could test it and verify that it works as expected.

lejon, Oct 16 '22 13:10

Thank you very much! It works as expected, with one small exception. With MALLET, I get a column named source as the second column. The contents of that column are not really important, but its absence may shift what a parser expects to find in the remaining columns if they are read by column order.
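For anyone reading these files in the meantime, here is a minimal sketch of looking columns up by the names in the `#doc ...` header instead of by position, so the same reader works with or without the source column. `StateFileParser` and its methods are hypothetical names for illustration, not part of MALLET or PCLDA, and the sketch assumes whitespace-separated fields with no spaces inside the source or type values.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration; not part of MALLET or PCLDA. */
final class StateFileParser {

    /** Maps each column name in the "#doc ..." header line to its position. */
    static Map<String, Integer> columnIndex(String headerLine) {
        // Drop the leading '#', then split the names on whitespace.
        String[] names = headerLine.substring(1).trim().split("\\s+");
        Map<String, Integer> index = new HashMap<>();
        for (int i = 0; i < names.length; i++) {
            index.put(names[i], i);
        }
        return index;
    }

    /** Returns the value of the named column, or null if the file lacks it. */
    static String field(Map<String, Integer> index, String[] row, String name) {
        Integer i = index.get(name);
        return (i == null || i >= row.length) ? null : row[i];
    }

    /** Collects the topic assignment of every token row, in file order. */
    static List<Integer> topics(List<String> lines) {
        Map<String, Integer> index = null;
        List<Integer> result = new ArrayList<>();
        for (String line : lines) {
            if (line.startsWith("#doc")) {
                index = columnIndex(line);
                continue;
            }
            // Skip blank lines and any other '#' comment lines.
            if (line.startsWith("#") || line.trim().isEmpty()) {
                continue;
            }
            if (index == null) {
                throw new IllegalStateException("No '#doc ...' header seen before data rows");
            }
            String[] row = line.trim().split("\\s+");
            result.add(Integer.parseInt(field(index, row, "topic")));
        }
        return result;
    }
}
```

With this approach, a file with the header `#doc pos typeindex type topic` and one with `#doc source pos typeindex type topic` yield the same topic list for the same tokens.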

rebeckahw, Oct 19 '22 09:10

Ok, thanks for that feedback. I must have been looking at some old spec. I'll have a look.

lejon, Oct 19 '22 14:10

Hi, sorry for the delay on this. I have now checked the MALLET code and as far as I can see, I'm using the same format. Here is the relevant MALLET code:

```java
public void printState (PrintWriter pw)
{
    Alphabet a = ilist.getDataAlphabet();
    pw.println ("#doc pos typeindex type topic");
    for (int di = 0; di < topics.length; di++) {
        FeatureSequence fs = (FeatureSequence) ilist.get(di).getData();
        for (int si = 0; si < topics[di].length; si++) {
            int type = fs.getIndexAtPosition(si);
            pw.print(di); pw.print(' ');
            pw.print(si); pw.print(' ');
            pw.print(type); pw.print(' ');
            pw.print(a.lookupObject(type)); pw.print(' ');
            pw.print(topics[di][si]); pw.println();
        }
    }
}
```

lejon, Nov 06 '22 11:11

Hmm, so I was tricked by the different implementations of printState in different samplers: ParallelTopicModel implements printState in another way, which does indeed include the source...

Will use the version with source here also.

lejon, Nov 06 '22 11:11

It seems that in MALLET the source will almost always be NA... What is the expected value of source?

lejon, Nov 06 '22 11:11

Have a 9.2.1 version that adds source, but it will basically always be NA. If this field is used I can add the proper info there, but I'm a bit unclear what it is expected to contain.
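For readers on an older release, the six-column layout can also be rebuilt from the topic indicators, corpus and vocabulary, as mentioned at the top of the thread. The following is only a sketch under stated assumptions: `StateWriterSketch` and its parameters are hypothetical, the inputs are assumed to be already loaded into arrays (parsing the z_.csv, corpus and vocabulary files is not shown), and it is not the code used in the 9.2.1 release.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

/** Hypothetical sketch for illustration; not the PCLDA or MALLET implementation. */
final class StateWriterSketch {

    static final String HEADER = "#doc source pos typeindex type topic";

    /**
     * Writes one row per token.
     *
     * typeIndices[d][p] is the vocabulary index of token p in document d,
     * topics[d][p] is its topic indicator, and sources may be null
     * (or contain nulls), in which case "NA" is written.
     */
    static void printState(PrintWriter pw, int[][] typeIndices, int[][] topics,
                           String[] vocabulary, String[] sources) {
        pw.println(HEADER);
        for (int doc = 0; doc < topics.length; doc++) {
            // Fall back to "NA" when no source is known for this document.
            String source = (sources == null || sources[doc] == null) ? "NA" : sources[doc];
            for (int pos = 0; pos < topics[doc].length; pos++) {
                int type = typeIndices[doc][pos];
                pw.println(doc + " " + source + " " + pos + " " + type + " "
                        + vocabulary[type] + " " + topics[doc][pos]);
            }
        }
    }

    /** Convenience wrapper that returns the state output as a string. */
    static String render(int[][] typeIndices, int[][] topics,
                         String[] vocabulary, String[] sources) {
        StringWriter sw = new StringWriter();
        PrintWriter pw = new PrintWriter(sw);
        printState(pw, typeIndices, topics, vocabulary, sources);
        pw.flush();
        return sw.toString();
    }
}
```

Passing `null` for `sources` gives `NA` in every row, which matches the behaviour described above.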

lejon, Nov 06 '22 12:11

Just having the "NA" will be very helpful. (It seems like this field could perhaps be used to preserve some extra information about the input. It is commented as /* The input in a reproducable form, e.g. enabling re-print of string w/ POS tags, usually without target information, e.g. an un-annotated RegionList. */)

rebeckahw, Nov 08 '22 07:11

Hi! Måns here. How do we change the config file to get the state file from a PCLDA run?

liamtabib, Jun 08 '23 14:06

If you mean the topic indicators (Z)… PartiallyCollapsedLDA/Configuration-README.md at master · lejon/PartiallyCollapsedLDA (github.com)

Cheers,
-Leif

lejon, Jun 08 '23 14:06

Hi! No, I mean the state files (the ones you discuss above).

MansMeg, Jun 08 '23 15:06