PartiallyCollapsedLDA

doc-topic distr.

Open mhbodell opened this issue 3 years ago • 5 comments

The output saved by "save_doc_theta_estimate = true" has the wrong dimensions, and it shows counts rather than proportions.

This is what the README.txt file says:

Save the a file with document topic theta estimates (will not include zeros)
Unlike Phi means which are sampled with thinning, theta means is just a simple
average of the topic counts in the last iteration divided by the number of
tokens in the document thus there is not theta_burnin or theta_thinning
save_doc_theta_estimate = true
doc_topic_theta_filename = doc_topic_theta.csv
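To make the expectation concrete, here is a minimal sketch of what I read the README as describing: per-document topic counts divided by the document length, giving proportions that sum to 1. The class and method names are hypothetical and the counts are made up; this is not the project's actual code.

```java
import java.util.Arrays;

class ThetaEstimateSketch {
    // Hypothetical helper: turns one document's topic counts into proportions
    // by dividing each count by the number of tokens in the document.
    static double[] thetaFromCounts(int[] topicCounts) {
        int docLength = Arrays.stream(topicCounts).sum();
        double[] theta = new double[topicCounts.length];
        if (docLength == 0) {
            return theta; // empty document: leave all proportions at zero
        }
        for (int k = 0; k < topicCounts.length; k++) {
            theta[k] = (double) topicCounts[k] / docLength;
        }
        return theta;
    }

    public static void main(String[] args) {
        int[] counts = {3, 0, 1, 4}; // made-up counts for a 4-topic model
        // prints [0.375, 0.0, 0.125, 0.5]
        System.out.println(Arrays.toString(thetaFromCounts(counts)));
    }
}
```

With K topics, each row of the saved file would then be expected to hold at most K proportions, not raw counts.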

I have a model with 200 topics, but the doc_theta_means file has 400 columns, with one row per document. Why is the number of columns twice the number of topics in the model?

Config file:

configs = Spalias
no_runs = 1

[Spalias]
title = PCPLDA
description = 200 topics with alpha 0.2 and extended priorlist
dataset = data/fb_politics_news.txt
scheme = spalias_priors
seed = 1904
topics = 200
alpha = 0.2
beta = 0.01
iterations = 1500
rare_threshold = 0
batches = 4
topic_batches = 4
topic_interval = 500
start_diagnostic = 200
debug = 0
#log_type_topic_density = true
log_document_density = true
log_phi_density = true
phi_mean_filename = phi-mean.csv
phi_mean_burnin = 20
phi_mean_thin = 5
stoplist = nsc-test/PartiallyCollapsedLDA-8.4.0/stoplist-empty.txt
save_vocabulary = true
vocabulary_filename = lda_vocab.txt
topic_prior_filename = wfw/bash/priors/k200_v7.txt
keep_connecting_punctuation = true
log_topic_indicators = true
save_sampler = false
save_doc_theta_estimate = true
doc_topic_theta_filename = doc_topic_theta.csv
save_phi_mean = true

I am attaching a picture of part of the output so you can see what it looks like.

[Image: Screen Shot 2021-04-20 at 10 06 37]

mhbodell • Apr 20 '21 08:04

The problem seems to stem from WriteASCIIDoubleMatrix. Decimal numbers are written with commas both as decimal separators and column separators. This adds an extra column for each printed value and every other column gets the value 0.
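A small sketch of the effect, in case it helps reproduce it. The joinRow helper below is a hypothetical stand-in, not the real WriteASCIIDoubleMatrix, and the row values are made up. It assumes a JDK that ships Swedish locale data, where the decimal separator is a comma.

```java
import java.util.Locale;

class SeparatorCollisionSketch {
    // Hypothetical stand-in for a CSV row writer: formats each value with the
    // given locale and joins the values with the given column separator.
    static String joinRow(double[] row, Locale locale, String sep) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < row.length; i++) {
            if (i > 0) {
                sb.append(sep);
            }
            sb.append(String.format(locale, "%.4f", row[i]));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        double[] row = {3.0, 0.0, 1.0}; // three made-up topic counts
        String line = joinRow(row, Locale.forLanguageTag("sv-SE"), ",");
        System.out.println(line);                   // 3,0000,0,0000,1,0000
        System.out.println(line.split(",").length); // 6 fields instead of 3
    }
}
```

A CSV reader sees six fields for three values, and every second field is the fractional part "0000", which matches the doubled column count and the alternating zeros. With Locale.ROOT the same row comes out as 3.0000,0.0000,1.0000 and splits into three fields.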

rebeckahw • Oct 07 '22 13:10

Yes, I noticed this bug too. I have a fix in 9.2.0 for parts of the problem, but I will have to double-check whether that fix also solves this case.

lejon • Oct 16 '22 14:10

9.2.0 should solve this problem.

lejon • Oct 16 '22 14:10

The test for WriteASCIIDoubleMatrix now passes, but unfortunately the problem remains for me. It may be caused by the method formatDouble in LDAUtils.java:

		String formatString = "%." + noDigits + "f";
		return String.format(formatString, d);

since String.format() without an explicit Locale argument uses the JVM's default locale (which for me is SE).
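One possible workaround, sketched under the assumption that the output should always use a dot as decimal separator: pass an explicit Locale to String.format. The class name is hypothetical, and this is not necessarily how the project will fix it, since the separator is also configurable.

```java
import java.util.Locale;

class LocaleSafeFormatSketch {
    // Same shape as the formatDouble snippet above, but with the locale pinned
    // to Locale.ROOT so the result does not depend on the JVM default locale.
    static String formatDouble(double d, int noDigits) {
        String formatString = "%." + noDigits + "f";
        return String.format(Locale.ROOT, formatString, d);
    }

    public static void main(String[] args) {
        // Simulate a machine whose default locale is Swedish.
        Locale.setDefault(Locale.forLanguageTag("sv-SE"));
        System.out.println(formatDouble(0.125, 4)); // prints 0.1250
    }
}
```

Pinning the locale keeps the decimal separator fixed, so only the column separator varies between configurations.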

rebeckahw • Oct 19 '22 09:10

Yes, it is due to the locale, and unfortunately it is a bit of a mess right now: the combination of Locale and the option to select the separator makes it complicated. I'll have a look and see if I can redesign it into a better solution.

lejon • Oct 19 '22 14:10