PartiallyCollapsedLDA
doc-topic distribution
The output saved by "save_doc_theta_estimate = true" has the wrong dimensions, and the output also shows counts rather than proportions.
This is what the README.txt file says:
Save the a file with document topic theta estimates (will not include zeros)
Unlike Phi means which are sampled with thinning, theta means is just a simple
average of the topic counts in the last iteration divided by the number of
tokens in the document thus there is not theta_burnin or theta_thinning
save_doc_theta_estimate = true
doc_topic_theta_filename = doc_topic_theta.csv
I have a model with 200 topics, but the doc_theta_means file has 400 columns and one row per document. Why is the number of columns twice the number of topics in the model?
Config file:
configs = Spalias
no_runs = 1

[Spalias]
title = PCPLDA
description = 200 topics with alpha 0.2 and extended priorlist
dataset = data/fb_politics_news.txt
scheme = spalias_priors
seed = 1904
topics = 200
alpha = 0.2
beta = 0.01
iterations = 1500
rare_threshold = 0
batches = 4
topic_batches = 4
topic_interval = 500
start_diagnostic = 200
debug = 0
#log_type_topic_density = true
log_document_density = true
log_phi_density = true
phi_mean_filename = phi-mean.csv
phi_mean_burnin = 20
phi_mean_thin = 5
stoplist = nsc-test/PartiallyCollapsedLDA-8.4.0/stoplist-empty.txt
save_vocabulary = true
vocabulary_filename = lda_vocab.txt
topic_prior_filename = wfw/bash/priors/k200_v7.txt
keep_connecting_punctuation = true
log_topic_indicators = true
save_sampler = false
save_doc_theta_estimate = true
doc_topic_theta_filename = doc_topic_theta.csv
save_phi_mean = true
I am attaching an image of part of the output so you can see what it looks like.

The problem seems to stem from WriteASCIIDoubleMatrix. Decimal numbers are written with commas both as decimal separators and column separators. This adds an extra column for each printed value and every other column gets the value 0.
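To make the column doubling concrete, here is a minimal sketch that reproduces the effect outside the project. It is not the actual WriteASCIIDoubleMatrix code; the class CommaCollisionDemo and the method writeRow are hypothetical names used only for illustration, and the four-digit precision is an assumption.

```java
import java.util.Locale;

// Hypothetical demo class, not part of PartiallyCollapsedLDA.
final class CommaCollisionDemo {

    // Formats each value with the given locale and joins the values with the separator.
    static String writeRow(double[] row, Locale locale, String separator) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < row.length; i++) {
            if (i > 0) {
                sb.append(separator);
            }
            sb.append(String.format(locale, "%.4f", row[i]));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        double[] theta = {0.25, 0.75};

        // A Swedish locale uses a comma as the decimal separator.
        String collided = writeRow(theta, Locale.forLanguageTag("sv-SE"), ",");
        // Locale.ROOT uses a period, so it cannot collide with a comma column separator.
        String clean = writeRow(theta, Locale.ROOT, ",");

        System.out.println(collided + " -> " + collided.split(",").length + " fields");
        System.out.println(clean + " -> " + clean.split(",").length + " fields");
    }
}
```

A CSV reader that splits on commas sees two fields per value in the first row: the integer part (0 for any proportion below 1) followed by the fractional digits. That matches the doubled column count and the zeros in every other column.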
Yes, I noticed this bug too and have a fix for parts of the problem in 9.2.0, but I will have to double-check whether that fix also solves this.
9.2.0 should solve this problem
The test for WriteASCIIDoubleMatrix now passes, but the problem unfortunately remains for me. It may be caused by the method formatDouble in LDAUtils.java:
String formatString = "%." + noDigits + "f";
return String.format(formatString, d);
since String.format() depends on the default locale (which for me is SE).
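The locale dependence can be checked in isolation with a sketch like the one below. The first method mirrors the two lines quoted above. The second passes an explicit locale to String.format, which is one possible way to pin the decimal separator and not necessarily the fix the project will adopt. The class FormatDoubleSketch and both method names are hypothetical, and Locale.ROOT is an assumed choice.

```java
import java.util.Locale;

// Hypothetical class for illustration; only the two quoted lines come from LDAUtils.java.
final class FormatDoubleSketch {

    // Same logic as the quoted snippet: the result follows Locale.getDefault().
    static String formatDoubleDefault(double d, int noDigits) {
        String formatString = "%." + noDigits + "f";
        return String.format(formatString, d);
    }

    // Variant with an explicit locale, so the decimal separator is always a period.
    static String formatDoubleFixed(double d, int noDigits) {
        String formatString = "%." + noDigits + "f";
        return String.format(Locale.ROOT, formatString, d);
    }

    public static void main(String[] args) {
        Locale saved = Locale.getDefault();
        try {
            // Simulate a machine whose default locale is Swedish.
            Locale.setDefault(Locale.forLanguageTag("sv-SE"));
            System.out.println("default locale: " + formatDoubleDefault(0.125, 4));
            System.out.println("explicit ROOT:  " + formatDoubleFixed(0.125, 4));
        } finally {
            // Restore the original default so other code is unaffected.
            Locale.setDefault(saved);
        }
    }
}
```

This would also explain why a unit test can pass on one machine and the bug still appear on another: the outcome depends on the default locale of the JVM running the code.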
Yes, it is due to the locale, and unfortunately it is a bit of a mess right now: the combination of Locale and the option to select a separator makes it complicated. I'll have a look and see if I can redesign it into a better solution.
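One possible direction for such a redesign, sketched here purely as an illustration, is to make the decimal separator and the column separator explicit parameters and to reject combinations that collide. Nothing below is taken from the project: MatrixRowWriterSketch, its constructor parameters, and formatRow are hypothetical names. Note also that DecimalFormat rounds half-even by default, which can differ from String.format in the last digit.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

// Hypothetical sketch of a writer that never consults the default locale.
final class MatrixRowWriterSketch {
    private final DecimalFormat format;
    private final String columnSeparator;

    MatrixRowWriterSketch(int noDigits, char decimalSeparator, String columnSeparator) {
        // Fail fast instead of silently producing ambiguous output.
        if (columnSeparator.indexOf(decimalSeparator) >= 0) {
            throw new IllegalArgumentException(
                "Column separator must not contain the decimal separator");
        }
        // Start from locale-neutral symbols, then set the requested decimal separator.
        DecimalFormatSymbols symbols = new DecimalFormatSymbols(Locale.ROOT);
        symbols.setDecimalSeparator(decimalSeparator);
        this.format = new DecimalFormat("0", symbols);
        this.format.setMinimumFractionDigits(noDigits);
        this.format.setMaximumFractionDigits(noDigits);
        this.format.setGroupingUsed(false);
        this.columnSeparator = columnSeparator;
    }

    // Formats one matrix row as a single delimited line.
    String formatRow(double[] row) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < row.length; i++) {
            if (i > 0) {
                sb.append(columnSeparator);
            }
            sb.append(format.format(row[i]));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        double[] theta = {0.25, 0.75};
        System.out.println(new MatrixRowWriterSketch(4, '.', ",").formatRow(theta));
        System.out.println(new MatrixRowWriterSketch(4, ',', ";").formatRow(theta));
    }
}
```

With this shape, a user who wants comma decimals has to pick a different column separator such as a semicolon, and the conflicting combination raises an error at construction time instead of corrupting the file.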