
Sample-lag Specification for Gibbs Sampling

Open · MarkWClements-zz opened this issue 3 years ago · 1 comment

Hello,

Thank you for putting together this awesome package. I have one question about how the Gibbs sampling is done under the hood. In other applications of Gibbs sampling there is an option to specify a sample-lag. Averaging over samples to estimate the target distribution requires i.i.d. samples, but each sample depends on the previous one (the Markov property). To reduce autocorrelation, we should discard all but every L-th sample. Is this feature baked into the train() method? I see no way to specify it in the code.
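To make the pattern I mean concrete, here is a tiny self-contained thinning sketch (a toy Gibbs sampler for a bivariate normal, completely unrelated to tomotopy's internals):

import numpy as np

rng = np.random.default_rng(0)
rho = 0.9
burn_in, sample_lag, total_steps = 100, 10, 2000

x, y = 0.0, 0.0
kept = []
for step in range(total_steps):
    # conditional draws: x | y ~ N(rho*y, 1-rho^2), and symmetrically for y | x
    x = rng.normal(rho * y, np.sqrt(1 - rho**2))
    y = rng.normal(rho * x, np.sqrt(1 - rho**2))
    # keep only every `sample_lag`-th sample after burn-in to reduce autocorrelation
    if step >= burn_in and (step - burn_in) % sample_lag == 0:
        kept.append((x, y))

print(np.mean(kept, axis=0))  # averaging over the retained (thinned) samples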

Thanks!

Mark

MarkWClements-zz · Aug 18, 2021

Hello @MarkWClements. The current version of tomotopy uses only a single sample from the latest model state. Although this often leads to less accurate estimation, this method was adopted because of its low memory usage and low computational load.

I understand that there are situations where accuracy is more important than speed or efficiency, so I'll consider adding a new feature: averaging over samples with a sample-lag. Since the current implementation is optimized to hold only one sample, changing it may take a long time.

In the meantime, you can mimic averaging over samples on the Python side:

import numpy as np

# `model` is an already-constructed tomotopy model (e.g. tp.LDAModel) with documents added
burn_in_steps = 100
sample_lag = 10
total_steps = 1000
averaging_size = 5

stored_models = []

# run the burn-in iterations first
model.train(burn_in_steps)

for i in range(burn_in_steps, total_steps, sample_lag):
    model.train(sample_lag)
    # keep a deep copy of the model state every `sample_lag` iterations
    stored_models.append(model.copy())
    if len(stored_models) > averaging_size:
        stored_models.pop(0)

# now `stored_models` holds the 5 most recent models, spaced `sample_lag` iterations apart

average_topic_word_dist = np.mean([m.get_topic_word_dist(topic_id=0) for m in stored_models], axis=0)
# you can obtain the final distribution by averaging the corresponding distribution
# from each model in `stored_models` in the same way.

(Obviously this Python code is inefficient, because it has to keep 5 copies of the whole model and perform a deep copy every sample_lag steps. But for now this seems to be the only way.)
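If you also want averaged per-document topic distributions, I think the same pattern applies (sketch only; the deep copies keep the same document order, so indexing should line up across the stored models):

# average the topic distribution of document `doc_idx` over the stored models
doc_idx = 0
average_doc_topic_dist = np.mean(
    [m.docs[doc_idx].get_topic_dist() for m in stored_models], axis=0)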

bab2min · Aug 18, 2021