Division by zero

Open olegs opened this issue 2 years ago • 4 comments

To reproduce, use the predefined "brain computer interface" search from PubMed.

[2021-10-14 08:34:35,747: INFO/ForkPoolWorker-1] Generating evolution topics descriptions
[2021-10-14 08:34:35,833: WARNING/ForkPoolWorker-1] /home/user/pysrc/papers/analysis/topics.py:116: RuntimeWarning: invalid value encountered in true_divide
  tokens_freqs_per_comp = tokens_freqs_per_comp / tokens_freqs_norm
[2021-10-14 08:34:35,833: WARNING/ForkPoolWorker-1] /home/user/pysrc/papers/analysis/topics.py:123: RuntimeWarning: divide by zero encountered in log
  adjusted_distance = distance.T * np.log(tokens_freqs_total)

olegs avatar Oct 14 '21 08:10 olegs

@ctrltz is it possible to use np.log1p to avoid this problem?

olegs avatar Oct 14 '21 08:10 olegs
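
For reference, a minimal sketch of how np.log and np.log1p differ at zero. The array values are made up for illustration; only the name tokens_freqs_total comes from the log above, and its real shape in topics.py is not shown in this thread.

```python
import numpy as np

# Hypothetical values; the real tokens_freqs_total is computed in topics.py.
tokens_freqs_total = np.array([0.0, 1.0, 9.0])

# np.log(0) evaluates to -inf and emits "divide by zero encountered in log".
with np.errstate(divide="ignore"):  # silence the warning for this demo only
    plain_log = np.log(tokens_freqs_total)

# np.log1p(x) computes log(1 + x), so a zero input maps to 0.0 instead of -inf.
shifted_log = np.log1p(tokens_freqs_total)

print(plain_log)
print(shifted_log)
```

Note that np.log1p shifts every weight, not only the zero case: a frequency of 1 is weighted by log(2) rather than 0, so the resulting adjusted_distance values would differ from the np.log version everywhere.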

Sure, but if tokens_freqs_total equals 0, I think it means that corpus_counts contains only zeros, so one might also handle this case explicitly, like:

if not corpus_counts.sum():
    return *empty descriptions here*

I did not keep evolution in mind when I worked on the topics description; thanks for pointing it out.

ctrltz avatar Oct 14 '21 09:10 ctrltz

Also, tokens_freqs_norm may be zero. What is the correct fix for this?

olegs avatar Oct 14 '21 09:10 olegs

As far as I understand, it means that some of the components have no corpus terms to be analyzed, so it would be correct to return an empty description for the respective components.

It might be simpler to plug in np.log1p for now to ensure stability, and I can think about it a bit more in the coming days.

NB: I have also fixed the previous comment in case you have used it already.

ctrltz avatar Oct 14 '21 10:10 ctrltz
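
One possible way to handle a zero tokens_freqs_norm, sketched under assumptions that this thread does not confirm: rows of tokens_freqs_per_comp are taken to be components, and the norm is taken to be the per-row sum. The function name normalize_per_component and the returned empty_components index array are hypothetical and not part of topics.py.

```python
import numpy as np

def normalize_per_component(tokens_freqs_per_comp):
    """Divide each row by its sum, leaving all-zero rows as zeros (no NaN).

    Hypothetical helper: assumes rows are components and columns are tokens.
    """
    freqs = np.asarray(tokens_freqs_per_comp, dtype=float)
    # Assumed definition of the norm: total token frequency per component.
    tokens_freqs_norm = freqs.sum(axis=1, keepdims=True)

    # Divide only where the norm is non-zero; other entries keep the 0.0
    # they were initialized with, so 0/0 is never evaluated.
    normalized = np.zeros_like(freqs)
    np.divide(freqs, tokens_freqs_norm, out=normalized,
              where=tokens_freqs_norm != 0)

    # Components with no corpus terms; the caller could give these
    # empty descriptions.
    empty_components = np.flatnonzero(tokens_freqs_norm.ravel() == 0)
    return normalized, empty_components

normalized, empty_components = normalize_per_component(
    [[2, 2], [0, 0], [1, 3]]
)
print(normalized)
print(empty_components)
```

Returning the indices of the empty components separately keeps the "empty description" decision in the caller, which matches the suggestion above without guessing what an empty description looks like in the real code.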