Bug Report: OutOfMemoryError (“GC overhead limit exceeded”) when running topic modeling with Mallet

Open teng-gao opened this issue 4 months ago • 3 comments

Describe the bug

When invoking pycistopic topic_modeling … with the Mallet backend on a large corpus, the Java process crashes during train-topics with:

Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
RuntimeError: mallet train-topics returned non-zero exit status 1

This indicates that the JVM is spending almost all its time garbage-collecting and failing to make forward progress.

To Reproduce

  1. Ensure Mallet is installed (e.g. version 2.0.8) and on your PATH.

  2. Activate your scenicplus conda env:

    conda activate scenicplus
    
  3. Run a topic modeling command on a large dataset, for example:

    pycistopic topic_modeling mallet \
          -i $input_file \
          -o $output_file \
          -t 10 \
          -p $ncores \
          -m $mem \
          -b $mallet_path

  4. Observe the error in the STDERR log (pycisTopic_all_*.err).

Error output

Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
    at cc.mallet.topics.ParallelTopicModel.estimate(ParallelTopicModel.java:727)
    at cc.mallet.topics.tui.TopicTrainer.main(TopicTrainer.java:245)

RuntimeError: command '['/lab-share/.../mallet', 'topic_modeling', ...]' returned non-zero exit status 1.

Expected behavior

Topic modeling should complete (or at least fail cleanly) without thrashing the JVM, given that a large amount of memory is specified. When insufficient memory is given, it crashes with an "out of heap memory" error instead.
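To tell the two failure modes apart when scanning the STDERR log, a small helper along these lines may be useful. This is only a sketch: `classify_oom` and `OOM_KINDS` are hypothetical names, not part of pycisTopic or Mallet, and the code assumes the log contains the standard `java.lang.OutOfMemoryError: <message>` line shown above (or the "Java heap space" variant mentioned elsewhere in this thread).

```python
import re
from typing import Optional

# Messages the JVM appends to java.lang.OutOfMemoryError for the two
# failure modes described in this report (labels are hypothetical).
OOM_KINDS = {
    "GC overhead limit exceeded": "gc_overhead",
    "Java heap space": "heap_space",
}


def classify_oom(stderr_text: str) -> Optional[str]:
    """Return 'gc_overhead', 'heap_space', or None if no known OOM line is found."""
    match = re.search(r"java\.lang\.OutOfMemoryError: (.+)", stderr_text)
    if match is None:
        return None
    # Unknown OutOfMemoryError messages also map to None.
    return OOM_KINDS.get(match.group(1).strip())


log = 'Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded\n'
print(classify_oom(log))
```

Attaching the classification to the bug report makes it easier to see whether a change to -m moved the failure from one mode to the other.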

Screenshots

N/A

Version (please complete the following information):

  • Python: 3.11.0 (Miniforge3)
  • pycisTopic: 2.0a0 (from pip show pycisTopic)
  • Mallet: 2.0.8
  • OpenJDK: 11.x

teng-gao, Jul 25 '25 21:07

The amount of memory used by Mallet can be controlled by the -m parameter.

$ pycistopic topic_modeling mallet run --help
usage: pycistopic topic_modeling mallet run [-h] -i MALLET_CORPUS_FILENAME -o OUTPUT_PREFIX -t TOPICS [TOPICS ...] -p PARALLEL [-n ITERATIONS] [-a ALPHA] [-A {True,False}] [-e ETA]
                                            [-E {True,False}] [-s SEED] [-m MEMORY_IN_GB] [-b MALLET_PATH] [-v]

options:
  -h, --help            show this help message and exit
  -i MALLET_CORPUS_FILENAME, --input MALLET_CORPUS_FILENAME
                        Mallet corpus filename.
  -o OUTPUT_PREFIX, --output OUTPUT_PREFIX
                        Topic model output prefix.
  -t TOPICS [TOPICS ...], --topics TOPICS [TOPICS ...]
                        Number(s) of topics to create during topic modeling.
  -p PARALLEL, --parallel PARALLEL
                        Number of threads Mallet is allowed to use.
  -n ITERATIONS, --iterations ITERATIONS
                        Number of iterations. Default: 150.
  -a ALPHA, --alpha ALPHA
                        Alpha value. Default: 50.
  -A {True,False}, --alpha_by_topic {True,False}
                        Whether the alpha value should by divided by the number of topics. Default: True.
  -e ETA, --eta ETA     Eta value. Default: 0.1.
  -E {True,False}, --eta_by_topic {True,False}
                        Whether the eta value should by divided by the number of topics. Default: False.
  -s SEED, --seed SEED  Seed for ensuring reproducibility. Default: 555.
  -m MEMORY_IN_GB, --memory MEMORY_IN_GB
                        Amount of memory (in GB) Mallet is allowed to use. Default: "100"
  -b MALLET_PATH, --mallet_path MALLET_PATH
                        Path to Mallet binary (e.g. "/xxx/Mallet/bin/mallet"). Default: "mallet".
  -v, --verbose         Enable verbose mode.

How many regions and cells do you have, and how much RAM does the node on which you run topic modeling have? With 100 GB of RAM, a fairly large dataset can be run.

ghuls, Jul 28 '25 10:07

Hi @ghuls,

Thanks for the reply. I did use -m to increase the RAM for Mallet. I tried it on a 3000-cell dataset (25k regions) with 32 GB, 128 GB and then even 1 TB, but I still get the same GC error. Note that I used to get a different out-of-memory error (Java heap space) when I didn't use -m to set the memory. So it seems to me that some additional parameters need to be passed to Java/Mallet to control this behavior.
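One way to check whether the -m value actually reaches the JVM is to inspect the command line of the running java process and look for the standard `-Xmx` (maximum heap) option. The sketch below assumes Linux (`/proc/<pid>/cmdline`) and assumes the Mallet launcher passes the heap size as `-Xmx`; `read_cmdline` and `find_max_heap` are hypothetical helper names, and the example arguments are made up for illustration.

```python
import re
from pathlib import Path
from typing import List, Optional, Sequence


def read_cmdline(pid: int) -> List[str]:
    """Read the argument list of a running process (Linux only)."""
    raw = Path(f"/proc/{pid}/cmdline").read_bytes()
    return [part.decode(errors="replace") for part in raw.split(b"\0") if part]


def find_max_heap(java_cmdline: Sequence[str]) -> Optional[str]:
    """Return the value of the -Xmx option in a java command line, or None."""
    value = None
    for arg in java_cmdline:
        m = re.fullmatch(r"-Xmx(\d+[kKmMgGtT]?)", arg)
        if m:
            # Assumption: if -Xmx appears more than once, the last one wins.
            value = m.group(1)
    return value


# Illustrative argument list; a real one would come from read_cmdline(pid).
example = ["java", "-Xmx32g", "-classpath", "...", "cc.mallet.topics.tui.TopicTrainer"]
print(find_max_heap(example))
```

If the value printed for the real process does not change when -m changes, the setting is probably not being forwarded to Java.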

teng-gao, Jul 28 '25 12:07

@teng-gao How much memory does the machine you are running Mallet on have? A 3000-cell dataset (25k regions) is very small. A few GB of RAM should be enough.
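For reference, the machine's physical RAM can be compared against the value passed to -m with a few lines of Python. This is a rough sketch for Linux and similar Unix systems; `physical_ram_gb` and `heap_fits` are hypothetical names, and the 2 GB headroom is an arbitrary assumption, not a pycisTopic or Mallet recommendation.

```python
import os


def physical_ram_gb() -> float:
    """Total physical RAM in GB, as reported by sysconf (Linux/Unix)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1024**3


def heap_fits(requested_gb: float, total_gb: float, headroom_gb: float = 2.0) -> bool:
    """True if the requested heap plus some headroom fits in total_gb of RAM."""
    return requested_gb + headroom_gb <= total_gb


total = physical_ram_gb()
print(f"physical RAM: {total:.1f} GB")
print("a 32 GB heap fits:", heap_fits(32, total))
```

Requesting more heap than the node physically has (for example 1 TB on a smaller machine) is worth ruling out before looking at other causes.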

Can you provide the input file and the $ncores and $mem settings?

pycistopic topic_modeling mallet \
      -i $input_file \
      -o $output_file \
      -t 10 \
      -p $ncores \
      -m $mem \
      -b $mallet_path

ghuls, Jul 30 '25 08:07