pycisTopic
Bug Report: OutOfMemoryError (“GC overhead limit exceeded”) when running topic modeling with Mallet
Describe the bug
When invoking pycistopic topic-modeling … using the Mallet backend on a large corpus, the Java process crashes during train-topics with:
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
RuntimeError: mallet train-topics returned non-zero exit status 1
This indicates that the JVM is spending almost all its time garbage-collecting and failing to make forward progress.
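Since the JVM can raise OutOfMemoryError with different reasons ("GC overhead limit exceeded" here, "Java heap space" in other cases), it helps to confirm which one a given STDERR log contains. The following is a minimal sketch for that; the function name `classify_oom` and its return labels are made up for illustration and are not part of pycisTopic or Mallet.

```python
import re

# Reason strings the JVM appends to java.lang.OutOfMemoryError.
GC_OVERHEAD = "GC overhead limit exceeded"
HEAP_SPACE = "Java heap space"


def classify_oom(stderr_text: str) -> str:
    """Classify the first OutOfMemoryError found in a log (hypothetical helper).

    Returns "gc_overhead", "heap_space", "other", or "none".
    """
    match = re.search(r"java\.lang\.OutOfMemoryError: (.+)", stderr_text)
    if match is None:
        return "none"
    reason = match.group(1).strip()
    if reason.startswith(GC_OVERHEAD):
        return "gc_overhead"
    if reason.startswith(HEAP_SPACE):
        return "heap_space"
    return "other"
```

Passing the contents of the .err file to `classify_oom` shows whether a run hit the GC overhead limit or ran out of heap outright, which makes it easier to compare runs with different -m values.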
To Reproduce
- Ensure Mallet is installed (e.g. version 2.0.8) and on your PATH.
- Activate your scenicplus conda env: conda activate scenicplus
- Run a topic modeling command on a large dataset, for example:
pycistopic topic_modeling mallet \
-i $input_file \
-o $output_file \
-t 10 \
-p $ncores \
-m $mem \
-b $mallet_path
- Observe the error in the STDERR log (pycisTopic_all_*.err).
Error output
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
at cc.mallet.topics.ParallelTopicModel.estimate(ParallelTopicModel.java:727)
at cc.mallet.topics.tui.TopicTrainer.main(TopicTrainer.java:245)
RuntimeError: command '['/lab-share/.../mallet', 'topic_modeling', ...]' returned non-zero exit status 1.
Expected behavior
The topic modeling should complete (or at least fail cleanly) without thrashing the JVM, given that a large amount of memory is specified. When insufficient memory is given, it crashes with "out of heap memory" instead.
Screenshots
N/A
Version (please complete the following information):
- Python: 3.11.0 (Miniforge3)
- pycisTopic: 2.0a0 (from pip show pycisTopic)
- Mallet: 2.0.8
- OpenJDK: 11.x
The amount of memory used by Mallet can be controlled by the -m parameter.
$ pycistopic topic_modeling mallet run --help
usage: pycistopic topic_modeling mallet run [-h] -i MALLET_CORPUS_FILENAME -o OUTPUT_PREFIX -t TOPICS [TOPICS ...] -p PARALLEL [-n ITERATIONS] [-a ALPHA] [-A {True,False}] [-e ETA]
[-E {True,False}] [-s SEED] [-m MEMORY_IN_GB] [-b MALLET_PATH] [-v]
options:
-h, --help show this help message and exit
-i MALLET_CORPUS_FILENAME, --input MALLET_CORPUS_FILENAME
Mallet corpus filename.
-o OUTPUT_PREFIX, --output OUTPUT_PREFIX
Topic model output prefix.
-t TOPICS [TOPICS ...], --topics TOPICS [TOPICS ...]
Number(s) of topics to create during topic modeling.
-p PARALLEL, --parallel PARALLEL
Number of threads Mallet is allowed to use.
-n ITERATIONS, --iterations ITERATIONS
Number of iterations. Default: 150.
-a ALPHA, --alpha ALPHA
Alpha value. Default: 50.
-A {True,False}, --alpha_by_topic {True,False}
Whether the alpha value should by divided by the number of topics. Default: True.
-e ETA, --eta ETA Eta value. Default: 0.1.
-E {True,False}, --eta_by_topic {True,False}
Whether the eta value should by divided by the number of topics. Default: False.
-s SEED, --seed SEED Seed for ensuring reproducibility. Default: 555.
-m MEMORY_IN_GB, --memory MEMORY_IN_GB
Amount of memory (in GB) Mallet is allowed to use. Default: "100"
-b MALLET_PATH, --mallet_path MALLET_PATH
Path to Mallet binary (e.g. "/xxx/Mallet/bin/mallet"). Default: "mallet".
-v, --verbose Enable verbose mode.
How many regions and cells do you have, and how much RAM does the node on which you run topic modeling have? With 100 GB of RAM, fairly large datasets can be run.
Hi @ghuls ,
Thanks for the reply. I did use -m to increase the RAM for Mallet. I tried it on a 3000-cell dataset (25k regions) with 32 GB, 128 GB and then even 1 TB, but I still get the same GC error. Note that I used to get a different out-of-memory error (Java heap space) when I didn't use -m to set the memory. So it seems to me that some additional parameters need to be passed to Java/Mallet to control this behavior.
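One way to see which heap setting actually reaches Java is to look at the Mallet launcher itself. The sketch below assumes the file given to -b is a plain-text shell script that passes a -Xmx option to java, possibly through a variable with MEMORY in its name; both are assumptions about the launcher, and `find_memory_lines` is a hypothetical helper, not a pycisTopic or Mallet function.

```python
from pathlib import Path


def find_memory_lines(launcher_path: str) -> list[str]:
    """Return launcher lines that mention a heap-size setting (hypothetical helper).

    Assumes the launcher is a text script; matches on "-Xmx" and "MEMORY".
    """
    keywords = ("-Xmx", "MEMORY")
    lines = Path(launcher_path).read_text(errors="replace").splitlines()
    return [line.strip() for line in lines if any(k in line for k in keywords)]


# Example call (replace the argument with the value passed to -b):
# for line in find_memory_lines(mallet_path):
#     print(line)
```

If the printed lines show a fixed heap size rather than one read from the environment or the command line, a value passed with -m may not reach the JVM. That is a possibility to check, not a confirmed cause.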
@teng-gao how much memory does the machine that you are trying to run Mallet on have? A 3000-cell dataset (25k regions) is a very small dataset. A few GB of RAM should be enough.
Can you provide the input file and the $ncores and $mem settings?
pycistopic topic_modeling mallet \
-i $input_file \
-o $output_file \
-t 10 \
-p $ncores \
-m $mem \
-b $mallet_path
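To answer the questions about the machine, the core count and total physical RAM can be collected with the Python standard library. This is a minimal sketch that assumes a Linux-like system where `os.sysconf` exposes `SC_PAGE_SIZE` and `SC_PHYS_PAGES`; on other platforms the RAM field stays `None`. The name `machine_summary` is made up for illustration.

```python
import os


def machine_summary() -> dict:
    """Collect CPU count and total physical RAM in GB (hypothetical helper)."""
    summary = {"cpus": os.cpu_count(), "total_ram_gb": None}
    try:
        page_size = os.sysconf("SC_PAGE_SIZE")    # bytes per memory page
        phys_pages = os.sysconf("SC_PHYS_PAGES")  # number of physical pages
        summary["total_ram_gb"] = round(page_size * phys_pages / 1024**3, 1)
    except (ValueError, OSError, AttributeError):
        # sysconf or these names are unavailable on this platform.
        pass
    return summary


if __name__ == "__main__":
    print(machine_summary())
```

Posting this output together with the $ncores and $mem values makes it easy to see whether the requested memory exceeds what the node has.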