
Segfaults with "high" number of documents (>57.000)

Open alexd2580-sf opened this issue 3 years ago • 6 comments

Top2Vec segfaults at a "high" number of documents. I have no idea why.

I am running with deep-learn speed and the universal-sentence-encoder embedding model. My texts are at most 20K characters each (does this have a big influence?)

What can I do to mitigate this? Another issue mentioned using fewer documents for training and adding the others later on (untrained, just classified, for lookup purposes), but that issue was about a training dataset of 500K documents, not 50K. A rough sketch of that approach is below.
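
If I understand that suggestion correctly, it would look roughly like this (untested sketch; `documents` stands in for my corpus, and I haven't verified how add_documents behaves at this scale):

from top2vec import Top2Vec

train_docs = documents[:50_000]   # subset actually used for training
rest_docs = documents[50_000:]    # only embedded and assigned afterwards

model = Top2Vec(train_docs,
                embedding_model="universal-sentence-encoder",
                speed="deep-learn")

# add_documents embeds the new texts and assigns them to the already-learned
# topics without re-training the topic model itself
model.add_documents(rest_docs)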

Here's the output where it breaks:

2021-10-18 21:53:20,033 - top2vec - INFO - Pre-processing documents for training
/home/sascha/project/.venv/lib/python3.7/site-packages/sklearn/utils/deprecation.py:87: FutureWarning: Function get_feature_names is deprecated; get_feature_names is deprecated in 1.0 and will be removed in 1.2. Please use get_feature_names_out instead.
  warnings.warn(msg, category=FutureWarning)
2021-10-18 21:54:59,567 - top2vec - INFO - Downloading universal-sentence-encoder model
INFO:absl:Using /tmp/tfhub_modules to cache modules.
2021-10-18 21:54:59.697664: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2021-10-18 21:54:59.734079: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-10-18 21:54:59.734467: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties: 
pciBusID: 0000:2d:00.0 name: NVIDIA GeForce RTX 2070 SUPER computeCapability: 7.5
coreClock: 1.8GHz coreCount: 40 deviceMemorySize: 7.79GiB deviceMemoryBandwidth: 417.29GiB/s
2021-10-18 21:54:59.734654: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2021-10-18 21:54:59.735667: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2021-10-18 21:54:59.736811: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2021-10-18 21:54:59.736998: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2021-10-18 21:54:59.738008: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2021-10-18 21:54:59.738588: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2021-10-18 21:54:59.740711: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-10-18 21:54:59.740830: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-10-18 21:54:59.741236: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-10-18 21:54:59.741559: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0
2021-10-18 21:54:59.741800: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2021-10-18 21:54:59.760277: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 3801525000 Hz
2021-10-18 21:54:59.761029: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7f2a6c000b60 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-10-18 21:54:59.761046: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2021-10-18 21:54:59.856525: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-10-18 21:54:59.856950: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55b3a2f274b0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2021-10-18 21:54:59.856972: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): NVIDIA GeForce RTX 2070 SUPER, Compute Capability 7.5
2021-10-18 21:54:59.857192: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-10-18 21:54:59.857744: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties: 
pciBusID: 0000:2d:00.0 name: NVIDIA GeForce RTX 2070 SUPER computeCapability: 7.5
coreClock: 1.8GHz coreCount: 40 deviceMemorySize: 7.79GiB deviceMemoryBandwidth: 417.29GiB/s
2021-10-18 21:54:59.857820: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2021-10-18 21:54:59.857847: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2021-10-18 21:54:59.857870: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2021-10-18 21:54:59.857893: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2021-10-18 21:54:59.857916: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2021-10-18 21:54:59.857938: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2021-10-18 21:54:59.857961: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-10-18 21:54:59.858054: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-10-18 21:54:59.858643: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-10-18 21:54:59.859154: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0
2021-10-18 21:54:59.859197: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2021-10-18 21:54:59.859901: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-10-18 21:54:59.859916: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1108]      0 
2021-10-18 21:54:59.859924: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 0:   N 
2021-10-18 21:54:59.860061: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-10-18 21:54:59.860670: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-10-18 21:54:59.861115: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7224 MB memory) -> physical GPU (device: 0, name: NVIDIA GeForce RTX 2070 SUPER, pci bus id: 0000:2d:00.0, compute capability: 7.5)
2021-10-18 21:55:03,509 - top2vec - INFO - Creating joint document/word embedding
INFO:top2vec:Creating joint document/word embedding
2021-10-18 21:55:03.686753: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2021-10-18 21:56:55,599 - top2vec - INFO - Creating lower dimension embedding of documents
INFO:top2vec:Creating lower dimension embedding of documents
zsh: segmentation fault (core dumped)  python

And here are the versions of the libraries I'm using (all other combinations are incompatible according to pip):

AMD Ryzen 9 3900X 12-Core Processor (x86_64)
NVIDIA GeForce RTX 2070S
64 GB Ram

Arch Linux 5.14.11-arch1-1 x86_64

Python 3.7.12

        "numpy==1.20.0",
        "tensorflow==2.2.0",
        "tensorflow-text==2.2.1",
        "nltk>=3.6.5",
        "top2vec[sentence_encoders,sentence_transformers]==1.0.26",
        "scipy>=1.4.1",
        "sklearn>=0.0",

alexd2580-sf avatar Oct 18 '21 20:10 alexd2580-sf

Another interesting case is training a model with the same parameters twice without restarting Python in between. The first attempt throws an internal error, and the second attempt segfaults.
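
For context, train() is just a thin wrapper around the Top2Vec constructor, roughly equivalent to this (simplified; the real code passes a few more kwargs):

from top2vec import Top2Vec

def train(speed, embedding_model):
    # `documents` is the corpus loaded elsewhere in the module
    return Top2Vec(documents,
                   speed=speed,
                   embedding_model=embedding_model)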

>>> train(speed="fast-learn", embedding_model="doc2vec")
2021-10-19 14:50:00,788 - top2vec - INFO - Pre-processing documents for training
2021-10-19 14:50:14,117 - top2vec - INFO - Creating joint document/word embedding
2021-10-19 14:57:10,169 - top2vec - INFO - Creating lower dimension embedding of documents
2021-10-19 14:58:07,182 - top2vec - INFO - Finding dense areas of documents
joblib.externals.loky.process_executor._RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/sascha/project/.venv/lib/python3.7/site-packages/joblib/externals/loky/process_executor.py", line 407, in _process_worker
    call_item = call_queue.get(block=True, timeout=timeout)
  File "/usr/lib/python3.7/multiprocessing/queues.py", line 113, in get
    return _ForkingPickler.loads(res)
  File "sklearn/neighbors/_binary_tree.pxi", line 1057, in sklearn.neighbors._kd_tree.BinaryTree.__setstate__
  File "sklearn/neighbors/_binary_tree.pxi", line 999, in sklearn.neighbors._kd_tree.BinaryTree._update_memviews
  File "stringsource", line 658, in View.MemoryView.memoryview_cwrapper
  File "stringsource", line 349, in View.MemoryView.memoryview.__cinit__
ValueError: buffer source array is read-only
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/sascha/project/packages/ml/ml/top2vec_model.py", line 125, in create
    speed=self.kwargs["speed"],
  File "/home/sascha/project/packages/ml/ml/top2vec.py", line 370, in __init__
    cluster = hdbscan.HDBSCAN(**hdbscan_args).fit(umap_model.embedding_)
  File "/home/sascha/project/.venv/lib/python3.7/site-packages/hdbscan/hdbscan_.py", line 919, in fit
    self._min_spanning_tree) = hdbscan(X, **kwargs)
  File "/home/sascha/project/.venv/lib/python3.7/site-packages/hdbscan/hdbscan_.py", line 615, in hdbscan
    core_dist_n_jobs, **kwargs)
  File "/home/sascha/project/.venv/lib/python3.7/site-packages/joblib/memory.py", line 349, in __call__
    return self.func(*args, **kwargs)
  File "/home/sascha/project/.venv/lib/python3.7/site-packages/hdbscan/hdbscan_.py", line 278, in _hdbscan_boruvka_kdtree
    n_jobs=core_dist_n_jobs, **kwargs)
  File "hdbscan/_hdbscan_boruvka.pyx", line 392, in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm.__init__
  File "hdbscan/_hdbscan_boruvka.pyx", line 426, in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm._compute_bounds
  File "/home/sascha/project/.venv/lib/python3.7/site-packages/joblib/parallel.py", line 1056, in __call__
    self.retrieve()
  File "/home/sascha/project/.venv/lib/python3.7/site-packages/joblib/parallel.py", line 935, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "/home/sascha/project/.venv/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 542, in wrap_future_result
    return future.result(timeout=timeout)
  File "/usr/lib/python3.7/concurrent/futures/_base.py", line 435, in result
    return self.__get_result()
  File "/usr/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
joblib.externals.loky.process_executor.BrokenProcessPool: A task has failed to un-serialize. Please ensure that the arguments of the function are all picklable.
>>>
>>>
>>> train(speed="fast-learn", embedding_model="doc2vec")
2021-10-19 15:25:49,998 - top2vec - INFO - Pre-processing documents for training
2021-10-19 15:26:03,079 - top2vec - INFO - Creating joint document/word embedding
2021-10-19 15:32:52,837 - top2vec - INFO - Creating lower dimension embedding of documents
zsh: segmentation fault (core dumped)  poetry run python
/home/sascha/project/.venv/lib/python3.7/site-packages/joblib/externals/loky/backend/resource_tracker.py:320: UserWarning: resource_tracker: There appear to be 1 leaked folder objects to clean up at shutdown
  (len(rtype_registry), rtype))

alexd2580-sf avatar Oct 19 '21 13:10 alexd2580-sf

Hi @alexd2580-sf, I have the same problem as you. Did you solve it?

zhaoyi1025 avatar Nov 10 '21 01:11 zhaoyi1025

Nope, not at all. Interestingly enough, this time when I ran the same code it worked on a 112K-document dataset. Maybe that's because this time GPU support failed to initialize (after hibernation the GPU sometimes comes back "~weirdly~"). The GPU is still online though, recognized by nvtop and such...

The initialization then looks like this:

2021-11-16 17:46:20,703 - top2vec - INFO - Downloading universal-sentence-encoder model
INFO:absl:Using /tmp/tfhub_modules to cache modules.
2021-11-16 17:46:20.841470: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2021-11-16 17:46:20.860213: E tensorflow/stream_executor/cuda/cuda_driver.cc:313] failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error
2021-11-16 17:46:20.860235: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: monika
2021-11-16 17:46:20.860242: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: monika
2021-11-16 17:46:20.860336: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: 495.44.0
2021-11-16 17:46:20.860354: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 495.44.0
2021-11-16 17:46:20.860360: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:310] kernel version seems to match DSO: 495.44.0
2021-11-16 17:46:20.860545: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2021-11-16 17:46:20.880134: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 3801815000 Hz
2021-11-16 17:46:20.880815: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7f22c8000b60 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-11-16 17:46:20.880833: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version

So in summary:

  • using the universal-sentence-encoder model
  • documents with at most 20K characters each
  • with the GPU it trains with up to ~50K documents, then starts breaking, segfaulting, etc.; see the OP
  • when GPU initialization fails, it works with up to ~112K documents; larger datasets give the same errors (see the CPU-only sketch at the end of this comment for forcing that path deliberately)
  • sometimes running the training multiple times consecutively (just calling train() over and over) yields different results

Going to try running this on the beefiest machine on AWS...
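
Side note: the CPU-only path can also be forced on purpose instead of waiting for the GPU to come back weirdly, by hiding the device from TensorFlow before it gets imported, directly or via top2vec (untested in this exact setup):

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"   # TensorFlow will see no GPU at all

from top2vec import Top2Vec
# model = Top2Vec(documents, embedding_model="universal-sentence-encoder")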

alexd2580-sf avatar Nov 16 '21 18:11 alexd2580-sf

I've also run into this.

It depends on what your problem is and what you're trying to achieve or implement, but in my case splitting the dataset in two, which resulted in two models, more or less patched things up (roughly like the sketch below).
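
Sketch of what I mean (`documents` is your corpus; the obvious downside is that you can no longer search or assign topics across a single model):

from top2vec import Top2Vec

half = len(documents) // 2
model_a = Top2Vec(documents[:half], embedding_model="universal-sentence-encoder")
model_b = Top2Vec(documents[half:], embedding_model="universal-sentence-encoder")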

ViksaaSkool avatar Dec 13 '21 00:12 ViksaaSkool

This is somewhat related to my question here: https://github.com/ddangelov/Top2Vec/issues/242, and I have encountered a similar problem. It seems that when using a GPU, more memory is needed for the same amount of data. I am using an HPC server for this, and here is what I have tried:

  1. using a Tesla 100 with 32GB GPU memory
  2. using CPU only with 32GB memory

On the same dataset, (1) ended up with a segfault, potentially indicating insufficient memory, while (2) finished successfully.

I don't know why, but it seems that I have to give up the speed of the GPU because it requires more memory.

ziqizhang avatar Feb 09 '22 20:02 ziqizhang

The new Top2Vec version 1.0.27 has some memory bugfixes as well as an option to chunk long documents, which could help solve some of these issues. There is also an embedding_batch_size parameter.
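
A minimal sketch of how these options are passed (embedding_batch_size is the parameter mentioned above; the chunking parameter names below are taken from the current docs and may differ slightly in 1.0.27, so please check the docstring):

from top2vec import Top2Vec

model = Top2Vec(documents,
                embedding_model="universal-sentence-encoder",
                embedding_batch_size=16,        # smaller batches lower peak GPU memory
                split_documents=True,           # chunk long documents before embedding
                document_chunker="sequential")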

ddangelov avatar Apr 03 '22 22:04 ddangelov