Random `segfault` when running pipeline tuning with `wandb` on certain machines
When running the tuning examples recently introduced in #398 (and #406), there is a random chance of hitting a `segfault`. The issue was later observed to be machine specific: I have only been getting this random `segfault` on MSU ICER HPCC (Python 3.8.16). Running the same example script on `papermachine` does not trigger the `segfault`.
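Looking at the core dump file (using `pystack`), the issue appears to be related to Python's `threading`. More specifically, the thread holding the GIL at the time of the crash was inside sklearn's `randomized_svd` (in the `scipy.linalg.lu` call), which was running in a worker thread started by `wandb`. Other similar packages may be affected as well. See the detailed core dump log below.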
(dance) bash-4.2$ pystack core core.113134
Using executable found in the core file: /mnt/home/liurenmi/software/anaconda3/envs/dance/bin/python
Core file information:
state: D zombie: True niceness: 0
pid: 113134 ppid: 112816 sid: 112816
uid: 790872 gid: 2362 pgrp: 113134
executable: python arguments: python main.py
The process died due a segmentation fault accessing address: 0xffffffffffffff70
Traceback for thread 114928 [] (most recent call last):
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/threading.py", line 890, in _bootstrap
self._bootstrap_inner()
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 300, in check_internal_messages
self._loop_check_status(
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 251, in _loop_check_status
join_requested = self._join_event.wait(timeout=wait_time)
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/threading.py", line 558, in wait
signaled = self._cond.wait(timeout)
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/threading.py", line 306, in wait
gotit = waiter.acquire(True, timeout)
Traceback for thread 114927 [] (most recent call last):
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/threading.py", line 890, in _bootstrap
self._bootstrap_inner()
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 268, in check_network_status
self._loop_check_status(
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 251, in _loop_check_status
join_requested = self._join_event.wait(timeout=wait_time)
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/threading.py", line 558, in wait
signaled = self._cond.wait(timeout)
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/threading.py", line 306, in wait
gotit = waiter.acquire(True, timeout)
Traceback for thread 114926 [] (most recent call last):
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/threading.py", line 890, in _bootstrap
self._bootstrap_inner()
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 286, in check_stop_status
self._loop_check_status(
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 251, in _loop_check_status
join_requested = self._join_event.wait(timeout=wait_time)
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/threading.py", line 558, in wait
signaled = self._cond.wait(timeout)
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/threading.py", line 306, in wait
gotit = waiter.acquire(True, timeout)
Traceback for thread 114874 [] (most recent call last):
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/threading.py", line 890, in _bootstrap
self._bootstrap_inner()
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/site-packages/wandb/sdk/interface/router.py", line 70, in message_loop
msg = self._read_message()
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/site-packages/wandb/sdk/interface/router_sock.py", line 27, in _read_message
resp = self._sock_client.read_server_response(timeout=1)
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 285, in read_server_response
data = self._read_packet_bytes(timeout=timeout)
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 255, in _read_packet_bytes
data = self._sock.recv(self._bufsize)
Traceback for thread 114845 [Has the GIL] (most recent call last):
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/threading.py", line 890, in _bootstrap
self._bootstrap_inner()
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/site-packages/wandb/agents/pyagent.py", line 298, in _run_job
self._function()
(Python) File "main.py", line 83, in evaluate_pipeline
preprocessing_pipeline(data)
(Python) File "/mnt/ufs18/home-026/liurenmi/repo/dance/dance/pipeline.py", line 238, in __call__
func(*args, **kwargs)
(Python) File "/mnt/ufs18/home-026/liurenmi/repo/dance/dance/transforms/cell_feature.py", line 56, in __call__
gene_feat = gene_pca.fit_transform(feat.T) # decompose into gene features
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/site-packages/sklearn/utils/_set_output.py", line 157, in wrapped
data_to_wrap = f(self, X, *args, **kwargs)
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/site-packages/sklearn/base.py", line 1152, in wrapper
return fit_method(estimator, *args, **kwargs)
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/site-packages/sklearn/decomposition/_pca.py", line 460, in fit_transform
U, S, Vt = self._fit(X)
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/site-packages/sklearn/decomposition/_pca.py", line 512, in _fit
return self._fit_truncated(X, n_components, self._fit_svd_solver)
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/site-packages/sklearn/decomposition/_pca.py", line 616, in _fit_truncated
U, S, Vt = randomized_svd(
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/site-packages/sklearn/utils/extmath.py", line 449, in randomized_svd
Q = randomized_range_finder(
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/site-packages/sklearn/utils/extmath.py", line 277, in randomized_range_finder
Q, _ = linalg.lu(safe_sparse_dot(A, Q), permute_l=True)
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/site-packages/scipy/linalg/_decomp_lu.py", line 220, in lu
p, l, u, info = flu(a1, permute_l=permute_l, overwrite_a=overwrite_a)
Traceback for thread 114844 [] (most recent call last):
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/threading.py", line 890, in _bootstrap
self._bootstrap_inner()
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/site-packages/wandb/agents/pyagent.py", line 178, in _heartbeat
time.sleep(5)
Traceback for thread 113134 [] (most recent call last):
(Python) File "main.py", line 108, in <module>
wandb.agent(sweep_id, function=evaluate_pipeline, count=3)
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/site-packages/wandb/wandb_agent.py", line 581, in agent
return pyagent(sweep_id, function, entity, project, count)
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/site-packages/wandb/agents/pyagent.py", line 348, in pyagent
agent.run()
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/site-packages/wandb/agents/pyagent.py", line 326, in run
self._run_jobs_from_queue()
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/site-packages/wandb/agents/pyagent.py", line 220, in _run_jobs_from_queue
thread.join()
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/threading.py", line 1011, in join
self._wait_for_tstate_lock()
(Python) File "/mnt/home/liurenmi/software/anaconda3/envs/dance/lib/python3.8/threading.py", line 1027, in _wait_for_tstate_lock
elif lock.acquire(block, timeout):
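The dump above shows the sweep function running in a worker thread while the main thread waits in `thread.join()`. Below is a minimal sketch of how similar per-thread tracebacks could be captured at crash time without a core file, using the standard-library `faulthandler` module. The names `run_in_worker` and `placeholder_job` are hypothetical stand-ins (the latter for `evaluate_pipeline`), and the thread layout only approximates what `wandb` does.

```python
import faulthandler
import sys
import threading


def run_in_worker(job):
    """Run `job` in a worker thread and wait for it with join()."""
    result = {}

    def target():
        result["value"] = job()

    thread = threading.Thread(target=target, name="job-worker")
    thread.start()
    thread.join()  # the main thread blocks here, as in the dump
    return result.get("value")


# On SIGSEGV (and a few other fatal signals), write the Python traceback
# of every thread to stderr before the process dies.
faulthandler.enable(file=sys.stderr, all_threads=True)


def placeholder_job():
    # Hypothetical stand-in for the real job (evaluate_pipeline in main.py).
    return sum(i * i for i in range(10))


value = run_in_worker(placeholder_job)
```

The same effect can be had without code changes by running the script with `python -X faulthandler main.py` or by setting `PYTHONFAULTHANDLER=1`.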
More sysinfo below.
Machine that failed:
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"
CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
Machine that did not fail:
NAME="Ubuntu"
VERSION="20.04.6 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.6 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal
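To compare the two environments field by field rather than by eye, the os-release output above can be parsed into dictionaries. This is a small sketch under the assumption that the input is plain `KEY=VALUE` lines as shown; `parse_os_release` and `diff_fields` are hypothetical helper names, and only a few fields from the listings above are used as sample input.

```python
def parse_os_release(text):
    """Parse KEY=VALUE lines in the os-release format into a dict."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        info[key] = value.strip().strip('"')  # drop optional double quotes
    return info


def diff_fields(a, b, keys=("NAME", "VERSION_ID", "ID_LIKE")):
    """Return {key: (a_value, b_value)} for the selected keys that differ."""
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


# Sample fields copied from the two listings above.
failed = parse_os_release('NAME="CentOS Linux"\nVERSION_ID="7"\nID_LIKE="rhel fedora"')
ok = parse_os_release('NAME="Ubuntu"\nVERSION_ID="20.04"\nID_LIKE=debian')
differences = diff_fields(failed, ok)
```

On a live machine the text would typically come from reading `/etc/os-release`. This only captures the OS release; library versions would need to be compared separately.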
Skipping this for now, but I might come back to fix it later if it turns out to affect users other than myself.