'ZMQError: Address already in use' when running training
Bug description
When using 'Predict -> Run Training' in the GUI, the terminal shows ZMQError: Address already in use and the training shuts down.
Expected behaviour
I should be able to train on my videos, which are properly labelled with a properly defined skeleton (nodes and edges).
Actual behaviour
After labelling 100 frames from two videos, I went to 'Predict -> Run Training'.
I selected 'single animal' training, set a 'Run Name Prefix', and set 'Predict On' to 'random frames'.
Then I clicked 'Run', which led to the training windows closing and the terminal showing ZMQError: Address already in use (see log below).
My attempt at solving the issue
Since I had left 'Controller Port' and 'Publish Port' at their defaults of 9000 and 9001, I checked their status using sudo netstat -tulnp | grep -E ':9000|:9001'; neither was in use. I also checked the port currently used by SLEAP, which was 3643 and should not cause any conflict. I then tried other free ports in different configurations ('Controller Port'/'Publish Port' set to 25000/25001, then to 5000/5001), which led to the exact same error.
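As an additional sanity check, a minimal pyzmq sketch along these lines (pyzmq is the same library that raises the error; the ports are the dialog defaults, so adjust as needed) can confirm whether the two addresses are actually bindable outside of SLEAP:

```python
# Minimal bind test with pyzmq (the library raising the error above), using
# the default Controller/Publish ports 9000 and 9001 from the training dialog.
import zmq

ctx = zmq.Context()
for port in (9000, 9001):
    sock = ctx.socket(zmq.PUB)
    try:
        sock.bind(f"tcp://127.0.0.1:{port}")
        print(f"port {port}: bind OK")
    except zmq.ZMQError as e:
        print(f"port {port}: {e}")
    finally:
        sock.close()
ctx.term()
```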
Your personal set up
- OS: Ubuntu 24.04.1 LTS (Release: 24.04, Codename: noble)
- Version(s): SLEAP v1.4.1, Python 3.12.8, conda 25.1.0
- SLEAP installation method (listed here):
- [x] Conda from package
- [ ] Conda from source
- [ ] pip package
- [ ] Apple Silicon Macs
Environment packages
# Name Version Build Channel
_libgcc_mutex 0.1 main
_openmp_mutex 5.1 1_gnu
anaconda-anon-usage 0.5.0 py312hfc0e8ea_100
archspec 0.2.3 pyhd3eb1b0_0
boltons 24.1.0 py312h06a4308_0
brotli-python 1.0.9 py312h6a678d5_9
bzip2 1.0.8 h5eee18b_6
c-ares 1.19.1 h5eee18b_0
ca-certificates 2024.12.31 h06a4308_0
certifi 2024.12.14 py312h06a4308_0
cffi 1.17.1 py312h1fdaa30_1
charset-normalizer 3.3.2 pyhd3eb1b0_0
conda 25.1.0 py312h06a4308_0
conda-anaconda-telemetry 0.1.2 py312h06a4308_0
conda-content-trust 0.2.0 py312h06a4308_1
conda-libmamba-solver 25.1.1 pyhd3eb1b0_0
conda-package-handling 2.4.0 py312h06a4308_0
conda-package-streaming 0.11.0 py312h06a4308_0
cpp-expected 1.1.0 hdb19cb5_0
cryptography 43.0.3 py312h7825ff9_1
distro 1.9.0 py312h06a4308_0
expat 2.6.4 h6a678d5_0
fmt 9.1.0 hdb19cb5_1
frozendict 2.4.2 py312h06a4308_0
icu 73.1 h6a678d5_0
idna 3.7 py312h06a4308_0
jsonpatch 1.33 py312h06a4308_1
jsonpointer 2.1 pyhd3eb1b0_0
krb5 1.20.1 h143b758_1
ld_impl_linux-64 2.40 h12ee557_0
libarchive 3.7.7 hfab0078_0
libcurl 8.11.1 hc9e6f67_0
libedit 3.1.20230828 h5eee18b_0
libev 4.33 h7f8727e_1
libffi 3.4.4 h6a678d5_1
libgcc-ng 11.2.0 h1234567_1
libgomp 11.2.0 h1234567_1
libmamba 2.0.5 haf1ee3a_1
libmambapy 2.0.5 py312hdb19cb5_1
libnghttp2 1.57.0 h2d74bed_0
libsolv 0.7.30 he621ea3_1
libssh2 1.11.1 h251f7ec_0
libstdcxx-ng 11.2.0 h1234567_1
libuuid 1.41.5 h5eee18b_0
libxml2 2.13.5 hfdd30dd_0
lz4-c 1.9.4 h6a678d5_1
menuinst 2.2.0 py312h06a4308_0
ncurses 6.4 h6a678d5_0
nlohmann_json 3.11.2 h6a678d5_0
openssl 3.0.15 h5eee18b_0
packaging 24.2 py312h06a4308_0
pcre2 10.42 hebb0a14_1
pip 24.2 py312h06a4308_0
platformdirs 3.10.0 py312h06a4308_0
pluggy 1.5.0 py312h06a4308_0
pybind11-abi 5 hd3eb1b0_0
pycosat 0.6.6 py312h5eee18b_2
pycparser 2.21 pyhd3eb1b0_0
pysocks 1.7.1 py312h06a4308_0
python 3.12.8 h5148396_0
readline 8.2 h5eee18b_0
reproc 14.2.4 h6a678d5_2
reproc-cpp 14.2.4 h6a678d5_2
requests 2.32.3 py312h06a4308_1
ruamel.yaml 0.18.6 py312h5eee18b_0
ruamel.yaml.clib 0.2.8 py312h5eee18b_0
setuptools 75.1.0 py312h06a4308_0
simdjson 3.10.1 hdb19cb5_0
spdlog 1.11.0 hdb19cb5_0
sqlite 3.45.3 h5eee18b_0
tk 8.6.14 h39e8969_0
tqdm 4.66.5 py312he106c6f_0
truststore 0.10.0 py312h06a4308_0
tzdata 2025a h04d1e81_0
urllib3 2.3.0 py312h06a4308_0
wheel 0.44.0 py312h06a4308_0
xz 5.4.6 h5eee18b_1
yaml-cpp 0.8.0 h6a678d5_1
zlib 1.2.13 h5eee18b_1
zstandard 0.23.0 py312h2c38b39_1
zstd 1.5.6 hc292b87_0
Logs
Saving config: /home/sc-bclemot/.sleap/1.4.1/preferences.yaml
Restoring GUI state...
2025-01-29 10:09:24.117712: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:939] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2025-01-29 10:09:24.142243: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:939] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2025-01-29 10:09:24.143635: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:939] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
Software versions:
SLEAP: 1.4.1
TensorFlow: 2.7.0
Numpy: 1.21.6
Python: 3.7.12
OS: Linux-6.8.0-51-generic-x86_64-with-debian-trixie-sid
Happy SLEAPing! :)
Traceback (most recent call last):
File "/home/sc-bclemot/miniconda3/envs/sleap/lib/python3.7/site-packages/sleap/gui/learning/dialog.py", line 751, in run
items_for_inference=items_for_inference,
File "/home/sc-bclemot/miniconda3/envs/sleap/lib/python3.7/site-packages/sleap/gui/learning/runners.py", line 572, in run_learning_pipeline
keep_viz=keep_viz,
File "/home/sc-bclemot/miniconda3/envs/sleap/lib/python3.7/site-packages/sleap/gui/learning/runners.py", line 628, in run_gui_training
win = LossViewer(zmq_ports=zmq_ports)
File "/home/sc-bclemot/miniconda3/envs/sleap/lib/python3.7/site-packages/sleap/gui/widgets/monitor.py", line 622, in __init__
self._setup_zmq(zmq_context)
File "/home/sc-bclemot/miniconda3/envs/sleap/lib/python3.7/site-packages/sleap/gui/widgets/monitor.py", line 820, in _setup_zmq
self.sub.bind(publish_address)
File "/home/sc-bclemot/miniconda3/envs/sleap/lib/python3.7/site-packages/zmq/sugar/socket.py", line 232, in bind
super().bind(addr)
File "zmq/backend/cython/socket.pyx", line 568, in zmq.backend.cython.socket.Socket.bind
File "zmq/backend/cython/checkrc.pxd", line 28, in zmq.backend.cython.checkrc._check_rc
zmq.error.ZMQError: Address already in use
How to reproduce
- Open a SLEAP project with 2 videos, a skeleton with 4 nodes and 4 edges, and 100 frames selected via sampling and labelled.
- Go to 'Predict -> Run Training...'
- Select 'single animal' for Pipeline Type.
- Set Run Name Prefix to 'beetle_test1'
- Set Predict On to 'Random Frames'.
- Leave the rest of the setup unchanged, which means:
  - Sigma for Nodes: 2.50
  - Controller Port: 9000
  - Publish Port: 9001
  - Runs Folder: models
  - Only 'Best Model' and 'Visualize Predictions During Training' checked.
- Click on 'Run'
- See Error in terminal
Information update on the issue
When I start 'Run Training' multiple times in a row, the training eventually starts; this happens roughly 1 attempt out of 5 and is consistent even after restarting my computer.
However, when the training gets past the ZMQError and starts, it stops unexpectedly before completing the first epoch and asks me to look for an error in the terminal, which shows no error (see log below).
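In case it helps narrow this down, a small diagnostic sketch (assuming psutil is installed; it is not a SLEAP dependency) can list any leftover sleap-train processes that might still be holding the ZMQ ports between attempts:

```python
# Hypothetical diagnostic (assumes psutil is installed): list processes whose
# command line mentions sleap-train, which could still hold ports 9000/9001
# from a previous training attempt.
import psutil

for proc in psutil.process_iter(["pid", "cmdline"]):
    cmdline = " ".join(proc.info["cmdline"] or [])
    if "sleap-train" in cmdline:
        print(proc.info["pid"], cmdline)
```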
Logs
Resetting monitor window.
Polling: /home/sc-bclemot/Documents/SLEAP_projects/models/250129_121538.single_instance.n=100/viz/validation.*.png
Start training single_instance...
['sleap-train', '/tmp/tmp7og4myw5/250129_121539_training_job.json', '/home/sc-bclemot/Documents/SLEAP_projects/BeetleTrackT1.v001.slp', '--zmq', '--controller_port', '9000', '--publish_port', '9001', '--save_viz']
INFO:sleap.nn.training:Versions:
SLEAP: 1.4.1
TensorFlow: 2.7.0
Numpy: 1.21.6
Python: 3.7.12
OS: Linux-6.8.0-51-generic-x86_64-with-debian-trixie-sid
INFO:sleap.nn.training:Training labels file: /home/sc-bclemot/Documents/SLEAP_projects/BeetleTrackT1.v001.slp
INFO:sleap.nn.training:Training profile: /tmp/tmp7og4myw5/250129_121539_training_job.json
INFO:sleap.nn.training:
INFO:sleap.nn.training:Arguments:
INFO:sleap.nn.training:{
"training_job_path": "/tmp/tmp7og4myw5/250129_121539_training_job.json",
"labels_path": "/home/sc-bclemot/Documents/SLEAP_projects/BeetleTrackT1.v001.slp",
"video_paths": [
""
],
"val_labels": null,
"test_labels": null,
"base_checkpoint": null,
"tensorboard": false,
"save_viz": true,
"keep_viz": false,
"zmq": true,
"publish_port": 9001,
"controller_port": 9000,
"run_name": "",
"prefix": "",
"suffix": "",
"cpu": false,
"first_gpu": false,
"last_gpu": false,
"gpu": "auto"
}
INFO:sleap.nn.training:
INFO:sleap.nn.training:Training job:
INFO:sleap.nn.training:{
"data": {
"labels": {
"training_labels": null,
"validation_labels": null,
"validation_fraction": 0.1,
"test_labels": null,
"split_by_inds": false,
"training_inds": null,
"validation_inds": null,
"test_inds": null,
"search_path_hints": [],
"skeletons": []
},
"preprocessing": {
"ensure_rgb": false,
"ensure_grayscale": false,
"imagenet_mode": null,
"input_scaling": 5.0,
"pad_to_stride": null,
"resize_and_pad_to_target": true,
"target_height": null,
"target_width": null
},
"instance_cropping": {
"center_on_part": null,
"crop_size": null,
"crop_size_detection_padding": 16
}
},
"model": {
"backbone": {
"leap": null,
"unet": {
"stem_stride": null,
"max_stride": 16,
"output_stride": 2,
"filters": 16,
"filters_rate": 2.0,
"middle_block": true,
"up_interpolate": true,
"stacks": 1
},
"hourglass": null,
"resnet": null,
"pretrained_encoder": null
},
"heads": {
"single_instance": {
"part_names": null,
"sigma": 2.5,
"output_stride": 2,
"loss_weight": 1.0,
"offset_refinement": false
},
"centroid": null,
"centered_instance": null,
"multi_instance": null,
"multi_class_bottomup": null,
"multi_class_topdown": null
},
"base_checkpoint": null
},
"optimization": {
"preload_data": true,
"augmentation_config": {
"rotate": true,
"rotation_min_angle": -15.0,
"rotation_max_angle": 15.0,
"translate": false,
"translate_min": -5,
"translate_max": 5,
"scale": false,
"scale_min": 0.9,
"scale_max": 1.1,
"uniform_noise": false,
"uniform_noise_min_val": 0.0,
"uniform_noise_max_val": 10.0,
"gaussian_noise": false,
"gaussian_noise_mean": 5.0,
"gaussian_noise_stddev": 1.0,
"contrast": false,
"contrast_min_gamma": 0.5,
"contrast_max_gamma": 2.0,
"brightness": false,
"brightness_min_val": 0.0,
"brightness_max_val": 10.0,
"random_crop": false,
"random_crop_height": 256,
"random_crop_width": 256,
"random_flip": true,
"flip_horizontal": false
},
"online_shuffling": true,
"shuffle_buffer_size": 128,
"prefetch": true,
"batch_size": 4,
"batches_per_epoch": null,
"min_batches_per_epoch": 200,
"val_batches_per_epoch": null,
"min_val_batches_per_epoch": 10,
"epochs": 200,
"optimizer": "adam",
"initial_learning_rate": 0.0001,
"learning_rate_schedule": {
"reduce_on_plateau": true,
"reduction_factor": 0.5,
"plateau_min_delta": 1e-06,
"plateau_patience": 5,
"plateau_cooldown": 3,
"min_learning_rate": 1e-08
},
"hard_keypoint_mining": {
"online_mining": false,
"hard_to_easy_ratio": 2.0,
"min_hard_keypoints": 2,
"max_hard_keypoints": null,
"loss_scale": 5.0
},
"early_stopping": {
"stop_training_on_plateau": true,
"plateau_min_delta": 1e-08,
"plateau_patience": 10
}
},
"outputs": {
"save_outputs": true,
"run_name": "250129_121538.single_instance.n=100",
"run_name_prefix": "",
"run_name_suffix": "",
"runs_folder": "/home/sc-bclemot/Documents/SLEAP_projects/models",
"tags": [
""
],
"save_visualizations": true,
"keep_viz_images": false,
"zip_outputs": false,
"log_to_csv": true,
"checkpointing": {
"initial_model": false,
"best_model": true,
"every_epoch": false,
"latest_model": false,
"final_model": false
},
"tensorboard": {
"write_logs": false,
"loss_frequency": "epoch",
"architecture_graph": false,
"profile_graph": false,
"visualizations": true
},
"zmq": {
"subscribe_to_controller": true,
"controller_address": "tcp://127.0.0.1:9000",
"controller_polling_timeout": 10,
"publish_updates": true,
"publish_address": "tcp://127.0.0.1:9001"
}
},
"name": "",
"description": "",
"sleap_version": "1.4.1",
"filename": "/tmp/tmp7og4myw5/250129_121539_training_job.json"
}
INFO:sleap.nn.training:
2025-01-29 12:15:41.805758: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:939] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2025-01-29 12:15:41.827565: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:939] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2025-01-29 12:15:41.831268: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:939] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
INFO:sleap.nn.training:Auto-selected GPU 0 with 7801 MiB of free memory.
INFO:sleap.nn.training:Using GPU 0 for acceleration.
INFO:sleap.nn.training:Disabled GPU memory pre-allocation.
INFO:sleap.nn.training:System:
GPUs: 1/1 available
Device: /physical_device:GPU:0
Available: True
Initialized: False
Memory growth: True
INFO:sleap.nn.training:
INFO:sleap.nn.training:Initializing trainer...
INFO:sleap.nn.training:Loading training labels from: /home/sc-bclemot/Documents/SLEAP_projects/BeetleTrackT1.v001.slp
INFO:sleap.nn.training:Creating training and validation splits from validation fraction: 0.1
INFO:sleap.nn.training: Splits: Training = 90 / Validation = 10.
INFO:sleap.nn.training:Setting up for training...
INFO:sleap.nn.training:Setting up pipeline builders...
INFO:sleap.nn.training:Setting up model...
INFO:sleap.nn.training:Building test pipeline...
2025-01-29 12:15:43.035642: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-01-29 12:15:43.037239: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:939] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2025-01-29 12:15:43.041201: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:939] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2025-01-29 12:15:43.044792: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:939] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2025-01-29 12:15:43.525081: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:939] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2025-01-29 12:15:43.526854: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:939] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2025-01-29 12:15:43.528297: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:939] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2025-01-29 12:15:43.529937: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 5975 MB memory: -> device: 0, name: NVIDIA RTX 3000 Ada Generation Laptop GPU, pci bus id: 0000:01:00.0, compute capability: 8.9
INFO:sleap.nn.training:Loaded test example. [5.248s]
INFO:sleap.nn.training: Input shape: (10800, 19200, 3)
INFO:sleap.nn.training:Created Keras model.
INFO:sleap.nn.training: Backbone: UNet(stacks=1, filters=16, filters_rate=2.0, kernel_size=3, stem_kernel_size=7, convs_per_block=2, stem_blocks=0, down_blocks=4, middle_block=True, up_blocks=3, up_interpolate=True, block_contraction=False)
INFO:sleap.nn.training: Max stride: 16
INFO:sleap.nn.training: Parameters: 1,953,492
INFO:sleap.nn.training: Heads:
INFO:sleap.nn.training: [0] = SingleInstanceConfmapsHead(part_names=['center', 'left', 'right', 'top'], sigma=2.5, output_stride=2, loss_weight=1.0)
INFO:sleap.nn.training: Outputs:
INFO:sleap.nn.training: [0] = KerasTensor(type_spec=TensorSpec(shape=(None, 5400, 9600, 4), dtype=tf.float32, name=None), name='SingleInstanceConfmapsHead/BiasAdd:0', description="created by layer 'SingleInstanceConfmapsHead'")
INFO:sleap.nn.training:Training from scratch
INFO:sleap.nn.training:Setting up data pipelines...
INFO:sleap.nn.training:Training set: n = 90
INFO:sleap.nn.training:Validation set: n = 10
INFO:sleap.nn.training:Setting up optimization...
INFO:sleap.nn.training: Learning rate schedule: LearningRateScheduleConfig(reduce_on_plateau=True, reduction_factor=0.5, plateau_min_delta=1e-06, plateau_patience=5, plateau_cooldown=3, min_learning_rate=1e-08)
INFO:sleap.nn.training: Early stopping: EarlyStoppingConfig(stop_training_on_plateau=True, plateau_min_delta=1e-08, plateau_patience=10)
INFO:sleap.nn.training:Setting up outputs...
INFO:sleap.nn.callbacks:Training controller subscribed to: tcp://127.0.0.1:9000 (topic: )
INFO:sleap.nn.training: ZMQ controller subcribed to: tcp://127.0.0.1:9000
INFO:sleap.nn.callbacks:Progress reporter publishing on: tcp://127.0.0.1:9001 for: not_set
INFO:sleap.nn.training: ZMQ progress reporter publish on: tcp://127.0.0.1:9001
INFO:sleap.nn.training:Created run path: /home/sc-bclemot/Documents/SLEAP_projects/models/250129_121538.single_instance.n=100
INFO:sleap.nn.training:Setting up visualization...
INFO:sleap.nn.training:Finished trainer set up. [6.0s]
INFO:sleap.nn.training:Creating tf.data.Datasets for training data generation...
INFO:sleap.nn.training:Finished creating training datasets. [23.2s]
INFO:sleap.nn.training:Starting training loop...
Epoch 1/200
Run Path: /home/sc-bclemot/Documents/SLEAP_projects/models/250129_121538.single_instance.n=100
Additional info
CUDA and TensorFlow recognise my GPU (NVIDIA RTX 3000 Ada), and I ran trials to check my memory usage, which seemed fine. When training starts, nvidia-smi shows that my GPU is being used and memory usage only goes up to 5-10%, so I think SLEAP is using my GPU properly and I am not running into an OOM error.
Hi @bClemot-Sc !
Yes, we have a working PR to resolve this issue: #2064. Once it's merged, you can use the fix by installing SLEAP from source:
git clone https://github.com/talmolab/sleap && cd sleap
conda env create -f environment.yml -n sleap_dev

Ref: conda from source
Let us know if you have any questions!
Thanks,
Divya
Hi @gitttt-1234 !
Thank you so much for your quick answer and for looking into the problem.
- Does this issue also cover my second message related to the training abruptly stopping?
- How can I get notified when it has been merged?
Thanks, Bastien
Hi @bClemot-Sc!
I suspect this might be due to the receptive field size (your input size also seems very large: (10800, 19200, 3). Is this your source video resolution?). I'm linking a previous discussion here; could you try the workaround? You could also try a top-down model, in which case you would need to train 2 models (centroid and centered-instance).
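As a rough illustration of why the input size matters (this is just back-of-the-envelope arithmetic using the numbers from your training log, not SLEAP code):

```python
# Rough size estimate for a single confidence-map output, using the input
# shape (10800, 19200, 3), output_stride=2, and the 4 skeleton nodes
# reported in the training log above.
import numpy as np

input_h, input_w = 10800, 19200
output_stride = 2
n_nodes = 4

confmap_shape = (input_h // output_stride, input_w // output_stride, n_nodes)
confmap_bytes = np.prod(confmap_shape) * 4  # float32
print(confmap_shape)  # (5400, 9600, 4), matching the log
print(f"~{confmap_bytes / 1e9:.2f} GB per sample before batching")
```

With a batch size of 4, that is already several gigabytes just for the output tensor, which is why downscaling the input or cropping around instances with a top-down pipeline should help.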
To get notified, you can subscribe to this repository to monitor progress (this sends email notifications for all events in the repo). I can also update you here once the PR is merged.
Let us know if you have any questions!!