'ZMQError: Address already in use' when running training
Bug description
When using 'Predict -> Run Training' in the GUI, the terminal shows ZMQError: Address already in use and the training shuts down.
Expected behaviour
I should be able to train on my videos, which are properly labelled with a properly defined skeleton (nodes and edges).
Actual behaviour
After labelling 100 frames from two videos, I went to 'Predict -> Run Training'.
I selected 'single animal' training, set a 'Run Name Prefix', and set 'Predict On' to 'random frames'.
Then I clicked 'Run', which led to the training windows closing and the terminal showing ZMQError: Address already in use (see log below).
My attempt at solving the issue
Since I had left 'Controller Port' and 'Publish Port' at their defaults of 9000 and 9001, I checked their status using sudo netstat -tulnp | grep -E ':9000|:9001'; neither was in use. I also checked the port currently used by SLEAP, which was 3643 and should not cause any conflict. I then tried other free ports in different configurations ('Controller Port'/'Publish Port' set to 25000/25001, then to 5000/5001), which led to the exact same error.
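As an additional sanity check, a minimal pyzmq sketch along these lines (pyzmq is the same library that raises the error; the ports are the dialog defaults, so adjust as needed) can confirm whether the two addresses are actually bindable outside of SLEAP:

```python
# Minimal bind test with pyzmq (the library raising the error above), using
# the default Controller/Publish ports 9000 and 9001 from the training dialog.
import zmq

ctx = zmq.Context()
for port in (9000, 9001):
    sock = ctx.socket(zmq.PUB)
    try:
        sock.bind(f"tcp://127.0.0.1:{port}")
        print(f"port {port}: bind OK")
    except zmq.ZMQError as e:
        print(f"port {port}: {e}")
    finally:
        sock.close()
ctx.term()
```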
Your personal set up
- OS: Ubuntu 24.04.1 LTS (Release: 24.04, Codename: noble)
- Version(s): SLEAP v1.4.1, Python 3.12.8, conda 25.1.0
- SLEAP installation method (listed here):
- [x] Conda from package
- [ ] Conda from source
- [ ] pip package
- [ ] Apple Silicon Macs
Environment packages
# Name Version Build Channel
_libgcc_mutex 0.1 main
_openmp_mutex 5.1 1_gnu
anaconda-anon-usage 0.5.0 py312hfc0e8ea_100
archspec 0.2.3 pyhd3eb1b0_0
boltons 24.1.0 py312h06a4308_0
brotli-python 1.0.9 py312h6a678d5_9
bzip2 1.0.8 h5eee18b_6
c-ares 1.19.1 h5eee18b_0
ca-certificates 2024.12.31 h06a4308_0
certifi 2024.12.14 py312h06a4308_0
cffi 1.17.1 py312h1fdaa30_1
charset-normalizer 3.3.2 pyhd3eb1b0_0
conda 25.1.0 py312h06a4308_0
conda-anaconda-telemetry 0.1.2 py312h06a4308_0
conda-content-trust 0.2.0 py312h06a4308_1
conda-libmamba-solver 25.1.1 pyhd3eb1b0_0
conda-package-handling 2.4.0 py312h06a4308_0
conda-package-streaming 0.11.0 py312h06a4308_0
cpp-expected 1.1.0 hdb19cb5_0
cryptography 43.0.3 py312h7825ff9_1
distro 1.9.0 py312h06a4308_0
expat 2.6.4 h6a678d5_0
fmt 9.1.0 hdb19cb5_1
frozendict 2.4.2 py312h06a4308_0
icu 73.1 h6a678d5_0
idna 3.7 py312h06a4308_0
jsonpatch 1.33 py312h06a4308_1
jsonpointer 2.1 pyhd3eb1b0_0
krb5 1.20.1 h143b758_1
ld_impl_linux-64 2.40 h12ee557_0
libarchive 3.7.7 hfab0078_0
libcurl 8.11.1 hc9e6f67_0
libedit 3.1.20230828 h5eee18b_0
libev 4.33 h7f8727e_1
libffi 3.4.4 h6a678d5_1
libgcc-ng 11.2.0 h1234567_1
libgomp 11.2.0 h1234567_1
libmamba 2.0.5 haf1ee3a_1
libmambapy 2.0.5 py312hdb19cb5_1
libnghttp2 1.57.0 h2d74bed_0
libsolv 0.7.30 he621ea3_1
libssh2 1.11.1 h251f7ec_0
libstdcxx-ng 11.2.0 h1234567_1
libuuid 1.41.5 h5eee18b_0
libxml2 2.13.5 hfdd30dd_0
lz4-c 1.9.4 h6a678d5_1
menuinst 2.2.0 py312h06a4308_0
ncurses 6.4 h6a678d5_0
nlohmann_json 3.11.2 h6a678d5_0
openssl 3.0.15 h5eee18b_0
packaging 24.2 py312h06a4308_0
pcre2 10.42 hebb0a14_1
pip 24.2 py312h06a4308_0
platformdirs 3.10.0 py312h06a4308_0
pluggy 1.5.0 py312h06a4308_0
pybind11-abi 5 hd3eb1b0_0
pycosat 0.6.6 py312h5eee18b_2
pycparser 2.21 pyhd3eb1b0_0
pysocks 1.7.1 py312h06a4308_0
python 3.12.8 h5148396_0
readline 8.2 h5eee18b_0
reproc 14.2.4 h6a678d5_2
reproc-cpp 14.2.4 h6a678d5_2
requests 2.32.3 py312h06a4308_1
ruamel.yaml 0.18.6 py312h5eee18b_0
ruamel.yaml.clib 0.2.8 py312h5eee18b_0
setuptools 75.1.0 py312h06a4308_0
simdjson 3.10.1 hdb19cb5_0
spdlog 1.11.0 hdb19cb5_0
sqlite 3.45.3 h5eee18b_0
tk 8.6.14 h39e8969_0
tqdm 4.66.5 py312he106c6f_0
truststore 0.10.0 py312h06a4308_0
tzdata 2025a h04d1e81_0
urllib3 2.3.0 py312h06a4308_0
wheel 0.44.0 py312h06a4308_0
xz 5.4.6 h5eee18b_1
yaml-cpp 0.8.0 h6a678d5_1
zlib 1.2.13 h5eee18b_1
zstandard 0.23.0 py312h2c38b39_1
zstd 1.5.6 hc292b87_0
Logs
Saving config: /home/sc-bclemot/.sleap/1.4.1/preferences.yaml
Restoring GUI state...
2025-01-29 10:09:24.117712: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:939] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2025-01-29 10:09:24.142243: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:939] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2025-01-29 10:09:24.143635: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:939] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
Software versions:
SLEAP: 1.4.1
TensorFlow: 2.7.0
Numpy: 1.21.6
Python: 3.7.12
OS: Linux-6.8.0-51-generic-x86_64-with-debian-trixie-sid
Happy SLEAPing! :)
Traceback (most recent call last):
File "/home/sc-bclemot/miniconda3/envs/sleap/lib/python3.7/site-packages/sleap/gui/learning/dialog.py", line 751, in run
items_for_inference=items_for_inference,
File "/home/sc-bclemot/miniconda3/envs/sleap/lib/python3.7/site-packages/sleap/gui/learning/runners.py", line 572, in run_learning_pipeline
keep_viz=keep_viz,
File "/home/sc-bclemot/miniconda3/envs/sleap/lib/python3.7/site-packages/sleap/gui/learning/runners.py", line 628, in run_gui_training
win = LossViewer(zmq_ports=zmq_ports)
File "/home/sc-bclemot/miniconda3/envs/sleap/lib/python3.7/site-packages/sleap/gui/widgets/monitor.py", line 622, in __init__
self._setup_zmq(zmq_context)
File "/home/sc-bclemot/miniconda3/envs/sleap/lib/python3.7/site-packages/sleap/gui/widgets/monitor.py", line 820, in _setup_zmq
self.sub.bind(publish_address)
File "/home/sc-bclemot/miniconda3/envs/sleap/lib/python3.7/site-packages/zmq/sugar/socket.py", line 232, in bind
super().bind(addr)
File "zmq/backend/cython/socket.pyx", line 568, in zmq.backend.cython.socket.Socket.bind
File "zmq/backend/cython/checkrc.pxd", line 28, in zmq.backend.cython.checkrc._check_rc
zmq.error.ZMQError: Address already in use
How to reproduce
- Open a SLEAP project with 2 videos, a skeleton with 4 nodes and 4 edges, and 100 frames selected via sampling and labelled.
- Go to 'Predict -> Run Training...'
- Select 'single animal' for Pipeline Type.
- Set Run Name Prefix to 'beetle_test1'
- Set Predict On to 'Random Frames'.
- Leave the rest of the setup unchanged, which means:
  - Sigma for Nodes: 2.50
  - Controller Port: 9000
  - Publish Port: 9001
  - Runs Folder: models
  - Only 'Best Model' and 'Visualize Predictions During Training' checked.
- Click on 'Run'
- See Error in terminal
Information update on the issue
When I start 'Run Training' multiple times in a row, the training eventually starts; this happens roughly 1 attempt out of 5 and is consistent even after restarting my computer.
However, when the training gets past the ZMQError and starts, it stops unexpectedly before completing the first epoch and asks me to look for an error in the terminal, which shows no error (see log below).
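In case it helps narrow this down, a small diagnostic sketch (assuming psutil is installed; it is not a SLEAP dependency) can list any leftover sleap-train processes that might still be holding the ZMQ ports between attempts:

```python
# Hypothetical diagnostic (assumes psutil is installed): list processes whose
# command line mentions sleap-train, which could still hold ports 9000/9001
# from a previous training attempt.
import psutil

for proc in psutil.process_iter(["pid", "cmdline"]):
    cmdline = " ".join(proc.info["cmdline"] or [])
    if "sleap-train" in cmdline:
        print(proc.info["pid"], cmdline)
```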
Logs
Resetting monitor window.
Polling: /home/sc-bclemot/Documents/SLEAP_projects/models/250129_121538.single_instance.n=100/viz/validation.*.png
Start training single_instance...
['sleap-train', '/tmp/tmp7og4myw5/250129_121539_training_job.json', '/home/sc-bclemot/Documents/SLEAP_projects/BeetleTrackT1.v001.slp', '--zmq', '--controller_port', '9000', '--publish_port', '9001', '--save_viz']
INFO:sleap.nn.training:Versions:
SLEAP: 1.4.1
TensorFlow: 2.7.0
Numpy: 1.21.6
Python: 3.7.12
OS: Linux-6.8.0-51-generic-x86_64-with-debian-trixie-sid
INFO:sleap.nn.training:Training labels file: /home/sc-bclemot/Documents/SLEAP_projects/BeetleTrackT1.v001.slp
INFO:sleap.nn.training:Training profile: /tmp/tmp7og4myw5/250129_121539_training_job.json
INFO:sleap.nn.training:
INFO:sleap.nn.training:Arguments:
INFO:sleap.nn.training:{
"training_job_path": "/tmp/tmp7og4myw5/250129_121539_training_job.json",
"labels_path": "/home/sc-bclemot/Documents/SLEAP_projects/BeetleTrackT1.v001.slp",
"video_paths": [
""
],
"val_labels": null,
"test_labels": null,
"base_checkpoint": null,
"tensorboard": false,
"save_viz": true,
"keep_viz": false,
"zmq": true,
"publish_port": 9001,
"controller_port": 9000,
"run_name": "",
"prefix": "",
"suffix": "",
"cpu": false,
"first_gpu": false,
"last_gpu": false,
"gpu": "auto"
}
INFO:sleap.nn.training:
INFO:sleap.nn.training:Training job:
INFO:sleap.nn.training:{
"data": {
"labels": {
"training_labels": null,
"validation_labels": null,
"validation_fraction": 0.1,
"test_labels": null,
"split_by_inds": false,
"training_inds": null,
"validation_inds": null,
"test_inds": null,
"search_path_hints": [],
"skeletons": []
},
"preprocessing": {
"ensure_rgb": false,
"ensure_grayscale": false,
"imagenet_mode": null,
"input_scaling": 5.0,
"pad_to_stride": null,
"resize_and_pad_to_target": true,
"target_height": null,
"target_width": null
},
"instance_cropping": {
"center_on_part": null,
"crop_size": null,
"crop_size_detection_padding": 16
}
},
"model": {
"backbone": {
"leap": null,
"unet": {
"stem_stride": null,
"max_stride": 16,
"output_stride": 2,
"filters": 16,
"filters_rate": 2.0,
"middle_block": true,
"up_interpolate": true,
"stacks": 1
},
"hourglass": null,
"resnet": null,
"pretrained_encoder": null
},
"heads": {
"single_instance": {
"part_names": null,
"sigma": 2.5,
"output_stride": 2,
"loss_weight": 1.0,
"offset_refinement": false
},
"centroid": null,
"centered_instance": null,
"multi_instance": null,
"multi_class_bottomup": null,
"multi_class_topdown": null
},
"base_checkpoint": null
},
"optimization": {
"preload_data": true,
"augmentation_config": {
"rotate": true,
"rotation_min_angle": -15.0,
"rotation_max_angle": 15.0,
"translate": false,
"translate_min": -5,
"translate_max": 5,
"scale": false,
"scale_min": 0.9,
"scale_max": 1.1,
"uniform_noise": false,
"uniform_noise_min_val": 0.0,
"uniform_noise_max_val": 10.0,
"gaussian_noise": false,
"gaussian_noise_mean": 5.0,
"gaussian_noise_stddev": 1.0,
"contrast": false,
"contrast_min_gamma": 0.5,
"contrast_max_gamma": 2.0,
"brightness": false,
"brightness_min_val": 0.0,
"brightness_max_val": 10.0,
"random_crop": false,
"random_crop_height": 256,
"random_crop_width": 256,
"random_flip": true,
"flip_horizontal": false
},
"online_shuffling": true,
"shuffle_buffer_size": 128,
"prefetch": true,
"batch_size": 4,
"batches_per_epoch": null,
"min_batches_per_epoch": 200,
"val_batches_per_epoch": null,
"min_val_batches_per_epoch": 10,
"epochs": 200,
"optimizer": "adam",
"initial_learning_rate": 0.0001,
"learning_rate_schedule": {
"reduce_on_plateau": true,
"reduction_factor": 0.5,
"plateau_min_delta": 1e-06,
"plateau_patience": 5,
"plateau_cooldown": 3,
"min_learning_rate": 1e-08
},
"hard_keypoint_mining": {
"online_mining": false,
"hard_to_easy_ratio": 2.0,
"min_hard_keypoints": 2,
"max_hard_keypoints": null,
"loss_scale": 5.0
},
"early_stopping": {
"stop_training_on_plateau": true,
"plateau_min_delta": 1e-08,
"plateau_patience": 10
}
},
"outputs": {
"save_outputs": true,
"run_name": "250129_121538.single_instance.n=100",
"run_name_prefix": "",
"run_name_suffix": "",
"runs_folder": "/home/sc-bclemot/Documents/SLEAP_projects/models",
"tags": [
""
],
"save_visualizations": true,
"keep_viz_images": false,
"zip_outputs": false,
"log_to_csv": true,
"checkpointing": {
"initial_model": false,
"best_model": true,
"every_epoch": false,
"latest_model": false,
"final_model": false
},
"tensorboard": {
"write_logs": false,
"loss_frequency": "epoch",
"architecture_graph": false,
"profile_graph": false,
"visualizations": true
},
"zmq": {
"subscribe_to_controller": true,
"controller_address": "tcp://127.0.0.1:9000",
"controller_polling_timeout": 10,
"publish_updates": true,
"publish_address": "tcp://127.0.0.1:9001"
}
},
"name": "",
"description": "",
"sleap_version": "1.4.1",
"filename": "/tmp/tmp7og4myw5/250129_121539_training_job.json"
}
INFO:sleap.nn.training:
2025-01-29 12:15:41.805758: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:939] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2025-01-29 12:15:41.827565: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:939] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2025-01-29 12:15:41.831268: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:939] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
INFO:sleap.nn.training:Auto-selected GPU 0 with 7801 MiB of free memory.
INFO:sleap.nn.training:Using GPU 0 for acceleration.
INFO:sleap.nn.training:Disabled GPU memory pre-allocation.
INFO:sleap.nn.training:System:
GPUs: 1/1 available
Device: /physical_device:GPU:0
Available: True
Initialized: False
Memory growth: True
INFO:sleap.nn.training:
INFO:sleap.nn.training:Initializing trainer...
INFO:sleap.nn.training:Loading training labels from: /home/sc-bclemot/Documents/SLEAP_projects/BeetleTrackT1.v001.slp
INFO:sleap.nn.training:Creating training and validation splits from validation fraction: 0.1
INFO:sleap.nn.training: Splits: Training = 90 / Validation = 10.
INFO:sleap.nn.training:Setting up for training...
INFO:sleap.nn.training:Setting up pipeline builders...
INFO:sleap.nn.training:Setting up model...
INFO:sleap.nn.training:Building test pipeline...
2025-01-29 12:15:43.035642: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-01-29 12:15:43.037239: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:939] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2025-01-29 12:15:43.041201: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:939] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2025-01-29 12:15:43.044792: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:939] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2025-01-29 12:15:43.525081: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:939] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2025-01-29 12:15:43.526854: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:939] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2025-01-29 12:15:43.528297: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:939] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2025-01-29 12:15:43.529937: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 5975 MB memory: -> device: 0, name: NVIDIA RTX 3000 Ada Generation Laptop GPU, pci bus id: 0000:01:00.0, compute capability: 8.9
INFO:sleap.nn.training:Loaded test example. [5.248s]
INFO:sleap.nn.training: Input shape: (10800, 19200, 3)
INFO:sleap.nn.training:Created Keras model.
INFO:sleap.nn.training: Backbone: UNet(stacks=1, filters=16, filters_rate=2.0, kernel_size=3, stem_kernel_size=7, convs_per_block=2, stem_blocks=0, down_blocks=4, middle_block=True, up_blocks=3, up_interpolate=True, block_contraction=False)
INFO:sleap.nn.training: Max stride: 16
INFO:sleap.nn.training: Parameters: 1,953,492
INFO:sleap.nn.training: Heads:
INFO:sleap.nn.training: [0] = SingleInstanceConfmapsHead(part_names=['center', 'left', 'right', 'top'], sigma=2.5, output_stride=2, loss_weight=1.0)
INFO:sleap.nn.training: Outputs:
INFO:sleap.nn.training: [0] = KerasTensor(type_spec=TensorSpec(shape=(None, 5400, 9600, 4), dtype=tf.float32, name=None), name='SingleInstanceConfmapsHead/BiasAdd:0', description="created by layer 'SingleInstanceConfmapsHead'")
INFO:sleap.nn.training:Training from scratch
INFO:sleap.nn.training:Setting up data pipelines...
INFO:sleap.nn.training:Training set: n = 90
INFO:sleap.nn.training:Validation set: n = 10
INFO:sleap.nn.training:Setting up optimization...
INFO:sleap.nn.training: Learning rate schedule: LearningRateScheduleConfig(reduce_on_plateau=True, reduction_factor=0.5, plateau_min_delta=1e-06, plateau_patience=5, plateau_cooldown=3, min_learning_rate=1e-08)
INFO:sleap.nn.training: Early stopping: EarlyStoppingConfig(stop_training_on_plateau=True, plateau_min_delta=1e-08, plateau_patience=10)
INFO:sleap.nn.training:Setting up outputs...
INFO:sleap.nn.callbacks:Training controller subscribed to: tcp://127.0.0.1:9000 (topic: )
INFO:sleap.nn.training: ZMQ controller subcribed to: tcp://127.0.0.1:9000
INFO:sleap.nn.callbacks:Progress reporter publishing on: tcp://127.0.0.1:9001 for: not_set
INFO:sleap.nn.training: ZMQ progress reporter publish on: tcp://127.0.0.1:9001
INFO:sleap.nn.training:Created run path: /home/sc-bclemot/Documents/SLEAP_projects/models/250129_121538.single_instance.n=100
INFO:sleap.nn.training:Setting up visualization...
INFO:sleap.nn.training:Finished trainer set up. [6.0s]
INFO:sleap.nn.training:Creating tf.data.Datasets for training data generation...
INFO:sleap.nn.training:Finished creating training datasets. [23.2s]
INFO:sleap.nn.training:Starting training loop...
Epoch 1/200
Run Path: /home/sc-bclemot/Documents/SLEAP_projects/models/250129_121538.single_instance.n=100
Additional info
CUDA and TensorFlow recognise my GPU (NVIDIA RTX 3000 Ada), and I ran trials to check my memory usage, which seemed fine. When training starts, nvidia-smi shows that my GPU is being used and memory usage only goes up to 5-10%, so I think SLEAP is using my GPU properly and I am not running into an OOM error.
Hi @bClemot-Sc !
Yes, we have a working PR to resolve this issue: #2064. Once it's merged, you can use the fix by installing SLEAP from source:
git clone https://github.com/talmolab/sleap && cd sleap
conda env create -f environment.yml -n sleap_dev

Ref: conda from source
Let us know if you have any questions!
Thanks,
Divya
Hi @gitttt-1234 !
Thank you so much for your quick answer and for looking into the problem.
- Does this issue also cover my second message related to the training abruptly stopping?
- How can I get notified when it has been merged?
Thanks, Bastien
Hi @bClemot-Sc!
I suspect this might be due to the receptive field size (your input size also seems very large: (10800, 19200, 3). Is this your source video resolution?). I'm linking a previous discussion here; could you try the workaround? You could also try a top-down model, in which case you would need to train 2 models (centroid and centered-instance).
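As a rough illustration of why the input size matters (this is just back-of-the-envelope arithmetic using the numbers from your training log, not SLEAP code):

```python
# Rough size estimate for a single confidence-map output, using the input
# shape (10800, 19200, 3), output_stride=2, and the 4 skeleton nodes
# reported in the training log above.
import numpy as np

input_h, input_w = 10800, 19200
output_stride = 2
n_nodes = 4

confmap_shape = (input_h // output_stride, input_w // output_stride, n_nodes)
confmap_bytes = np.prod(confmap_shape) * 4  # float32
print(confmap_shape)  # (5400, 9600, 4), matching the log
print(f"~{confmap_bytes / 1e9:.2f} GB per sample before batching")
```

With a batch size of 4, that is already several gigabytes just for the output tensor, which is why downscaling the input or cropping around instances with a top-down pipeline should help.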
To get notified, you can subscribe to this repository to monitor progress (this sends email notifications for all events in the repo). I can also update you here once the PR is merged.
Let us know if you have any questions!!