SemiBin ValueError: num_samples should be a positive integer value, but got num

Hi,

I am encountering an error when running SemiBin2.

I am using the multi-split approach on the CAMI2 airways dataset.

First I concatenate contigs using the script from VAMB:

concatenate.py catalogue.fna.gz path_to_contigs/

Then I run strobealign to obtain a BAM file for each sample:

strobealign catalogue.fna.gz path_to_reads/ | samtools sort -o path_to_bams/sample.bam

I then run SemiBin2 with multi_easy_bin, using the separator C:

SemiBin2 multi_easy_bin -i catalogue.fna -b path_to_bams/*.bam -o output_semibin2 --separator C

I then get the following error from SemiBin2. It happens during training, when the DataLoader is created.

2025-03-18 10:10:20 j-5191716-job-0 SemiBin[3443] INFO Setting number of CPUs to 192
2025-03-18 10:10:20 j-5191716-job-0 SemiBin[3443] INFO Binning for short_read
2025-03-18 10:10:20 j-5191716-job-0 SemiBin[3443] INFO SemiBin will run in self supervised mode
2025-03-18 10:10:22 j-5191716-job-0 SemiBin[3443] INFO Running with GPU.
2025-03-18 10:10:22 j-5191716-job-0 SemiBin[3443] INFO Performing multi-sample binning
2025-03-18 10:10:22 j-5191716-job-0 SemiBin[3443] INFO Generating training data...
2025-03-18 10:10:29 j-5191716-job-0 SemiBin[3443] INFO Calculating coverage for every sample.
2025-03-18 10:12:40 j-5191716-job-0 SemiBin[3443] INFO Processed: data/cami2/airways_short/10_sorted.bam
2025-03-18 10:12:40 j-5191716-job-0 SemiBin[3443] INFO Processed: data/cami2/airways_short/11_sorted.bam
2025-03-18 10:12:40 j-5191716-job-0 SemiBin[3443] INFO Processed: data/cami2/airways_short/12_sorted.bam
2025-03-18 10:12:40 j-5191716-job-0 SemiBin[3443] INFO Processed: data/cami2/airways_short/23_sorted.bam
2025-03-18 10:12:41 j-5191716-job-0 SemiBin[3443] INFO Processed: data/cami2/airways_short/26_sorted.bam
2025-03-18 10:12:41 j-5191716-job-0 SemiBin[3443] INFO Processed: data/cami2/airways_short/27_sorted.bam
2025-03-18 10:12:41 j-5191716-job-0 SemiBin[3443] INFO Processed: data/cami2/airways_short/4_sorted.bam
2025-03-18 10:12:41 j-5191716-job-0 SemiBin[3443] INFO Processed: data/cami2/airways_short/7_sorted.bam
2025-03-18 10:12:41 j-5191716-job-0 SemiBin[3443] INFO Processed: data/cami2/airways_short/8_sorted.bam
2025-03-18 10:12:41 j-5191716-job-0 SemiBin[3443] INFO Processed: data/cami2/airways_short/9_sorted.bam
2025-03-18 10:21:32 j-5191716-job-0 SemiBin[3443] INFO Training model and clustering for S1.
2025-03-18 10:21:32 j-5191716-job-0 SemiBin[3443] INFO Start training from a single sample.
2025-03-18 10:21:33 j-5191716-job-0 SemiBin[3443] INFO Training model...
  0%|                                                                                                                                                     | 0/15 [00:00<?, ?it/s]/work/miniconda3/envs/SemiBint/lib/python3.12/site-packages/numpy/_core/fromnumeric.py:3860: RuntimeWarning: Mean of empty slice.
  return _methods._mean(a, axis=axis, dtype=dtype,
/work/miniconda3/envs/SemiBint/lib/python3.12/site-packages/numpy/_core/_methods.py:147: RuntimeWarning: invalid value encountered in scalar divide
  ret = ret / rcount
  0%|                                                                                                                                                     | 0/15 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/work/miniconda3/envs/SemiBint/bin/SemiBin2", line 10, in <module>
    sys.exit(main2())
             ^^^^^^^
  File "/work/miniconda3/envs/SemiBint/lib/python3.12/site-packages/SemiBin/main.py", line 1610, in main2
    multi_easy_binning(
  File "/work/miniconda3/envs/SemiBint/lib/python3.12/site-packages/SemiBin/main.py", line 1349, in multi_easy_binning
    training(logger, None, args.num_process,
  File "/work/miniconda3/envs/SemiBint/lib/python3.12/site-packages/SemiBin/main.py", line 1126, in training
    model = train_self(logger,
            ^^^^^^^^^^^^^^^^^^
  File "/work/miniconda3/envs/SemiBint/lib/python3.12/site-packages/SemiBin/self_supervised_model.py", line 109, in train_self
    train_loader = DataLoader(
                   ^^^^^^^^^^^
  File "/work/miniconda3/envs/SemiBint/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 383, in __init__
    sampler = RandomSampler(dataset, generator=generator)  # type: ignore[arg-type]
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/work/miniconda3/envs/SemiBint/lib/python3.12/site-packages/torch/utils/data/sampler.py", line 165, in __init__
    raise ValueError(
ValueError: num_samples should be a positive integer value, but got num_samples=0

This is my environment for SemiBin:

name: SemiBin
channels:
  - conda-forge
  - bioconda
  - defaults
dependencies:
  - python
  - pip
  - semibin=2.1.0
  - pip:
      - torch
      - torchvision
      - torchaudio

Do you have any idea on how to fix it?

Thanks a lot in advance.

Best Regards Anders

Mar 18 '25 11:03 ah140797

Is the separator C the same one that the concatenate.py script uses? It's important that they be consistent.

Mar 19 '25 05:03 luispedro

Thanks for the reply.

Yes, the separator C is the same as the one used in concatenate.py.

The contig headers in catalogue.fna.gz are in this format:

<sample_name>C<original_contig_name>, for example S8CS23C153210 and S8CS23C73612

Mar 19 '25 07:03 ah140797

The contig names also contain the separator? That is normally not good (and the internal SemiBin2 concatenate_fasta command would have checked for that and errored out).
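
To illustrate why a separator that also occurs inside the names is a problem, here is a minimal sketch in plain Python. It is not SemiBin2's actual parsing code; `split_header` is a hypothetical helper that only shows the ambiguity, using a header quoted earlier in this thread.

```python
def split_header(header: str, separator: str) -> tuple[str, str]:
    """Split a concatenated contig header into (sample, contig).

    Hypothetical helper, not SemiBin2's code: it rejects headers in which
    the separator does not occur exactly once, because the split point
    would otherwise be missing or ambiguous.
    """
    n = header.count(separator)
    if n != 1:
        raise ValueError(
            f"{header!r} contains {n} copies of separator {separator!r}; "
            "expected exactly 1"
        )
    sample, contig = header.split(separator)
    return sample, contig


# Header from this thread: 'C' occurs twice, so several splits are possible.
header = "S8CS23C153210"
print(header.split("C"))      # ['S8', 'S23', '153210']
print(header.split("C", 1))   # ['S8', 'S23C153210']
print(header.rsplit("C", 1))  # ['S8CS23', '153210']

# With a separator that is absent from both names, the split is unique.
print(split_header("S8:S23C153210", ":"))  # ('S8', 'S23C153210')
```

Which of the three splits a tool picks depends on its implementation, so a separator that never appears in either name avoids the question entirely.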

Mar 19 '25 11:03 luispedro

Yes, the contig names also contain the separator C, both in catalogue.fna.gz and in the BAM files.

I understand that SemiBin2 concatenate_fasta uses the separator : to obtain the format <sample_name>:<original_contig_name>, and that passing --separator C would make it work for the format I have.

Mar 20 '25 08:03 ah140797

Yes, but it would also trigger an error, because the contig names should not contain the separator (actually, I just noticed that we don't check the sample names, but we should).

Mar 20 '25 22:03 luispedro

Thanks for the reply. So if I input contigs in the format <sample_name><contig_name>, i.e. without a separator, then I should not get the error?

Mar 21 '25 07:03 ah140797

Can you use

SemiBin2 concatenate_fasta ...

to create the concatenated file with a separator that occurs in neither the contig names nor the sample names? If that still causes a problem, then it's a bona fide SemiBin2 bug.
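
For anyone who wants to check their names before concatenating, here is a rough sketch of how a collision-free separator could be picked. Both function names and the candidate list are assumptions made for illustration; none of this is part of SemiBin2.

```python
def contig_names_from_fasta(lines):
    """Yield the name (first word after '>') of each FASTA record."""
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:  # skip malformed headers that are just '>'
                yield fields[0]


def find_safe_separator(sample_names, contig_names, candidates=(":", "|", "@")):
    """Return the first candidate that occurs in no sample or contig name.

    The default candidates are an arbitrary illustrative choice.
    """
    names = list(sample_names) + list(contig_names)
    for sep in candidates:
        if not any(sep in name for name in names):
            return sep
    raise ValueError("every candidate separator occurs in at least one name")


# Names in the style of this thread: 'C' collides, ':' does not.
fasta_lines = [">S23C153210\n", "ACGT\n", ">S23C73612\n", "TTGA\n"]
contigs = list(contig_names_from_fasta(fasta_lines))
samples = ["S8", "S9"]
print(find_safe_separator(samples, contigs, candidates=("C", ":", "|")))  # :
```

The check covers sample names as well as contig names, since either can make the split ambiguous.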

Mar 21 '25 07:03 luispedro

Hi again, I tried SemiBin2 concatenate_fasta, and now it works, thanks!

I want to use the bins that are created before reclustering, so I use the flag as below:

SemiBin2 multi_easy_bin -i catalogue -b bam-files -o output --write-pre-reclustering-bins

However, in the output I only get the folder bins, which I assume contains the final bins after reclustering. Is there a way to obtain the bins prior to reclustering in multi-sample binning?

Apr 14 '25 07:04 ah140797

Check inside each sample, e.g., output/samples/*/output_prerecluster_bins

Apr 21 '25 03:04 luispedro

Thanks for the answer! I assume that the bins in output/samples/*/output_prerecluster_bins are specific to each sample. How do I obtain binning results for all samples? Are they simply the bins from all samples taken together?

Thanks!

Apr 23 '25 11:04 ah140797

Yes, exactly
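
Gathering the per-sample folders could be sketched roughly as follows. The directory pattern is the one given above; `collect_prerecluster_bins` is a hypothetical helper, and it makes no assumption about how the bin files themselves are named.

```python
from pathlib import Path


def collect_prerecluster_bins(output_dir):
    """Map each sample name to the sorted files in its pre-reclustering folder.

    Looks for <output_dir>/samples/<sample>/output_prerecluster_bins and
    lists every regular file found there.
    """
    bins = {}
    pattern = "samples/*/output_prerecluster_bins"
    for bin_dir in sorted(Path(output_dir).glob(pattern)):
        if bin_dir.is_dir():
            sample = bin_dir.parent.name
            bins[sample] = sorted(p for p in bin_dir.iterdir() if p.is_file())
    return bins
```

Calling `collect_prerecluster_bins("output")` would return a dictionary keyed by sample name, and the full set of pre-reclustering bins is then every file across all of its values.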

For questions about SemiBin that are not bug reports, we prefer that you use the mailing list (https://groups.google.com/g/semibin-users), as it can also benefit other users.

Apr 23 '25 23:04 luispedro