ValueError: num_samples should be a positive integer value, but got num_samples=0
Hi,
I am encountering an error when running SemiBin2.
I am using the multi-split approach on the CAMI2 airways dataset.
First I concatenate contigs using the script from VAMB:
concatenate.py catalogue.fna.gz path_to_contigs/
Then I run strobealign to obtain a BAM file for each sample:
strobealign catalogue.fna.gz path_to_reads/ | samtools sort -o path_to_bams/sample.bam
I then run SemiBin2 with multi_easy_bin, using the separator C:
SemiBin2 multi_easy_bin -i catalogue.fna -b path_to_bams/*.bam -o output_semibin2 --separator C
I then get the following error from SemiBin2. It happens during training, when the DataLoader is created:
2025-03-18 10:10:20 j-5191716-job-0 SemiBin[3443] INFO Setting number of CPUs to 192
2025-03-18 10:10:20 j-5191716-job-0 SemiBin[3443] INFO Binning for short_read
2025-03-18 10:10:20 j-5191716-job-0 SemiBin[3443] INFO SemiBin will run in self supervised mode
2025-03-18 10:10:22 j-5191716-job-0 SemiBin[3443] INFO Running with GPU.
2025-03-18 10:10:22 j-5191716-job-0 SemiBin[3443] INFO Performing multi-sample binning
2025-03-18 10:10:22 j-5191716-job-0 SemiBin[3443] INFO Generating training data...
2025-03-18 10:10:29 j-5191716-job-0 SemiBin[3443] INFO Calculating coverage for every sample.
2025-03-18 10:12:40 j-5191716-job-0 SemiBin[3443] INFO Processed: data/cami2/airways_short/10_sorted.bam
2025-03-18 10:12:40 j-5191716-job-0 SemiBin[3443] INFO Processed: data/cami2/airways_short/11_sorted.bam
2025-03-18 10:12:40 j-5191716-job-0 SemiBin[3443] INFO Processed: data/cami2/airways_short/12_sorted.bam
2025-03-18 10:12:40 j-5191716-job-0 SemiBin[3443] INFO Processed: data/cami2/airways_short/23_sorted.bam
2025-03-18 10:12:41 j-5191716-job-0 SemiBin[3443] INFO Processed: data/cami2/airways_short/26_sorted.bam
2025-03-18 10:12:41 j-5191716-job-0 SemiBin[3443] INFO Processed: data/cami2/airways_short/27_sorted.bam
2025-03-18 10:12:41 j-5191716-job-0 SemiBin[3443] INFO Processed: data/cami2/airways_short/4_sorted.bam
2025-03-18 10:12:41 j-5191716-job-0 SemiBin[3443] INFO Processed: data/cami2/airways_short/7_sorted.bam
2025-03-18 10:12:41 j-5191716-job-0 SemiBin[3443] INFO Processed: data/cami2/airways_short/8_sorted.bam
2025-03-18 10:12:41 j-5191716-job-0 SemiBin[3443] INFO Processed: data/cami2/airways_short/9_sorted.bam
2025-03-18 10:21:32 j-5191716-job-0 SemiBin[3443] INFO Training model and clustering for S1.
2025-03-18 10:21:32 j-5191716-job-0 SemiBin[3443] INFO Start training from a single sample.
2025-03-18 10:21:33 j-5191716-job-0 SemiBin[3443] INFO Training model...
0%| | 0/15 [00:00<?, ?it/s]/work/miniconda3/envs/SemiBint/lib/python3.12/site-packages/numpy/_core/fromnumeric.py:3860: RuntimeWarning: Mean of empty slice.
return _methods._mean(a, axis=axis, dtype=dtype,
/work/miniconda3/envs/SemiBint/lib/python3.12/site-packages/numpy/_core/_methods.py:147: RuntimeWarning: invalid value encountered in scalar divide
ret = ret / rcount
0%| | 0/15 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/work/miniconda3/envs/SemiBint/bin/SemiBin2", line 10, in <module>
sys.exit(main2())
^^^^^^^
File "/work/miniconda3/envs/SemiBint/lib/python3.12/site-packages/SemiBin/main.py", line 1610, in main2
multi_easy_binning(
File "/work/miniconda3/envs/SemiBint/lib/python3.12/site-packages/SemiBin/main.py", line 1349, in multi_easy_binning
training(logger, None, args.num_process,
File "/work/miniconda3/envs/SemiBint/lib/python3.12/site-packages/SemiBin/main.py", line 1126, in training
model = train_self(logger,
^^^^^^^^^^^^^^^^^^
File "/work/miniconda3/envs/SemiBint/lib/python3.12/site-packages/SemiBin/self_supervised_model.py", line 109, in train_self
train_loader = DataLoader(
^^^^^^^^^^^
File "/work/miniconda3/envs/SemiBint/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 383, in __init__
sampler = RandomSampler(dataset, generator=generator) # type: ignore[arg-type]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/work/miniconda3/envs/SemiBint/lib/python3.12/site-packages/torch/utils/data/sampler.py", line 165, in __init__
raise ValueError(
ValueError: num_samples should be a positive integer value, but got num_samples=0
This is my environment for SemiBin:
name: SemiBin
channels:
  - conda-forge
  - bioconda
  - defaults
dependencies:
  - python
  - pip
  - semibin=2.1.0
  - pip:
    - torch
    - torchvision
    - torchaudio
Do you have any idea on how to fix it?
Thanks a lot in advance.
Best Regards Anders
Is the separator C the same one that the concatenate.py script uses? It's important that they be consistent.
Thanks for the reply.
Yes, the separator C is the same one that concatenate.py uses.
The contig headers in catalogue.fna.gz are in this format:
<sample_name>C<original_contig_name>
S8CS23C153210
S8CS23C73612
Do the contig names also contain the separator? That is normally not good (and the internal SemiBin2 concatenate_fasta command would have checked for that and errored out).
Yes, the contig names also contain the separator C, both in catalogue.fna.gz and in the BAM files.
I understand that SemiBin2 concatenate_fasta uses the separator : to obtain the format <sample_name>:<original_contig_name>, and I assumed that passing --separator C would make it work for the format I have.
Yes, but it would also trigger an error, because the contig names should not contain the separator (actually, I just noticed that we don't check the sample names, but we should).
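To illustrate why a separator that also occurs in the names is a problem, here is a minimal sketch. It is not SemiBin2's actual parsing code, and `possible_splits` is a hypothetical helper introduced only for this example. It lists every way one of the headers quoted above can be cut into a sample part and a contig part:

```python
def possible_splits(header: str, separator: str) -> list[tuple[str, str]]:
    """Return every (sample, contig) pair obtained by cutting `header`
    at one occurrence of `separator`."""
    splits = []
    start = 0
    while True:
        idx = header.find(separator, start)
        if idx == -1:
            break
        splits.append((header[:idx], header[idx + len(separator):]))
        start = idx + 1
    return splits


header = "S8CS23C153210"  # header quoted earlier in this thread
for sample, contig in possible_splits(header, "C"):
    print(f"sample={sample!r} contig={contig!r}")
# sample='S8' contig='S23C153210'
# sample='S8CS23' contig='153210'
```

With more than one candidate split, a tool cannot tell from the header alone which sample a contig belongs to. A separator that appears exactly once per header gives a single, unambiguous split.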
Thanks for the reply. So if I input contigs in the format <sample_name><contig_name>, i.e. without a separator, I should not get the error?
Can you use
SemiBin2 concatenate_fasta ...
to create the concatenated file with a separator that does not occur in either the contig names or the sample names? If that still causes a problem, then it's a bona fide SemiBin2 bug.
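Before concatenating, a quick way to check a candidate separator is to look for it in all sample and contig names. The following is only a sketch, not the check that SemiBin2 itself performs. `names_containing` is a hypothetical helper, and the names are derived from the headers quoted in this thread, assuming S8 is the sample name:

```python
from typing import Iterable


def names_containing(separator: str, names: Iterable[str]) -> list[str]:
    """Return the names that contain `separator`.
    An empty list means the separator is safe to use."""
    return [name for name in names if separator in name]


# Illustrative names only (assumed from the headers quoted in this thread).
sample_names = ["S8", "S23"]
contig_names = ["S23C153210", "S23C73612"]

for candidate in ("C", ":"):
    clashes = names_containing(candidate, sample_names + contig_names)
    status = "safe" if not clashes else f"clashes with {len(clashes)} name(s)"
    print(f"separator {candidate!r}: {status}")
# separator 'C': clashes with 2 name(s)
# separator ':': safe
```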
Hi again, I tried SemiBin2 concatenate_fasta, and now it works, thanks!
I want to use the bins that are created before reclustering, so I use the flag as below:
SemiBin2 multi_easy_bin -i catalogue -b bam-files -o output --write-pre-reclustering-bins
However, in the output I only get the folder bins, which I assume contains the final bins after reclustering. Is there a way to obtain the bins prior to reclustering in multi-sample binning?
Check inside each sample, e.g., output/samples/*/output_prerecluster_bins
Thanks for the answer! I assume that the bins in output/samples/*/output_prerecluster_bins are specific to each sample. How do I obtain binning results for all samples? Is it simply the union of the bins across all samples?
Thanks!
Yes, exactly
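For anyone who wants to gather those per-sample bins programmatically, here is a minimal sketch. It assumes only the directory layout mentioned above (output/samples/*/output_prerecluster_bins). `collect_prerecluster_bins` is a hypothetical helper, not part of SemiBin2, and it makes no assumption about the file extension of the bins:

```python
from pathlib import Path


def collect_prerecluster_bins(output_dir: str) -> dict[str, list[Path]]:
    """Map each sample directory name to the files found in its
    output_prerecluster_bins folder."""
    bins_by_sample: dict[str, list[Path]] = {}
    pattern = "samples/*/output_prerecluster_bins"
    for bin_dir in sorted(Path(output_dir).glob(pattern)):
        sample = bin_dir.parent.name
        # Keep regular files only; the bin file extension is not assumed.
        bins_by_sample[sample] = sorted(p for p in bin_dir.iterdir() if p.is_file())
    return bins_by_sample
```

Calling `collect_prerecluster_bins("output")` returns one entry per sample, and the full pre-reclustering result is the union of all the listed files.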
For questions about SemiBin that are not bug reports, we prefer that you use the mailing list (https://groups.google.com/g/semibin-users), as the answers can also benefit other users.