Data shape error in the training step for multi_easy_bin for long reads

Open jblakele opened this issue 5 months ago • 1 comments

Hi,

I am trying to use SemiBin2 to bin some assemblies from Nanopore sequencing, but I am encountering a data shape error at the training step after the coverage and split files have been produced. I am using SemiBin2 version 2.2.0 installed from bioconda. Assemblies were produced by metaFlye and bam files were produced using the version of minimap2 that comes with Flye. Any help would be appreciated.

Here are the steps and a screenshot of the log.

Concatenated assemblies from metaFlye:

    SemiBin2 concatenate_fasta -o CombinedAssembly.fasta -i BC33/BC33Assembly.fasta BC34/BC34Assembly.fasta BC35/BC35Assembly.fasta BC36/BC36Assembly.fasta

Mapped reads using minimap2:

    flye-minimap2 -L -t 10 -x map-ont -a CombinedAssembly.fasta/concatenated.fa ../DJ2060-P05-BC33.fastq.gz | samtools sort -o CombinedBC33.bam --write-index
    flye-minimap2 -L -t 10 -x map-ont -a CombinedAssembly.fasta/concatenated.fa ../DJ2060-P05-BC34.fastq.gz | samtools sort -o CombinedBC34.bam --write-index
    flye-minimap2 -L -t 10 -x map-ont -a CombinedAssembly.fasta/concatenated.fa ../DJ2060-P05-BC35.fastq.gz | samtools sort -o CombinedBC35.bam --write-index
    flye-minimap2 -L -t 10 -x map-ont -a CombinedAssembly.fasta/concatenated.fa ../DJ2060-P05-BC36.fastq.gz | samtools sort -o CombinedBC36.bam --write-index

multi_easy_bin:

    SemiBin2 multi_easy_bin --sequencing-type long_read -t 10 -i CombinedAssembly.fasta/concatenated.fa -o SemiBinMulti -b CombinedBC33.bam CombinedBC34.bam CombinedBC35.bam CombinedBC35.bam --verbose

Image

jblakele, Jul 23 '25 13:07

I faced the same error several times:

2025-09-07 00:32:23 vls15-slurm.compbio.ulaval.ca SemiBin2[3283485] INFO Training model and clustering for sample "SRR3187095_scaffolds"
2025-09-07 00:32:23 vls15-slurm.compbio.ulaval.ca SemiBin2[3283485] INFO Start training from a single sample.
2025-09-07 00:32:23 vls15-slurm.compbio.ulaval.ca SemiBin2[3283485] INFO Training model...
  0%|          | 0/15 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/guelou01/miniconda3/envs/SemiBin2/bin/SemiBin2", line 10, in <module>
    sys.exit(main2())
             ~~~~~^^
  File "/home/guelou01/miniconda3/envs/SemiBin2/lib/python3.13/site-packages/SemiBin/main.py", line 1635, in main2
    multi_easy_binning(
    ~~~~~~~~~~~~~~~~~~^
        logger,
        ^^^^^^^
        args,
        ^^^^^
        device)
        ^^^^^^^
  File "/home/guelou01/miniconda3/envs/SemiBin2/lib/python3.13/site-packages/SemiBin/main.py", line 1361, in multi_easy_binning
    training(logger, sample_fasta,
    ~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^
             [sample_data], [sample_data_split], sample_cannot,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ...<2 lines>...
             mode='single',
             ^^^^^^^^^^^^^^
             args=args)
             ^^^^^^^^^^
  File "/home/guelou01/miniconda3/envs/SemiBin2/lib/python3.13/site-packages/SemiBin/main.py", line 1144, in training
    model = train_self(logger,
                       data,
    ...<5 lines>...
                       args.num_process,
                       mode)
  File "/home/guelou01/miniconda3/envs/SemiBin2/lib/python3.13/site-packages/SemiBin/self_supervised_model.py", line 81, in train_self
    train_data_split = train_data_split / norm
                       ~~~~~~~~~~~~~~~~~^~~~~~
ValueError: operands could not be broadcast together with shapes (0,203) (339,) 
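
For reference, the final `ValueError` is NumPy's generic broadcasting failure rather than a SemiBin-specific message. The sketch below reproduces the same message with placeholder arrays of the reported shapes. The variable names mirror the traceback, but the arrays are made-up zeros and ones, not SemiBin's data, and `try_divide` is a hypothetical helper added only for illustration.

```python
import numpy as np

# Placeholder arrays with the shapes reported in the traceback.
# These are NOT SemiBin's real data, only stand-ins for illustration.
train_data_split = np.zeros((0, 203))  # zero rows, 203 columns
norm = np.ones(339)                    # 339 values


def try_divide(a, b):
    """Return a / b, or the error message if broadcasting fails."""
    try:
        return a / b
    except ValueError as exc:
        return str(exc)


message = try_divide(train_data_split, norm)
print(message)
```

Broadcasting compares trailing dimensions, so the mismatch between 203 and 339 is what raises the error. Separately, the leading 0 means the left-hand array has no rows at all, which may be worth checking for the affected sample.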

I checked that the sample contains scaffolds of at least 2500 bp, because I had hit this error before and assumed it came from samples with no scaffolds of sufficient length. From memory, that case usually yields a warning and an exit in single easy binning. This time I excluded the BAM files of all samples with no scaffold of sufficient length, but the error persists, as shown above, even though the sample has a sequence over 2500 bp. My concatenated.fasta file is a batch of 300 samples.

Is there something that I missed in the documentation, for example that I should have filtered the scaffolds by size before concatenating?

EDIT: I tried the idea above: I filtered all scaffold files at 2500 bp, concatenated them, aligned the reads, and started SemiBin2, but it crashed at the tenth sample.
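
One way to double-check the length filtering per sample is to count, for each sample prefix in the concatenated FASTA, how many contigs reach the threshold. The following is a rough sketch, not part of SemiBin2: `count_long_contigs` is a hypothetical helper, and it assumes headers of the form `>SAMPLE:CONTIG` with `:` as the separator, which should be verified against the actual file.

```python
import sys
from collections import Counter


def count_long_contigs(fasta_path, min_len=2500, separator=":"):
    """Count contigs of at least min_len bp for each sample prefix.

    Assumes headers look like '>SAMPLE<separator>CONTIG'; this is an
    assumption about the concatenated file, so adjust `separator` if needed.
    """
    lengths = {}   # contig name -> sequence length
    current = None
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                fields = line[1:].split()
                current = fields[0] if fields else ""
                lengths[current] = 0
            elif current is not None:
                lengths[current] += len(line)

    counts = Counter()
    for name, length in lengths.items():
        sample = name.split(separator, 1)[0]
        # Adding 0 still registers the sample, so empty samples show up.
        counts[sample] += int(length >= min_len)
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    for sample, n in sorted(count_long_contigs(sys.argv[1]).items()):
        print(f"{sample}\t{n}")
```

Samples reported with a count of 0 have no contig at or above the threshold. This only inspects the FASTA; it does not confirm how SemiBin2 itself filters or splits contigs.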

Louis-MG, Sep 07 '25 22:09