
Does the author consider releasing weights? I used your code to reproduce the results, but performance is poor.

githuboflk opened this issue 6 months ago • 5 comments

[Image]

githuboflk · Jul 02 '25

[Image]

The results on SD15.

githuboflk · Jul 02 '25


I am also trying to reproduce the performance in the paper, but my results are even worse.

MVSSNet's performance stayed around 0.2, and when I tried to train MaskCLIP the loss became NaN at epoch 8. Would you mind sharing your training parameters? I trained under the IMDLBenCo framework; is there anything I should change?

I would appreciate it if you could share your experience.

Jenna-Bai · Jul 06 '25

@Jenna-Bai It actually performs poorly when I reproduce it entirely with the author's code (IMDLBenCo). I rewrote part of the design, but it still cannot reach the performance reported in the paper. Looking forward to the author's reply.

githuboflk · Jul 07 '25

Hi @githuboflk and @Jenna-Bai,

Apologies for the delayed response; I've been occupied with my graduation work lately.

Thank you for raising this issue. I took the last couple of days to re-run the experiments on an AutoDL instance to investigate the reproduction difficulties.

A key piece of context: the original experiments were conducted on an H100 cluster that I no longer have access to, so the train.sh script in the repository may not be perfectly tuned for other hardware configurations. After some testing, I've confirmed that the most critical hyperparameters for successful reproduction are the learning rate, batch_size, and image_size.

I was able to achieve the reported performance with the following parameters. Please note that this configuration requires a significant amount of VRAM, and I ran it on a server with 8x 4090 GPUs.

torchrun  \
    --standalone    \
    --nnodes=1     \
    --nproc_per_node=8 \
./train.py \
    --exp_name MaskCLIP_sd15 \
    --model_setting_name 'ViTL' \
    --model MaskCLIP \
    --world_size 1 \
    --batch_size 8 \
    --data_path "nebula/OpenSDI_train" \
    --epochs 20 \
    --lr 5e-5 \
    --image_size 512 \
    --if_resizing \
    --min_lr 0 \
    --weight_decay 0.05 \
    --edge_mask_width 7 \
    --if_predict_label \
    --if_not_amp \
    --test_data_path "nebula/OpenSDI_test" \
    --warmup_epochs 0 \
    --output_dir "./output_dir" \
    --log_dir "./output_dir" \
    --accum_iter 1 \
    --seed 42 \
    --test_period 1
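
If --batch_size is per GPU and --accum_iter does standard gradient accumulation (which is my understanding of the IMDLBenCo training loop), this corresponds to an effective batch of 8 GPUs × 8 samples × 1 step = 64 per optimizer update. On fewer GPUs you could raise --accum_iter to keep that product constant; for example, on 4 GPUs something like the following should be roughly equivalent. I haven't re-run this exact configuration, so treat it as a sketch:

# 4 GPUs instead of 8: doubling --accum_iter keeps the effective batch at
# 4 x 8 x 2 = 64 samples per optimizer step; all other flags are unchanged.
torchrun  \
    --standalone    \
    --nnodes=1     \
    --nproc_per_node=4 \
./train.py \
    --exp_name MaskCLIP_sd15 \
    --model_setting_name 'ViTL' \
    --model MaskCLIP \
    --world_size 1 \
    --batch_size 8 \
    --data_path "nebula/OpenSDI_train" \
    --epochs 20 \
    --lr 5e-5 \
    --image_size 512 \
    --if_resizing \
    --min_lr 0 \
    --weight_decay 0.05 \
    --edge_mask_width 7 \
    --if_predict_label \
    --if_not_amp \
    --test_data_path "nebula/OpenSDI_test" \
    --warmup_epochs 0 \
    --output_dir "./output_dir" \
    --log_dir "./output_dir" \
    --accum_iter 2 \
    --seed 42 \
    --test_period 1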

While randomness across different machines can cause slight variations, when the combined loss drops below 0.25, the model's performance is generally on par with the results in the paper.

Regarding your question about the pretrained weights: at the request of my co-author, we are holding off on releasing them for the time being. We ask for your patience on this matter.

iamwangyabin · Jul 08 '25

@Jenna-Bai

Try the following commands for the other methods; you may need to change them a little:


#!/bin/bash

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --partition=swarm_a100
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:4
#SBATCH --time=60:00:00
#SBATCH --mem=640000

eval "$(conda shell.bash hook)"
conda init bash
conda activate test
module load gcc/14.2.0

base_dir="./output_dir"
mkdir -p ${base_dir}

#CUDA_VISIBLE_DEVICES=0,1,2,3 \
torchrun  \
    --standalone    \
    --nnodes=1     \
    --nproc_per_node=4 \
./train.py \
    --exp_name mvss_sd15 \
    --model MVSSNet \
    --world_size 1 \
    --batch_size 32 \
    --epochs 25 \
    --lr 2e-5 \
    --image_size 512 \
    --if_not_amp \
    --find_unused_parameters \
    --no_model_eval \
    --if_resizing \
    --min_lr 5e-7 \
    --weight_decay 0.05 \
    --edge_mask_width 7 \
    --warmup_epochs 1 \
    --output_dir ${base_dir}/ \
    --log_dir ${base_dir}/ \
    --accum_iter 1 \
    --seed 42 \
    --test_period 1 


base_dir="./output_dir"
torchrun  \
    --standalone    \
    --nnodes=1     \
    --nproc_per_node=8 \
./train.py \
    --exp_name Trufor_sd15 \
    --model Trufor \
    --world_size 1 \
    --batch_size 8 \
    --data_path "nebula/tmpdata" \
    --np_pretrain_weights "weights/noiseprint.pth" \
    --mit_b2_pretrain_weights "weights/mit_b2.pth" \
    --config_path "./configs/trufor.yaml" \
    --phase 3 \
    --det_resume_ckpt "/scratch/yw26g23/sam/output_dir/Trufor_sd15_20241107_00_29_48/Trufor_sd15_20241107_00_29_48.pth" \
    --epochs 10 \
    --lr 1e-5 \
    --if_predict_label \
    --if_not_amp \
    --find_unused_parameters \
    --image_size 512 \
    --if_resizing \
    --min_lr 0 \
    --weight_decay 0.05 \
    --edge_mask_width 7 \
    --test_data_path "nebula/tmpdata3" \
    --warmup_epochs 0 \
    --output_dir ${base_dir}/ \
    --log_dir ${base_dir}/ \
    --accum_iter 1 \
    --seed 42 \
    --test_period 1
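
One note on launching this: it's written as a SLURM batch script, but outside SLURM the #SBATCH lines are just comments, so you can also run it directly. Roughly (the filename is only a placeholder):

# On a SLURM cluster: submit the script as a batch job.
sbatch run_baselines.sh

# On a standalone multi-GPU machine: drop or adjust the conda and `module load`
# lines for your environment, optionally uncomment the CUDA_VISIBLE_DEVICES
# line to pin GPUs, and run the script with bash.
bash run_baselines.sh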

I'm not sure what's causing the NaN loss or the low MVSSNet performance (0.2) on your end, as there's too little information to diagnose. However, I can confirm that when I recently cloned this repository directly on AutoDL and ran train.sh, I didn't encounter these specific issues.
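
For anyone who wants to repeat that sanity check, it is roughly the following (adjust the repository URL, paths, and conda environment to your own setup):

# Fresh clone and a direct run of the bundled train.sh.
git clone https://github.com/iamwangyabin/OpenSDI.git   # or your own fork/mirror
cd OpenSDI
conda activate test        # the environment that has the project dependencies
bash train.sh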

iamwangyabin · Jul 08 '25