
Inference result quality on trained model is not good

Open hayoung-jeremy opened this issue 10 months ago • 7 comments

information

device info

  • tested on Runpod GPU instance
  • A100 SXM 80GB x3, 96 vCPU 750 GB RAM
  • runpod/pytorch:2.2.0-py3.10-cuda12.1.1-devel-ubuntu22.04

data preparation

  • prepared 100 high-quality glb files
  • successfully generated the dataset (rgba, pose, and intrinsics.npy) using blender_script.py
  • split them into training (80%) and evaluation (20%) sets, and configured train_uids.json and val_uids.json accordingly
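
As a sanity check, I verify that every view folder contains the expected outputs with a small script like the one below. It is only a minimal sketch and assumes the folder-per-object layout described above (rgba, pose, and intrinsics.npy under /root/OpenLRM/views, the same path used as root_dirs in the config further down):

import os

views_root = "/root/OpenLRM/views"  # same folder referenced by dataset.root_dirs below

# Each object folder should contain the outputs of blender_script.py
for uid in sorted(os.listdir(views_root)):
    folder = os.path.join(views_root, uid)
    if not os.path.isdir(folder):
        continue
    missing = [name for name in ("rgba", "pose", "intrinsics.npy")
               if not os.path.exists(os.path.join(folder, name))]
    if missing:
        print(f"{uid} is missing: {missing}")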

training

modified epochs (60 => 1000) and global_step_period (1000 => 100) in train-sample.yaml:


experiment:
    type: lrm
    seed: 42
    parent: lrm-objaverse
    child: small-dummyrun

model:
    camera_embed_dim: 1024
    rendering_samples_per_ray: 96
    transformer_dim: 512
    transformer_layers: 12
    transformer_heads: 8
    triplane_low_res: 32
    triplane_high_res: 64
    triplane_dim: 32
    encoder_type: dinov2
    encoder_model_name: dinov2_vits14_reg
    encoder_feat_dim: 384
    encoder_freeze: false

dataset:
    subsets:
        -   name: objaverse
            root_dirs:
                - "/root/OpenLRM/views"
            meta_path:
                train: "/root/OpenLRM/train_uids.json"
                val: "/root/OpenLRM/val_uids.json"
            sample_rate: 1.0
    sample_side_views: 3
    source_image_res: 224
    render_image:
        low: 64
        high: 192
        region: 64
    normalize_camera: true
    normed_dist_to_center: auto
    num_train_workers: 4
    num_val_workers: 2
    pin_mem: true

train:
    mixed_precision: bf16  # REPLACE THIS BASED ON GPU TYPE
    find_unused_parameters: false
    loss:
        pixel_weight: 1.0
        perceptual_weight: 1.0
        tv_weight: 5e-4
    optim:
        lr: 4e-4
        weight_decay: 0.05
        beta1: 0.9
        beta2: 0.95
        clip_grad_norm: 1.0
    scheduler:
        type: cosine
        warmup_real_iters: 3000
    batch_size: 16  # REPLACE THIS (PER GPU)
    accum_steps: 1  # REPLACE THIS
    epochs: 1000  # REPLACE THIS
    debug_global_steps: null

val:
    batch_size: 4
    global_step_period: 100
    debug_batches: null

saver:
    auto_resume: true
    load_model: null
    checkpoint_root: ./exps/checkpoints
    checkpoint_global_steps: 1000
    checkpoint_keep_level: 5

logger:
    stream_level: WARNING
    log_level: INFO
    log_root: ./exps/logs
    tracker_root: ./exps/trackers
    enable_profiler: false
    trackers:
        - tensorboard
    image_monitor:
        train_global_steps: 100
        samples_per_log: 4

compile:
    suppress_errors: true
    print_specializations: true
    disable: true

the training result is as follows:

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.09it/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.19it/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.05it/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.23it/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.11it/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.12it/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.13it/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.14it/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.27it/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.03it/s]
[TRAIN STEP]loss=0.369, loss_pixel=0.033, loss_perceptual=0.335, loss_tv=3.03, lr=0.000133: 100%|█████████████████████████████████████████████████████| 1000/1000 [1:26:39<00:00,  5.78s/it]
root@b5f5ee77bf34:~/OpenLRM# 
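
A note on the numbers in this log: with only 80 training objects, a per-GPU batch size of 16, and 3 GPUs, each epoch yields a single optimizer step, so 1000 epochs end up as the 1000 total steps shown above, and the validation loops only run for 2 iterations. This is a rough back-of-the-envelope calculation and assumes data-parallel sharding with incomplete batches dropped:

import math

train_objects = 80        # 80% of the 100 glb files
val_objects = 20          # remaining 20%
gpus = 3                  # A100 SXM 80GB x3
train_batch_per_gpu = 16  # train.batch_size in train-sample.yaml
val_batch_per_gpu = 4     # val.batch_size

steps_per_epoch = train_objects // (train_batch_per_gpu * gpus)  # 1 step per epoch
total_steps = steps_per_epoch * 1000                             # 1000 steps, matching 1000/1000 above
val_iters = math.ceil(val_objects / (val_batch_per_gpu * gpus))  # 2 iterations, matching the 2/2 bars
print(steps_per_epoch, total_steps, val_iters)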

converted the generated checkpoint into a huggingface-compatible model using the following command:

python scripts/convert_hf.py --config ./configs/train-sample.yaml  convert.global_step=null

successfully generated the model:

root@b5f5ee77bf34:~/OpenLRM# python scripts/convert_hf.py --config ./configs/train-sample.yaml  convert.global_step=null
/root/OpenLRM/./openlrm/models/encoders/dinov2/layers/swiglu_ffn.py:43: UserWarning: xFormers is available (SwiGLU)
  warnings.warn("xFormers is available (SwiGLU)")
/root/OpenLRM/./openlrm/models/encoders/dinov2/layers/attention.py:27: UserWarning: xFormers is available (Attention)
  warnings.warn("xFormers is available (Attention)")
/root/OpenLRM/./openlrm/models/encoders/dinov2/layers/block.py:39: UserWarning: xFormers is available (Block)
  warnings.warn("xFormers is available (Block)")
Downloading: "https://dl.fbaipublicfiles.com/dinov2/dinov2_vits14/dinov2_vits14_reg4_pretrain.pth" to /root/.cache/torch/hub/checkpoints/dinov2_vits14_reg4_pretrain.pth
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 84.2M/84.2M [00:00<00:00, 192MB/s]
Loading from exps/checkpoints/lrm-objaverse/small-dummyrun/000100/model.safetensors
Saving locally to exps/releases/lrm-objaverse/small-dummyrun/step_000100
root@b5f5ee77bf34:~/OpenLRM# 
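
Before inference, it also helps to check which release steps actually exist and that the converted weights load. This is a minimal sketch, assuming the release layout from the log above and that each step folder contains a model.safetensors file:

import os
from safetensors.torch import load_file

release_root = "./exps/releases/lrm-objaverse/small-dummyrun"
steps = sorted(os.listdir(release_root))
print("available releases:", steps)

# assuming each release folder holds a model.safetensors file
state_dict = load_file(os.path.join(release_root, steps[-1], "model.safetensors"))
print(len(state_dict), "tensors,", sum(t.numel() for t in state_dict.values()), "parameters")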

inference

started inference on the trained model:

EXPORT_VIDEO=true
EXPORT_MESH=true
MODEL_NAME="./exps/releases/lrm-objaverse/small-dummyrun/step_001000"

python -m openlrm.launch infer.lrm --infer "./configs/infer-l.yaml" model_name=$MODEL_NAME image_input="./assets/sample_input/test.png" export_video=$EXPORT_VIDEO export_mesh=$EXPORT_MESH

the input image is one of the multi-view images from the data used for training:

and the output ply looks like this:

Why is the result quality so bad, and what should I do to improve it? I'm new to AI and don't know how to properly produce checkpoints by modifying the config files. Should I increase the amount of data? Should I train for more than 1000 epochs, or fewer? I'm not sure where to start, so any advice would be a great help. Thank you in advance :)

hayoung-jeremy avatar Apr 18 '24 08:04 hayoung-jeremy

With only 100 training examples, you should just try using the pretrained models. There are steps for how to use them for inference in the readme.

Training a model from scratch requires a lot of data - in the original LRM paper they used something like 700k unique objects.

SamBahrami avatar Apr 19 '24 10:04 SamBahrami

Hi, when I try to overfit a single object and train for 100 epochs as a simple test, the output video is blank. I'm pretty confused, do you have the same issue? I didn't try the mesh because of this: https://github.com/3DTopia/OpenLRM/issues/28#issuecomment-2067842998. I would greatly appreciate any advice you can provide!

JINNMnm avatar Apr 23 '24 05:04 JINNMnm

Thank you for the reply @SamBahrami. I want to try finetuning the pretrained model, but I don't know how to set up the configs properly. There don't seem to be any parameters for specifying a pretrained model in configs/train-sample.yaml or configs/accelerate-train.yaml. If I want to finetune LRM's pretrained model on my small dataset, how can I do that? Thank you for your help in advance.

hayoung-jeremy avatar Apr 23 '24 06:04 hayoung-jeremy

Hi, when I try to overfit a single object and train for 100 epochs as a simple test, the output video is blank. I'm pretty confused, do you have the same issue? I didn't try the mesh because of #28 (comment). I would greatly appreciate any advice you can provide!

Try training for even longer. If your dataset is really tiny, it may take a long time to converge to anything at all. Try like 10000 epochs or something and see if that overfits. Also consider setting the perceptual loss lower, something like 0.2. I got some decent results overfitting on 2 objects with that kind of setup.

Thank you for the reply @SamBahrami. I want to try finetuning the pretrained model, but I don't know how to set up the configs properly. There don't seem to be any parameters for specifying a pretrained model in configs/train-sample.yaml or configs/accelerate-train.yaml. If I want to finetune LRM's pretrained model on my small dataset, how can I do that? Thank you for your help in advance.

I haven't tried finetuning, not sure how to do that within this codebase. Have you tried the base model itself without finetuning?

SamBahrami avatar Apr 23 '24 07:04 SamBahrami

Thanks for your advice @SamBahrami! I'll try the setup right now:)

JINNMnm avatar Apr 23 '24 11:04 JINNMnm

Hi @SamBahrami, I think I found a way to load a base model and finetune on it. As you can see below, the configs/train-sample.yaml file has a load_model parameter in the saver section:

...

val:
    batch_size: 4
    global_step_period: 1000
    debug_batches: null

saver:
    auto_resume: true
    load_model: null # modify here such as "/root/OpenLRM/base_models/model.safetensors"
    checkpoint_root: ./exps/checkpoints
    checkpoint_global_steps: 1000
    checkpoint_keep_level: 5
...

So I manually downloaded the model.safetensors file from huggingface (e.g. openlrm-mix-large-1.1) and placed it at the following path: /root/OpenLRM/base_models/model.safetensors.
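
If you prefer to script this step, the download and the matching hyperparameters can be fetched with huggingface_hub. This is only a sketch, and it assumes the weights are hosted under the repo id zxhezexin/openlrm-mix-large-1.1 with a config.json next to them; adjust the repo id to whichever base model you use:

import json
import shutil
from huggingface_hub import hf_hub_download

repo_id = "zxhezexin/openlrm-mix-large-1.1"  # assumed repo id, change to your base model

# download the weights and place them where saver.load_model points
weights = hf_hub_download(repo_id=repo_id, filename="model.safetensors")
shutil.copy(weights, "/root/OpenLRM/base_models/model.safetensors")

# print the base model's config.json, whose values the model section below must match
config_path = hf_hub_download(repo_id=repo_id, filename="config.json")
with open(config_path) as f:
    print(json.dumps(json.load(f), indent=4))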

Also, you have to modify the model section of the configs/train-sample.yaml file to match the config.json from the huggingface model (below is an example case):

...

# adjusted all the parameters from model section below :
model:
    camera_embed_dim: 1024
    rendering_samples_per_ray: 128
    transformer_dim: 1024
    transformer_layers: 16
    transformer_heads: 16
    triplane_low_res: 32
    triplane_high_res: 64
    triplane_dim: 80
    encoder_type: dinov2
    encoder_model_name: dinov2_vitb14_reg
    encoder_feat_dim: 768
    encoder_freeze: false

dataset:
    subsets:
        -   name: objaverse
            root_dirs:
                - "/root/OpenLRM/views"
            meta_path:
                train: "/root/OpenLRM/train_uids.json"
                val: "/root/OpenLRM/val_uids.json"
            sample_rate: 1.0
    sample_side_views: 3
    source_image_res: 224
    render_image:
        low: 64
        high: 192
        region: 64
    normalize_camera: true
    normed_dist_to_center: auto
    num_train_workers: 4
    num_val_workers: 2
    pin_mem: true

train:
    mixed_precision: bf16 
    find_unused_parameters: false
    loss:
        pixel_weight: 1.0
        perceptual_weight: 1.0
        tv_weight: 5e-4
    optim:
        lr: 4e-4
        weight_decay: 0.05
        beta1: 0.9
        beta2: 0.95
        clip_grad_norm: 1.0
    scheduler:
        type: cosine
        warmup_real_iters: 3000
    batch_size: 2  # I decreased it from 16 to 2, since my GPU instance's memory is not sufficient (L40S x6)
    accum_steps: 1 
    epochs: 1000  # I've tried 1000 epochs
    debug_global_steps: null

val:
    batch_size: 4
    global_step_period: 1000
    debug_batches: null

saver:
    auto_resume: true
    load_model: "/root/OpenLRM/base_models/model.safetensors"
    checkpoint_root: ./exps/checkpoints
    checkpoint_global_steps: 1000
    checkpoint_keep_level: 5

...

Now I'm trying finetuning based on the pretrained model! I'll share the result when it finished.

hayoung-jeremy avatar Apr 25 '24 07:04 hayoung-jeremy

Below are the results of inference with the model finetuned from OpenLRM's base model (openlrm-mix-large-1.1).

  • I've trained with 400 data pairs, which are the original 100 pairs copied three more times.
  • trained on Runpod's L40S x6 instance
  • inference tested on Runpod's A100 SXM instance

The images show the ground truth model (left), the inference result from the base model (middle), and the result from the fine-tuned model (right). All of the input images used for inference were taken from the training data:

[comparison images]

I'm planning to try more epochs to overfit the model, since I cannot increase the number of data pairs right now. Thank you for your great help, @SamBahrami!

hayoung-jeremy avatar Apr 25 '24 08:04 hayoung-jeremy

How can I make my own dataset?

wensir66666 avatar May 18 '24 06:05 wensir66666

@hayoung-jeremy Hello! I plan to train the model, but since I am new to 3D content, could you share your training steps, such as how you made train_uids.json and val_uids.json? I have many pictures but don't know how to turn them into a dataset, and I hope to get your help.

Mrguanglei avatar May 22 '24 03:05 Mrguanglei

Hi @Mrguanglei and @wensir66666, below are the steps I've tried:

  1. prepare the dataset using the blender script. You must first install Blender, then run the script as follows; it will automatically create a views folder containing your data:

    blender -b -P scripts/data/objaverse/blender_script.py -- --object_path ./path/to/your/glb
    

    Below is the shell script I've used to run the blender script over all of my glb files:

    #!/bin/bash
    
    # Record the start time and convert to date and time
    start_time=$(date +%s)
    start_date=$(date)
    
    DIRECTORY="./data"
    for glb_file in $DIRECTORY/*.glb; do
      echo "Processing $glb_file"
      blender -b -P scripts/data/objaverse/blender_script.py -- --object_path $glb_file
    done
    
    # Record the end time and convert to date and time
    end_time=$(date +%s)
    end_date=$(date)
    
    # Calculate the total duration
    elapsed=$((end_time - start_time))
    
    # Convert the total duration to hours, minutes, and seconds
    hours=$((elapsed / 3600))
    minutes=$(( (elapsed % 3600) / 60))
    seconds=$((elapsed % 60))
    
    # Print the execution results
    echo "Start time: $start_date"
    echo "End time: $end_date"
    echo -n "Total time elapsed: "
    
    # Print only if hours, minutes, or seconds are not zero
    if [ $hours -gt 0 ]; then
      echo -n "$hours hours "
    fi
    
    if [ $minutes -gt 0 ] || [ $hours -gt 0 ]; then # Display minutes if there are hours
      echo -n "$minutes minutes "
    fi
    
    if [ $seconds -gt 0 ] || [ $minutes -gt 0 ] || [ $hours -gt 0 ]; then # Display seconds if there are minutes
      echo "$seconds seconds"
    fi
    
    echo ""
    
  2. customize the config files (train-sample.yaml, accelerate-train.yaml) based on your environment (how many GPUs you're using, whether you use a base model or not, train steps, epochs, batch sizes, etc.). Below is my case; I've added comments to all the parts I modified.

    # train-sample.yaml
    
    experiment:
        type: lrm
        seed: 42
        parent: lrm-objaverse
        child: small-dummyrun
    
    model:
        camera_embed_dim: 1024
        rendering_samples_per_ray: 96
        transformer_dim: 512
        transformer_layers: 12
        transformer_heads: 8
        triplane_low_res: 32
        triplane_high_res: 64
        triplane_dim: 32
        encoder_type: dinov2
        encoder_model_name: dinov2_vits14_reg
        encoder_feat_dim: 384
        encoder_freeze: false
    
    dataset:
        subsets:
            -   name: objaverse
                root_dirs:
                    - "/root/OpenLRM/views" # it will be the path to your dataset folder
                meta_path:
                    train: "/root/OpenLRM/train_uids.json" # you have to create your own json files, I have described how below
                    val: "/root/OpenLRM/val_uids.json" # you have to create your own json files, I have described how below
                sample_rate: 1.0
        sample_side_views: 3
        source_image_res: 224
        render_image:
            low: 64
            high: 192
            region: 64
        normalize_camera: true
        normed_dist_to_center: auto
        num_train_workers: 4
        num_val_workers: 2
        pin_mem: true
    
    train:
        mixed_precision: bf16 
        find_unused_parameters: false
        loss:
            pixel_weight: 1.0
            perceptual_weight: 1.0
            tv_weight: 5e-4
        optim:
            lr: 4e-4
            weight_decay: 0.05
            beta1: 0.9
            beta2: 0.95
            clip_grad_norm: 1.0
        scheduler:
            type: cosine
            warmup_real_iters: 3000
        batch_size: 4  # REPLACE THIS (PER GPU), I've modified it from 16 to 4
        accum_steps: 1 
        epochs: 1000  # REPLACE THIS, I've modified it from 60 to 1000
        debug_global_steps: null
    
    val:
        batch_size: 4
        global_step_period: 1000
        debug_batches: null
    
    saver:
        auto_resume: true
        load_model: null # If you want to load a base model, describe it here
        checkpoint_root: ./exps/checkpoints
        checkpoint_global_steps: 1000
        checkpoint_keep_level: 5
    
    logger:
        stream_level: WARNING
        log_level: INFO
        log_root: ./exps/logs
        tracker_root: ./exps/trackers
        enable_profiler: false
        trackers:
            - tensorboard
        image_monitor:
            train_global_steps: 100
            samples_per_log: 4
    
    compile:
        suppress_errors: true
        print_specializations: true
        disable: true
    
    # accelerate-train.yaml
    
    compute_environment: LOCAL_MACHINE
    debug: false
    distributed_type: MULTI_GPU
    downcast_bf16: 'no'
    gpu_ids: all
    machine_rank: 0
    main_training_function: main
    mixed_precision: bf16
    num_machines: 1
    num_processes: 8 # replace this with the number of your GPUs, I was using an instance with 8 GPUs
    rdzv_backend: static
    same_network: true
    tpu_env: []
    tpu_use_cluster: false
    tpu_use_sudo: false
    use_cpu: false
    
    

    Below is how I created the JSON files for training and validation. I split the view folders 80% for training and 20% for validation:

    import os
    import json
    import random
    
    # collect the per-object view folders generated by blender_script.py
    directory_list = os.listdir('./views')
    directories = [d for d in directory_list if os.path.isdir(os.path.join('./views', d))]
    random.shuffle(directories)
    
    # 80% of the folder names go to training, the remaining 20% to validation
    split_index = int(0.8 * len(directories))
    train_dirs = directories[:split_index]
    val_dirs = directories[split_index:]
    
    with open('./train_uids.json', 'w') as f:
        json.dump(train_dirs, f, indent=4)
    
    with open('./val_uids.json', 'w') as f:
        json.dump(val_dirs, f, indent=4)
    
  3. then run the training command:

    accelerate launch --config_file ./configs/accelerate-train.yaml -m openlrm.launch train.lrm --config ./configs/train-sample.yaml
    

hayoung-jeremy avatar May 22 '24 05:05 hayoung-jeremy

Hi, if I want to train the small OpenLRM, what config should I use? Thanks!

ChendiDotLin avatar Oct 06 '24 07:10 ChendiDotLin